k²-tree

The k²-tree is a Web graph representation based on a compact tree structure that takes advantage of large empty areas of the adjacency matrix of the graph. The representation not only retrieves successors and predecessors in symmetric fashion, but also it is particularly efficient to check for specific links between nodes, or between ranges of nodes, or to list the links between ranges. The representation is also of general interest and can be used to compress other kinds of graphs and data structures.

The original idea was published in SPIRE'09. More details can be found in Susana Ladra's PhD thesis. There is also a journal article, published in Information Systems, which contains the latest version of the approach.

In this page we provide two versions of the source code for the k²-tree. They are implemented in C. The hybrid approach corresponds to the version used in the journal article, with different k values and compression of leaf submatrices. The compression generates four different files (.tr, .lv, .cil, .voc) that composed the final compressed representation. The basic approach uses with k=2 for all the levels, only uses one bitmap for representing the whole tree, and the construction process diverges (uses a counting-sort strategy to compute the bitmaps for each level). In this case only a output file is generated (.kt).

Basic k²-tree: source code.
Hybrid k²-tree, with vocabulary of leaves: source code.

Usage

To compress a graph, you have to run the command

./build_tree <GRAPH> <name> <K1> <K2> [<max level K1>] <S>

and then

./compress_leaves <name> <hash size>

(where the second parameter must be established accordingly to the size and properties of the graph and the RAM size of the machine. For instance, a possible value can be 4000000)

<GRAPH> is the file containing the graph in binary format. This format consists in one integer with the number of nodes, one double with the number of edges, and then the adjacency list of each node, where we mark with a negative number the start of the next list. That is, we write -1, then the neighbors of the first node, then -2, the list of neighbors of the second node, and so on.

For instance, the first integers of the cnr-2000 graph in binary format will be: 325557 3216152 0 - 1 2 343 344 345 346 347 348 349 350 351 352 211285 223143 -2 3 4 5 320 -3 211285 223143 - 4 -5 318 -6 -7 118 219 297 -8 7 19 219 286 297 -9 7

While the numbers of this file start in 1, my implementation consider the numbering starting in 0. Hence, if I ask for the successor list of node 6, it will return 117, 218,296 . The successor list of node 5 is null.

Parameters:

<name> is the basename of the output files. This process will generate 4 files: name.tr, name.lv, name.cil, name.voc, and that is the compressed representation of the graph (it already includes extra structures for rank over bits, which permits random navigation over the graph, hence, that is practically the main memory used. It is necessary the extra array to decompress). This construction strategy generates a temporary file (a huge file) called name.il that can be erased after using ./compress_leaves
<K1> <K2> are the k values for the hybrid approach, you must set them to 4 and 2 respectively.
<S> is used to speed-up the construction (or to allow build bigger graphs on machines with few RAM). You can set this parameter to 22 if the graph is large.

Once the compressed graph is built, you can operate with it (use_tree).

The original binary graph can be rebuild with ./rebuild_tree

You can also rebuild the graph using ./checkrebuild_tree (it is extremely slow, because it verifies if each node is connected with the rest to rebuild the graph, but it can be used to test this operation) or ./rangerebuild_tree (the same, but using range queries to rebuild the graph. Much faster than checking individual links). You can also rebuild the transpose graph with ./invrebuild_tree or test the speed of successors and predecessors with ./test_tree or ./revtest_tree respectively (you must pass as parameter a binary file where the first integer indicates the number of queries to process and the following ones indicate the id of the node for which you want to retrieve the successors or the predecessors).

Notice that for k²-tree queries node ids range from 0 to n-1.

Acknowledgements

If you use this source code, please quote the journal article as follows:


@ARTICLE {
        BLNis13.2,
	AUTHOR = "N. Brisaboa and S. Ladra and G. Navarro",
	TITLE = "Compact Representation of Web Graphs with Extended 
		Functionality",
        JOURNAL = {Information Systems},
	VOLUME = 39,
	NUMBER = 1,
	PAGES = "152--174",
        YEAR = 2014
        }

About us

We are two researchers from the University of A Coruña (Spain) and one from the University of Chile (Chile).

Nieves R. Brisaboa (phD). Database Lab. UDC
Susana Ladra (phD). Database Lab. UDC
Gonzalo Navarro (phD). Computer Science Department. UChile

This work was supported in part by MICINN (PGE and FEDER) grants TIN2009-14560-C03-02 and Xunta de Galicia (co-funded with FEDER) grants 2010/17 and CN 2012/211, and by Millennium Nucleus Information and Coordination in Networks ICM/FIC P10-024F.

k2-tree

k²-tree