WEBSPAM-UK2007: Graph of UK-2007

Hosts and hostgraph (114K hosts)

The hostgraph summarizes the URL-to-URL links by collapsing all links between pages on two different hosts into a single weighted link between those hosts. The collection contains 114,529 different hosts, listed in the following file and numbered from 0 to 114,528:

The graph is formatted as follows: the first line contains the number of hosts (114,529). The second line contains the out-links of host 0, the third line the out-links of host 1, and so on. Each line is of the form "dest1:nlinks1 dest2:nlinks2 ... destk:nlinksk", in which dest is the destination host id and nlinks is the number of page-to-page links between the two hosts.
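The hostgraph format described above can be read with a short parser. The following is a minimal sketch in Python, not code distributed with the collection: the function name `read_hostgraph` and the sample data are hypothetical, and it assumes the host count on the first line is written as a plain integer and that entries are separated by whitespace (commas are tolerated as well).

```python
import io
import re
from typing import Dict, Iterator, TextIO, Tuple


def read_hostgraph(stream: TextIO) -> Iterator[Tuple[int, Dict[int, int]]]:
    """Yield (source_host_id, {dest_host_id: nlinks}) for each host."""
    # First line: total number of hosts (assumed to be a plain integer).
    num_hosts = int(stream.readline().strip())
    for src in range(num_hosts):
        # Line src + 2 of the file holds the out-links of host src.
        line = stream.readline()
        out_links: Dict[int, int] = {}
        # Split on whitespace and/or commas; an empty line means no out-links.
        for entry in re.split(r"[\s,]+", line.strip()):
            if not entry:
                continue
            dest, nlinks = entry.split(":")
            out_links[int(dest)] = int(nlinks)
        yield src, out_links


# Tiny made-up sample with 3 hosts (not real data from the collection).
sample = io.StringIO("3\n1:4 2:1\n\n0:7\n")
graph = dict(read_hostgraph(sample))
print(graph)
```

Because the function is a generator, the real file can be processed one host at a time by passing an open file object instead of the in-memory sample, without holding the whole hostgraph in memory.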

URLs and webgraph (105M URLs)

The Web graph of this collection consists of 105,896,555 nodes representing pages, connected by approximately 3.7 billion edges representing hyperlinks. The file with the URLs contains one URL per line, numbered from 0 (the first URL) to 105,896,554 (the last URL). The URLs are sorted lexicographically to increase the compression ratio when using the Boldi-Vigna (BV) compression technique.
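Since the zero-based position of a line in the URL file is the URL number, node ids can be mapped back to URLs in a single pass over the file. The sketch below illustrates this in Python; the function name `urls_for_ids` and the sample URLs are hypothetical, and the file is assumed to be already uncompressed, with one URL per line.

```python
import io
from typing import Dict, Iterable, TextIO


def urls_for_ids(stream: TextIO, wanted: Iterable[int]) -> Dict[int, str]:
    """Return {url_id: url} for the requested ids, reading the file once."""
    remaining = set(wanted)
    found: Dict[int, str] = {}
    if not remaining:
        return found
    # The zero-based line index is the URL number.
    for url_id, line in enumerate(stream):
        if url_id in remaining:
            found[url_id] = line.rstrip("\n")
            remaining.discard(url_id)
            if not remaining:
                break  # stop early once every requested id is resolved
    return found


# Made-up sample of three URLs (not taken from the collection).
sample_urls = io.StringIO(
    "http://a.example.co.uk/\n"
    "http://a.example.co.uk/x\n"
    "http://b.example.co.uk/\n"
)
result = urls_for_ids(sample_urls, [0, 2])
print(result)
```

Streaming avoids loading all 105,896,555 URLs into memory; ids that are past the end of the file are simply absent from the returned dictionary.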

The graph is provided as a compressed adjacency list, using the BV format. A set of Java classes is provided to access the data; you can download the code from the WebGraph Framework site. You can also try the (alpha-stage) C++ port of the WebGraph Framework.

The following three files must be downloaded, placed in the same directory, and then uncompressed before use:

... as plain text

If you want the full graph in plain text format, you can use the class BV2Ascii.java to convert the BV graph, but we strongly recommend working with the graph in compressed form, as it is much faster to read.

For inquiries, contact Carlos Castillo.