
URLs and hyperlinks of UK-2006

Hostgraph (host to host links)

The hostgraph summarizes the URL-to-URL links by collapsing all links between pages on two different hosts into a single weighted link between those hosts. The collection contains 11,402 different hosts, listed in the following file and numbered from 0 to 11,401:

The graph is formatted as follows: "src -> dest1:nlinks1 dest2:nlinks2, ..., destk:nlinksk", where src is the source host, each dest is a destination host, and each nlinks is the number of page-to-page links from the source host to that destination host.
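As an illustration of the line format described above, here is a minimal parsing sketch in Python. It assumes that src and dest are the numeric host identifiers (0 to 11,401) and that destinations may be separated by spaces, commas, or both; the function name parse_hostgraph_line is hypothetical and not part of the dataset.

```python
def parse_hostgraph_line(line):
    """Parse one hostgraph line into (src, {dest: nlinks}).

    Assumes numeric host IDs; this is an assumption, not a documented guarantee.
    """
    src_part, _, dest_part = line.partition("->")
    src = int(src_part.strip())
    links = {}
    # Treat commas as whitespace so either separator style is accepted.
    for token in dest_part.replace(",", " ").split():
        dest, _, nlinks = token.partition(":")
        links[int(dest)] = int(nlinks)
    return src, links


if __name__ == "__main__":
    # Made-up example line, not taken from the real hostgraph file.
    print(parse_hostgraph_line("3 -> 7:2 10:15"))
```

A host with no outgoing links yields an empty dictionary under this sketch.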

Web graph (URL to URL links)

The Web graph of this collection consists of roughly 77 million nodes representing pages, connected by approximately 3 billion edges representing hyperlinks. The URLs of the pages are provided in a text file compressed with gzip:
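With about 77 million URLs, the list is best read as a stream rather than loaded into memory. The sketch below assumes the file holds one URL per line and that the zero-based line number corresponds to the node identifier in the graph; both are assumptions to verify against the actual file, and the path passed in is a placeholder.

```python
import gzip


def iter_urls(path):
    """Yield (node_id, url) pairs from a gzip-compressed URL list.

    Assumes one URL per line, with node IDs equal to zero-based line numbers.
    """
    # errors="replace" avoids aborting on any byte that is not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for node_id, line in enumerate(f):
            yield node_id, line.rstrip("\n")
```

Because this is a generator, a caller can stop after the first few entries without decompressing the whole file.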

The graph is provided as a compressed adjacency list in the BV format. A set of Java classes is available for accessing the data; you can download the code from the WebGraph Framework site. You can also try the (alpha-stage) C++ port of the WebGraph Framework.

The following two files must be downloaded and placed under the same directory:

... as plain text

If you want the full graph in plain text format, you can use the class BV2Ascii.java to convert the BV graph, but we strongly recommend working with the graph in compressed form, as it is much faster to read.

Change history

October 14, 2011: updated to host local copies of the graph files from http://law.di.unimi.it/webdata/uk-2006-05/. No "offsets" file is included; you can generate it with java it.unimi.dsi.webgraph.BVGraph -O uk-2006-05-nat.

December 1, 2006: updated to remove spurious self-loops. By mistake, every node with out-degree zero had an artificial self-loop inserted in the graph; those artificial self-loops have been removed. If this is important for your algorithms, you should download this graph again.

October 23, 2006: updated to be compatible with the text contents. The changes were minor, but you must use this version of the graph to be consistent with the text repository. The old graph is still available for reference purposes.

For inquiries contact Carlos Castillo