Getting the contents of WEBSPAM-UK2007

The contents of UK-2007 are provided in WARC format, a standard proposed by the Internet Archive. You can see an example of what the data looks like. Versions 1.3 and later of the LAW library include methods for reading these records from Java.

We are providing a summary version of the data:

                  Full version               Summary version
Contents          All pages:                 Up to 400 pages per host:
                  105M pages                 12M pages
Size              560 GB compressed          46 GB compressed
                                             200 GB uncompressed
Physical format   8 files of ~70 GB each     8 files of ~6 GB each
                  (not available)            available on-line
                                             (password required)

The summary version is the first 400 crawled pages for each site, in the order in which they were crawled, which is similar to a breadth-first visit.
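The selection rule described above can be illustrated with a minimal sketch. This is not the script used to build the collection; it only assumes that page URLs arrive in crawl order and that a "site" is identified by the host name of the URL. The function name summarize and the parameter per_host_limit are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def summarize(crawled_urls, per_host_limit=400):
    """Yield URLs in crawl order, keeping at most per_host_limit per host.

    Assumption: crawled_urls is already sorted by crawl time, so the first
    pages seen for a host are the first pages that were crawled for it.
    """
    pages_kept = defaultdict(int)  # host name -> pages kept so far
    for url in crawled_urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip malformed entries that carry no host
        if pages_kept[host] < per_host_limit:
            pages_kept[host] += 1
            yield url
```

Because the limit is applied per host while the crawl order is preserved, hosts with fewer than 400 pages are kept whole and larger hosts are truncated.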

How to get a password

  1. Make sure that you have the correct graph having exactly 105,896,555 nodes (you can look at the .properties file of the graph to check this).
  2. E-mail the following agreement.
  3. We will contact you by e-mail to give you the user/password combination for downloading the summary collection shortly after receiving your message.
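The check in step 1 can be scripted. The following is a sketch under one assumption: that the graph's .properties file is a plain key=value text file storing the node count under a key named nodes, as WebGraph .properties files typically do. The helper names read_properties and has_expected_nodes are hypothetical.

```python
EXPECTED_NODES = 105_896_555  # node count stated in step 1


def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def has_expected_nodes(text, expected=EXPECTED_NODES):
    """Return True if the properties text declares exactly `expected` nodes."""
    value = read_properties(text).get("nodes")  # assumed key name
    return value is not None and value.isdigit() and int(value) == expected
```

To use it, read the contents of your graph's .properties file into a string and pass that string to has_expected_nodes; a False result suggests you have a different version of the graph.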
For inquiries, contact Carlos Castillo.