
How to get the contents of UK-2006

The contents of UK-2006 are provided in WARC format, a standard proposed by the Internet Archive. You can see an example of what the data looks like.
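
To give a rough idea of how such a file can be consumed, here is a minimal Python sketch of a record reader. It assumes the later standardized WARC/1.0 record layout (a version line, "Name: value" header lines, a blank line, then Content-Length bytes of payload); the files in this collection may follow an earlier draft of the format, so compare against the example before relying on it. The function name read_warc_records is hypothetical.

```python
import io


def read_warc_records(stream):
    """Yield (version, headers, body) for each record in a binary stream.

    Assumption: WARC/1.0-style records. Header names are lower-cased.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        version = line.strip().decode("utf-8", "replace")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or end of stream) ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly the number of payload bytes announced by the header.
        length = int(headers.get("content-length", 0))
        body = stream.read(length)
        yield version, headers, body


# Tiny in-memory example with a made-up record.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://www.example.co.uk/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
)
for version, headers, body in read_warc_records(io.BytesIO(sample)):
    print(version, headers["warc-target-uri"], len(body))
```

For a gzip-compressed file, the same function could be given the object returned by gzip.open(path, "rb") from the Python standard library instead of the in-memory stream.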

We are providing two versions of the data:

                  Full version              Summary version
Contents          All 77M pages             Up to 400 pages per host,
                                            obtained by breadth-first search
Compressed size   420 GB                    15 GB
Physical format   8 files of ~50 GB each    8 files of ~2 GB each
                  (not available)           (not available)
                                            (see instructions below)

The summary version is the first 400 crawled pages for each site (in crawl order). If we entered the site from its home page, this coincides with a breadth-first visit from the home page. If we entered at some other point, the breadth-first visit started from that point. It is very likely that a link to the home page appears in the entry point or close to it, so almost all the summary versions will contain the home page of the host.
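
The selection rule described above can be sketched as follows. This is only an illustration, not the code used to build the collection: it assumes the pages are available as a list of URLs in crawl order, and the function name summarize and its parameters are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def summarize(urls_in_crawl_order, max_pages_per_host=400):
    """Keep the first max_pages_per_host URLs of every host, in crawl order."""
    kept = []
    pages_per_host = defaultdict(int)
    for url in urls_in_crawl_order:
        host = urlsplit(url).hostname
        if pages_per_host[host] < max_pages_per_host:
            pages_per_host[host] += 1
            kept.append(url)
    return kept


# Example with a limit of 2 pages per host instead of 400.
crawl = [
    "http://a.example.co.uk/1",
    "http://b.example.co.uk/",
    "http://a.example.co.uk/2",
    "http://a.example.co.uk/3",
]
print(summarize(crawl, max_pages_per_host=2))
```

Because the input is scanned once in crawl order, the pages kept for a host are exactly the first ones the crawler fetched from it, whatever the entry point was.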

Instructions

  1. Make sure that you have the correct graph, which has exactly 77,741,046 nodes (you can check this in the .properties file of the graph).

  2. E-mail the following agreement.
  3. We will contact you by e-mail to give you the user/password combination for downloading the summary collection shortly after receiving your message.
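
The check in step 1 can be automated with a few lines of Python. This sketch assumes that the .properties file is a plain key=value text file and that the node count is stored under a key named nodes; both are assumptions to verify against your copy. The function names and the file name in the usage note are hypothetical.

```python
EXPECTED_NODES = 77_741_046  # node count stated in step 1


def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def has_expected_node_count(text, expected=EXPECTED_NODES):
    """Return True if the (assumed) 'nodes' key equals the expected count."""
    nodes = read_properties(text).get("nodes")
    return nodes is not None and nodes.isdigit() and int(nodes) == expected


# In-memory example; a real check would read the graph's .properties file.
print(has_expected_node_count("nodes=77741046\n"))
```

With a file on disk, the text could be obtained with open("uk-2006.properties", encoding="utf-8").read(), where the file name is a placeholder for the actual one.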

For inquiries, contact Carlos Castillo.