The contents of UK-2007 are provided in WARC format, a standard proposed by The Internet Archive. You can see an example of what the data looks like. The LAW library (version 1.3 and later) includes methods for reading these records from Java.
We are providing a summary version of the data:
| | Full version | Summary version |
|---|---|---|
| Contents | All pages: 105M pages | Up to 400 pages per host: 12M pages |
| Size | 560 GB compressed | 46 GB compressed, 200 GB uncompressed |
| Physical format | 8 files of ~70 GB each (not available) | 8 files of ~6 GB each, available on-line (password required) |
The summary version contains the first 400 crawled pages of each host, in the order in which they were crawled, which is similar to a breadth-first visit.
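The selection rule described above can be illustrated with a short Python sketch. This is not the tool used to build the summary; it only assumes that the crawl is available as a sequence of URLs in crawl order and that "host" means the hostname part of each URL. The function name `summarize` and the parameter `limit` are hypothetical names introduced for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def summarize(urls, limit=400):
    """Yield URLs in their original (crawl) order, keeping at most
    `limit` URLs per host. Hypothetical helper, for illustration only."""
    kept = defaultdict(int)  # host -> number of pages kept so far
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip entries without a parseable host
        if kept[host] < limit:
            kept[host] += 1
            yield url


# Toy crawl order with two hosts and a per-host limit of 2.
crawl = [
    "http://a.example/1",
    "http://b.example/1",
    "http://a.example/2",
    "http://a.example/3",
]
print(list(summarize(crawl, limit=2)))
# ['http://a.example/1', 'http://b.example/1', 'http://a.example/2']
```

Because the filter only counts pages per host and never reorders them, the output preserves crawl order, which is why the pages kept for each host tend to be the ones closest to its entry points.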