The Web spam datasets on this site are provided to advance research on Web spam detection, thanks to a collaborative effort by a team of volunteers. These labels are intended for research purposes only. We advise you not to use these labels directly for search engine ranking or filtering. See licensing information.
If you use this data, we strongly recommend that you subscribe to our mailing list. New datasets, errata about the current datasets, and challenges and conferences related to Web spam are posted to this low-volume, announcements-only mailing list.
This is a large collection of hosts labeled as spam or nonspam by a group of volunteers. The base data is a set of 105,896,555 pages in 114,529 hosts in the .UK domain. The data was downloaded in May 2007 by the Laboratory of Web Algorithmics, Università degli Studi di Milano, with the support of the DELIS EU - FET research project.
The assessment was done by a group of volunteers (see credits), following the guidelines that were given for the assessment. If you use the WEBSPAM-UK2007 collection, you can cite it as:
"Web Spam Collections". http://chato.cl/webspam/datasets/ Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/. URLs retrieved MM YYYY.
The WEBSPAM-UK2007 data set is composed of three parts:
For the purpose of the Web Spam Challenge 2008, the labels were released in two sets. SET1, containing roughly 2/3 of the assessed hosts, was given for training, while SET2, containing the remaining 1/3, was held out for testing. The results are available.
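As an illustration of a 2/3 training and 1/3 testing partition, here is a minimal Python sketch. It is not the procedure used to build SET1 and SET2, which this page does not describe. The seeded random shuffle, the function name `split_labels`, and the toy host names are all assumptions made for the example.

```python
import random


def split_labels(labeled_hosts, train_fraction=2 / 3, seed=0):
    """Shuffle (host, label) pairs and cut them into a training and a test part.

    Assumption: a plain seeded random split. The actual SET1/SET2
    assignment for the challenge may have been made differently.
    """
    shuffled = list(labeled_hosts)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data; the real label files are not reproduced here.
hosts = [
    (f"host{i}.example.co.uk", "spam" if i % 5 == 0 else "nonspam")
    for i in range(9)
]
set1, set2 = split_labels(hosts)
print(len(set1), len(set2))
```

Fixing the seed keeps the partition reproducible, so a model trained on the first part is always evaluated on the same held-out hosts.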
Also available, as optional downloads, are a set of pre-computed features over the hosts of this collection and a list of comments for some hosts, written by the assessors during the evaluation.