Note that a newer dataset is available.
The Web spam datasets on this site are provided to advance research on Web spam detection, thanks to a collaborative effort by a team of volunteers. These labels are intended for research purposes only. We advise you not to use these labels directly for search engine ranking or filtering. See licensing information.
If you use this data, we strongly recommend that you subscribe to our mailing list. New datasets, errata for the current datasets, and challenges and conferences related to Web spam are posted to this low-volume, announcements-only mailing list.
This collection was obtained using a large set of .UK pages downloaded in May 2006 by the Laboratory of Web Algorithmics, Università degli Studi di Milano with the support of the DELIS EU - FET research project.
The assessment was done by a group of volunteers (see credits). These are the guidelines that were given for the assessment. A detailed description of the process can be found in this reference:
Carlos Castillo, Debora Donato, Luca Becchetti, Paolo Boldi, Massimo Santini and Sebastiano Vigna: "A Reference Collection for Web Spam". SIGIR Forum, Vol. 40, Num. 2, December 2006 [bibtex].
The data set is composed of three parts:
Additionally, feature vectors are available on the Web Spam Challenge website: feature vectors and text-based feature vectors.
Added in May 2009: additional labels for 152 spam hosts are available, contributed by Young-joo Chung, Masashi Toyoda and Masaru Kitsuregawa. The labels were obtained after inspection of large strongly connected components [download].