Datasets for Research on Web Spam Detection

The Web spam datasets on this site are provided to advance research on Web spam detection, thanks to a collaborative effort by a team of volunteers. The labels are intended for research purposes only. We advise you not to use them directly for search engine ranking or filtering.

Current dataset: WEBSPAM-UK2007

The WEBSPAM-UK2007 collection is based on a crawl of the .uk domain done in May 2007, and was labeled by a group of volunteers. The collection includes 114,529 hosts, of which 6,479 are labeled. See WEBSPAM-UK2007>>

Previous dataset: WEBSPAM-UK2006

The WEBSPAM-UK2006 collection is based on a crawl of the .uk domain done in May 2006, and was labeled by a group of volunteers and/or by domain-specific patterns such as .gov.uk or .ac.uk. The collection includes 11,402 hosts, of which 7,473 are labeled. See WEBSPAM-UK2006>>

For inquiries, contact Carlos Castillo.