Spam labeling guidelines (modified from those for the UK-2006 dataset).
|Sites that do Web Spam:||Typical Web Spam Aspects:|
Typical Cases of Web Spam (tag as: SPAM)
|Pages that are full of keywords, even if they include actual contents.|
|Top: keywords. Bottom: keywords and links.||Just links with many different keywords.||The archive of a mailing list, copied to produce more keywords.||Bottom: hidden text (white text on white background, use CTRL-A to select all text, to see this type of spam).|
|Pages that contain machine generated content or too many misspellings, even if they include actual contents.|
|Machine generated page with random keyword stuffing.||"Parked domain" on misspelled name "Crate and Barrel"||Machine generated text with several misspelled words.|
Pages that are only advertising, with very little content. This includes automatically generated pages designed to sell advertising.
This also includes sites that offer catalogs of products that are actually redirecting to other merchants, without providing extra value.
|Left: ads. Right: keywords.||Left: ads, with ratings probably auto-generated. Right: keywords.||Left: ads. Right: more links to other pages full of ads.||Fake search engine that always displays the same results, no matter what the query is.|
|Top and right: ads. Bottom: outdated "news" copied from news sources.||Fake search engine showing only ads for every query.||Another fake search engine showing only advertising.|
|Left: automatically generated links. Main: one line of text (probably copied) about the topic, and a page full of advertising.||Top navigation, left navigation, and basically all of the contents of this page are advertising.|
|Pages that automatically redirect users to an unrelated page, i.e., a page different from what is expected based on the URL, anchor text, and/or search result snippet.|
|This source URL automatically redirects to another URL when scripting is "enabled." Page with script disabled.||Page with script enabled.|
Pages with unrelated links, or exchanging links with too many different, unrelated partners.
Some of them may also be classified as BORDERLINE if they provide some extra value.
|Top: keywords and links. Bottom: links.||Bottom: links to Web sites made by this company. This may also be considered borderline, depending on the ratio of original content vs repetitive linking.||Left: links to unrelated sites. Bottom: keywords. This may also be considered borderline if the page also offers useful content.||Repetitive linking to the internal pages of the site.|
|Pages with unrelated/spam context, i.e., their in-links are all from editable content (blogs, forums, guestbooks, etc) and/or their in-link pages and neighboring pages in the same directory are spam, etc.||...|
Borderline cases (tag as: BORDERLINE)
|Pages that are heavily optimized for search engines or for selling advertising, but that also provide some content.|
|Links to several of its partner sites, but all the sites are topically related. Judge based on the (content for humans/content for search engines) ratio.||Heavily optimized porn site (but not all porn sites should be considered spam, unless they use spamming tricks).||One paragraph of text; the rest is only ads. The text is stolen from Wikipedia. If you can confirm that the main text is stolen, tag as SPAM. If you suspect this but cannot verify, tag as BORDERLINE.||One paragraph of text and the rest of the page is only advertising. The paragraph provides little added value and it might be argued that the Web site is spam. Tag this type of site as borderline if you are suspicious about the content/ads ratio.|
|Pages offering search engine optimization services, random link exchange, affiliate links, etc.|
|Page offering to participate in a link exchange program, as it is very likely part of a link farm.||Page offering original content, a few links to unrelated sites (this is a mild link-farm), and a link to affiliate programs (this is hard to detect, can be tagged as normal).||Left: links to various unrelated sites, plus links to participate in affiliate programs. This can be normal, borderline or spam depending on the actual content/link ratio.|
Web pages that are not spam (tag as: NORMAL)
|Pages that do not use Web spam tricks. Quality is not an issue here: these can be high-quality or low-quality resources. The important thing is that they do not use Web spam techniques.|
|Website with contents. Even when the title has many keywords, they are all related to news.||On-line forum. Even when this has lots of pages and little content per page, it provides a service to users.||Shopping catalog, as long as the products are sold by the same merchant (like Amazon does), or by different merchants (like eBay does). However, if the page is always acting as a facade/redirect to one specific store, then it is borderline or spam.||On-line directory. This has many links but this is OK if the links are carefully, manually selected by human editors.|
Web sites that cannot be classified (tag as: DON'T KNOW using the button "???")
If you cannot classify a Web site in any of the three categories listed above, use the "DON'T KNOW" (button ??? in the interface). This is a "null" assessment that will not be considered for the final label of a Web site. Use this for:
[Source] Webmasters should ...
[Source] Spam pages are ...
[Source] Webmasters should avoid ...
The interface for the classification is divided into three panels: the left-top panel is for classifying pages, the left-bottom panel is for displaying information about the selected host, and the right panel is for previsualization. The previsualization panel often shows a cached version of a Web site (the page as it was seen by the crawler when it was downloaded); this is indicated by a message like this:
Note (added after the assessment process): the cache was built using only the summary version of the collection (up to 400 pages per host), so not all pages in the collection were available from the cache. When a page was not in the cache, a "live" version of the page was shown.
When the cached and live versions do not match (i.e., the Web site has changed completely), base your decision mostly on the cached version, unless you suspect that the Web site is providing the crawler and the users different views for spam purposes (cloaking). If the Web site you are assessing redirects you to another site, also examine the target of the redirect to assess its spam intention.
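The rule above for choosing between the cached and live versions can be written as a small decision function. This is only a sketch of one reading of the guideline: the `HostViews` type, its field names, and the function `views_to_examine` are hypothetical and not part of the assessment interface, and comparing both views when cloaking is suspected is an assumption about what "unless" implies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostViews:
    """Hypothetical summary of what the assessor sees for one host."""
    cached_available: bool            # page was found in the cache
    versions_match: bool              # cached and live versions agree
    cloaking_suspected: bool          # crawler and users may get different views
    redirect_target: Optional[str] = None  # site the host redirects to, if any


def views_to_examine(v: HostViews) -> List[str]:
    """Return the views an assessor should look at, most important first."""
    if not v.cached_available:
        views = ["live"]              # no cached copy: only the live page is shown
    elif v.versions_match or not v.cloaking_suspected:
        views = ["cached"]            # default: decide mostly on the cached version
    else:
        views = ["cached", "live"]    # assumption: compare both when cloaking is suspected
    if v.redirect_target is not None:
        views.append("redirect target")  # also assess where the redirect leads
    return views
```

For example, a host whose cached and live versions differ, with no sign of cloaking but with a redirect, yields `["cached", "redirect target"]`.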
The left-top panel displays the user ID and has buttons to mark the Web site as "Normal," "Borderline," "Spam," or "???" (can't decide). It also allows you to include an optional comment for each host. This comment is not visible to other assessors during the assessment process.
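Of the four buttons, "???" records a null assessment that is not considered for the final label of a Web site. The sketch below illustrates that rule only. How the remaining labels are combined is not described in these guidelines, so the simple majority vote, the handling of ties, and the name `final_label` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional

NULL_LABEL = "???"  # the "DON'T KNOW" button


def final_label(assessments: Iterable[str]) -> Optional[str]:
    """Combine the assessments for one host, ignoring null ("???") ones.

    Assumption: simple majority vote among non-null labels.
    Returns None when no non-null label exists or when the top labels tie.
    """
    counts = Counter(a for a in assessments if a != NULL_LABEL)
    if not counts:
        return None                      # only null assessments, or none at all
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                      # tie: left undecided in this sketch
    return ranked[0][0]
```

With the assessments `["SPAM", "???", "SPAM", "NORMAL"]`, the null entry is dropped and the result is `"SPAM"`.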
The left-bottom panel displays the following information about the site:
There will be two evaluation phases: in the first phase, you will see the in-links and out-links of the hosts. In the second phase (after you complete your assessments), you will also see the labels for those linked hosts and you will be able to revise your assessments.
On the right panel, a preview of the Web site is shown. We display the home page of each host as it was seen at crawling time. If you continue navigating from there, you will be taken to the current version of the pages.
Use full-screen (F11 on Firefox or Explorer) to have more space for the classification interface. For inquiries, contact Carlos Castillo.