Guidelines for WEBSPAM-UK2007

Spam labeling guidelines (modified from those for the UK-2006 dataset).

General definition of Web spam:
«any deliberate action that is meant to trigger an unjustifiably favorable [ranking], considering the page's true value» (Gyöngyi and García Molina 2005).

When classifying, ask this question: are there aspects of this page that exist mostly to attract and/or redirect traffic? We are interested in detecting Web sites that employ spam techniques.

Typical aspects of sites that use Web spam:
  • Include aspects designed to attract/redirect traffic.
  • Almost always have commercial intent.
  • Rarely offer relevant content for users browsing them.
  • Include many unrelated keywords and links.
  • Use many keywords and punctuation marks such as dashes in the URL.
  • Redirect the user to another (usually unrelated) page.
  • Create many copies with substantially duplicate content.
  • Hide text by writing in the same color as the background of the page.
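One of the aspects listed above, many keywords and dashes in the URL, can be approximated with a simple count. The following is a minimal sketch only: the function names and the thresholds (`max_dashes`, `max_tokens`) are hypothetical and are not part of these guidelines, which rely on human judgment.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Count dashes and word-like tokens in the host name and path of a URL."""
    parsed = urlparse(url)
    text = parsed.netloc + parsed.path
    # Word-like tokens: runs of letters, split on dashes, dots, slashes, digits.
    tokens = re.findall(r"[a-zA-Z]+", text)
    return {"dashes": text.count("-"), "tokens": len(tokens)}


def looks_keyword_stuffed(url, max_dashes=3, max_tokens=8):
    """Flag a URL for closer inspection; the thresholds are illustrative assumptions."""
    features = url_features(url)
    return features["dashes"] > max_dashes or features["tokens"] > max_tokens
```

A flagged URL is only a hint that the page deserves a closer look; the label still depends on the page as a whole.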

Typical Cases of Web Spam (tag as: SPAM)

Pages that are full of keywords, even if they include actual contents.
Top: keywords. Bottom: keywords and links. Just links with many different keywords. The archive of a mailing list, copied to produce more keywords. Bottom: hidden text (white text on a white background; press CTRL-A to select all text and reveal this type of spam).
 
Pages that contain machine-generated content or too many misspellings, even if they include actual contents.
Machine-generated page with random keyword stuffing. "Parked domain" on a misspelling of the name "Crate and Barrel". Machine-generated text with several misspelled words.
 
Pages that are only advertising, with very little content. This includes automatically generated pages designed to sell advertising.

This also includes sites that offer catalogs of products that are actually redirecting to other merchants, without providing extra value.
Left: ads. Right: keywords. Left: ads, with ratings probably auto-generated. Right: keywords. Left: ads. Right: more links to other pages full of ads. Fake search engine that always displays the same results, no matter what the query is.
 
Top and right: ads. Bottom: outdated "news" copied from news sources. Fake search engine showing only ads for every query. Another fake search engine showing only advertising.
 
Left: automatically generated links. Main: one line of text (probably copied) about the topic, and a page full of advertising. Top navigation, left navigation, and basically all of the contents of this page are advertising.
 
Pages that automatically redirect users to an unrelated page, i.e., a page different from what is expected based on the URL, anchor text, and/or search result snippet.
This source URL automatically redirects to another URL when scripting is "enabled." Page with script disabled. Page with script enabled.
 
Pages with unrelated links, or exchanging links with too many different, unrelated, partners.
Some of them may also be classified as BORDERLINE if they provide some extra value.
Top: keywords and links. Bottom: links. Bottom: links to Web sites made by this company. This may also be considered borderline, depending on the ratio of original content vs repetitive linking. Left: links to unrelated sites. Bottom: keywords. This may also be considered borderline if the page also offers useful content. Repetitive linking to the internal pages of the site.
 
Pages with unrelated/spam context, i.e., their in-links are all from editable content (blogs, forums, guestbooks, etc) and/or their in-link pages and neighboring pages in the same directory are spam, etc. ...
...
 

Borderline cases (tag as: BORDERLINE)

Pages that are heavily optimized for search engines or for selling advertising, but that also provide some content.
Links to several of its partner sites, but all the sites are topically related. Judge based on the (content for humans/content for search engines) ratio. Heavily optimized porn site (but not all porn sites should be considered spam, unless they use spamming tricks). One paragraph of text; the rest is only ads. The text is stolen from Wikipedia. If you can confirm that the main text is stolen, tag as SPAM. If you suspect this but cannot verify, tag as BORDERLINE. One paragraph of text and the rest of the page is only advertising. The paragraph provides little added value and it might be argued that the Web site is spam. Tag this type of site as borderline if you are suspicious about the content/ads ratio.
 
Pages offering search engine optimization services, random link exchange, affiliate links, etc.
Page offering to participate in a link exchange program, as it is very likely part of a link farm. Page offering original content, a few links to unrelated sites (this is a mild link-farm), and a link to affiliate programs (this is hard to detect, can be tagged as normal). Left: links to various unrelated sites, plus links to participate in affiliate programs. This can be normal, borderline or spam depending on the actual content/link ratio.
 

Web pages that are not spam (tag as: NORMAL)

Pages that do not use Web spam tricks. Quality is not an issue here: these can be high-quality or low-quality resources. The important thing is that they do not use Web spam techniques.
Website with contents. Even when the title has many keywords, they are all related to news. On-line forum. Even when this has lots of pages and little content per page, it provides a service to users. Shopping catalog, as long as the products are sold by the same merchant (like Amazon does), or by different merchants (like eBay does). However, if the page is always acting as a facade/redirect to one specific store, then it is borderline or spam. On-line directory. This has many links but this is OK if the links are carefully, manually selected by human editors.
 

Web sites that cannot be classified (tag as: DON'T KNOW using the button "???")

If you cannot classify a Web site in any of the three categories listed above, use "DON'T KNOW" (the ??? button in the interface). This is a "null" assessment that will not be considered for the final label of a Web site. Use this for:

  • Web sites that you believe you cannot classify.
  • Web sites for which the cached home page and the random pages show only blank pages, or only "40X Not found" messages.
  • Web sites that you cannot access, or that require a password.
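The statement that a DON'T KNOW assessment is not considered for the final label can be illustrated with a small sketch. The guidelines do not say how the remaining assessments are combined, so the majority vote and the tie-breaking rule below are assumptions made purely for illustration, and `final_label` is a hypothetical name.

```python
from collections import Counter
from typing import List, Optional

NULL_LABEL = "DON'T KNOW"


def final_label(assessments: List[str]) -> Optional[str]:
    """Combine assessor labels for one host, ignoring null assessments."""
    # Null assessments are dropped before any combination takes place.
    valid = [label for label in assessments if label != NULL_LABEL]
    if not valid:
        return None  # no usable assessment for this host
    ranked = Counter(valid).most_common()
    best, best_count = ranked[0]
    # Assumption: a tie between the top labels is resolved as BORDERLINE.
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return "BORDERLINE"
    return best
```

For example, the assessments SPAM, DON'T KNOW, SPAM, NORMAL would yield SPAM under this assumed rule, because the null assessment is discarded before the vote.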

Miscellaneous

  • Take your time and look at different aspects: examine the Web sites carefully to check different aspects before making a decision, particularly for sites that are hard to classify.
  • If you are in doubt: If you find that a host uses techniques that perhaps could be considered spamming, but you are not sure about giving the site a SPAM or NORMAL label, choose BORDERLINE.
  • If you cannot classify a site: If you find that you cannot give a label to a site, choose DON'T KNOW (button ???).
  • We are interested in finding hosts that use spamming techniques, so if you find one such page in a host it is correct to classify the entire host as SPAM or BORDERLINE.
  • In the case of mirrors (copies of a host), give the same rating to the mirror as to the main (target) host.
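The last two points, labeling a whole host from its pages and giving mirrors the same rating as their target host, can be sketched as follows. This is a simplification under stated assumptions: the severity ordering and the function names are hypothetical, and the choice between SPAM and BORDERLINE for a host remains the assessor's judgment.

```python
# Assumed ordering: a more severe page label overrides a less severe one.
SEVERITY = {"NORMAL": 0, "BORDERLINE": 1, "SPAM": 2}


def host_label(page_labels):
    """Label a host by its most severe page (one SPAM page marks the host)."""
    if not page_labels:
        return None
    return max(page_labels, key=SEVERITY.__getitem__)


def propagate_to_mirrors(host_labels, mirror_of):
    """Return a copy of host_labels in which each mirror gets its target's label.

    mirror_of maps a mirror host name to the main (target) host name.
    """
    result = dict(host_labels)
    for mirror, target in mirror_of.items():
        if target in host_labels:
            result[mirror] = host_labels[target]
    return result
```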
 

What search engines consider as Web spam

Google

[Source] Webmasters should ...

Yahoo!

[Source] Spam pages are ...

Live Search

[Source] Webmasters should avoid ...

Help for the classification

The classification interface is divided into three panels: the left-top panel is for classifying pages, the left-bottom panel displays information about the selected host, and the right panel shows a preview. The preview panel often shows a cached version of a Web site (the page as the crawler saw it when it was downloaded); this is indicated by a message like this:

This is a cached copy of http://www.example.com/.

Note (added after the assessment process): the cache was built using only the summary version of the collection (up to 400 pages per host), so not all pages in the collection were available from the cache. When a page was not in the cache, a "live" version of the page was shown.

When the cached and live versions do not match (e.g., the Web site has changed completely), base your decision mostly on the cached version, unless you suspect that the Web site is providing the crawler and the users with different views for spam purposes (cloaking). If the Web site you are assessing redirects you to another site, also examine the target of the redirect to assess its spam intention.
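The rule for mismatched cached and live versions can be written down as a small decision helper. This is a rough sketch, not part of the assessment interface: the text-similarity measure (Python's `difflib.SequenceMatcher`), the `threshold` value, and the function names are all assumptions, and whether cloaking is suspected is a judgment the assessor supplies.

```python
from difflib import SequenceMatcher


def versions_match(cached_text, live_text, threshold=0.5):
    """Treat two versions as matching if their text similarity is high enough.

    The threshold is an illustrative assumption, not a value from the guidelines.
    """
    ratio = SequenceMatcher(None, cached_text, live_text).ratio()
    return ratio >= threshold


def version_to_judge(cached_text, live_text, suspect_cloaking=False):
    """Return which version(s) the decision should mostly rest on."""
    if versions_match(cached_text, live_text):
        return "cached"
    # Mismatch: rely on the cached copy unless cloaking is suspected,
    # in which case the differing live view is itself evidence to weigh.
    return "both" if suspect_cloaking else "cached"
```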

Left-Top panel

The left-top panel displays the user ID and has buttons to mark the Web site as "Normal," "Borderline," "Spam," or "???" (can't decide). It also lets you add an optional comment for each host. This comment is not visible to other assessors during the assessment process.

Left-Bottom panel

The left-bottom panel displays the following information about the site:

There will be two evaluation phases: in the first phase, you will see the in-links and out-links of the hosts. In the second phase (after you complete your assessments), you will also see the labels for those linked hosts and you will be able to revise your assessments.

Right panel

On the right panel, a preview of the Web site is shown. We display the home page of each host as it was seen at crawling time. If you continue navigating from there, you will be taken to the current version of the pages.

Hints

Many spam sites use tricks for hijacking your browser, such as opening pop-ups and pop-unders, redirecting, or removing frames. To avoid these problems, we advise that you temporarily enable JavaScript only from www-connex.lip6.fr while you classify pages.

Use full-screen mode (F11 in Firefox or Internet Explorer) to have more space for the classification interface.

For inquiries contact Carlos Castillo