
These were the guidelines given to the volunteers who labelled the UK-2006 dataset.

General definition of Web spam:
«any deliberate action that is meant to trigger an unjustifiably favorable [ranking], considering the page's true value» (Gyöngyi and García Molina 2005).

When classifying, ask this question: are there aspects of this page that exist mostly to attract and/or redirect traffic? We are interested in detecting Web sites that employ spam techniques.

Typical aspects of sites that do Web spam:
  • Include aspects designed to attract/redirect traffic.
  • Almost always have commercial intent.
  • Rarely offer relevant content for users browsing them.
  • Include many unrelated keywords and links.
  • Use many keywords in the URL.
  • Redirect the user to another page.
  • Create many copies with substantially duplicate content.
  • Hide text by writing in the same color as the background of the page.
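
As a rough illustration of the "many keywords in the URL" aspect above, the following sketch counts keyword-like tokens in a URL. It is not part of the original guidelines: the function names and the cutoff `MANY_KEYWORDS` are hypothetical and chosen only for illustration, and a human judgment is still required.

```python
import re
from urllib.parse import urlsplit

# Hypothetical cutoff, chosen only for illustration (not from the guidelines).
MANY_KEYWORDS = 8


def url_keyword_tokens(url):
    """Split the host name and path of a URL into lowercase alphabetic tokens."""
    parts = urlsplit(url)
    text = f"{parts.hostname or ''} {parts.path}"
    # Runs of letters are tokens; dots, hyphens, slashes and digits act as separators.
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def has_many_url_keywords(url, threshold=MANY_KEYWORDS):
    """Return True when the URL contains at least `threshold` tokens."""
    return len(url_keyword_tokens(url)) >= threshold
```

A URL such as `http://cheap-loans-credit.example.co.uk/buy-cheap-loans.html` yields ten tokens under this scheme, while `http://www.bbc.co.uk/news` yields five. The count is only a hint, since legitimate sites can also have long descriptive URLs.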

Typical Cases of Web Spam (tag as: SPAM)

Pages that are full of keywords, even if they include actual contents.
  • Just links with many different keywords.
  • The archive of a mailing list, copied to produce more keywords.
  • Bottom: hidden text (white text on a white background; use CTRL-A to select all text and reveal this type of spam).
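
The hidden-text trick described above (text in the same color as the background) can sometimes be spotted automatically. The sketch below is an assumption-laden illustration, not the method used for this dataset: it only inspects inline `style` attributes, assumes a white page background unless told otherwise, and ignores stylesheets, `<font>` tags, and scripts. All names are hypothetical.

```python
from html.parser import HTMLParser

# Spellings of white that this sketch treats as the same color.
WHITE_ALIASES = {"white", "#fff", "#ffffff"}


def normalize_color(value):
    """Lowercase a CSS color and map the common spellings of white to one form."""
    value = value.strip().lower()
    return "#ffffff" if value in WHITE_ALIASES else value


def parse_style(style):
    """Parse an inline CSS style string into a {property: value} dict."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip()
    return props


class HiddenTextFinder(HTMLParser):
    """Collect tags whose inline text color equals their background color."""

    def __init__(self, page_background="white"):
        super().__init__()
        self.page_background = page_background
        self.suspicious_tags = []

    def handle_starttag(self, tag, attrs):
        style = parse_style(dict(attrs).get("style") or "")
        color = style.get("color")
        if color is None:
            return
        # Fall back to the assumed page background when the tag sets none.
        background = style.get("background-color", self.page_background)
        if normalize_color(color) == normalize_color(background):
            self.suspicious_tags.append(tag)


def find_hidden_text_tags(html, page_background="white"):
    """Return the names of tags that appear to hide their text."""
    finder = HiddenTextFinder(page_background)
    finder.feed(html)
    finder.close()
    return finder.suspicious_tags
```

For example, `find_hidden_text_tags('<p style="color: #FFF">cheap loans</p>')` flags the `p` tag. Selecting all text with CTRL-A remains the more reliable manual check.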
 
Pages that are only advertising, with very little content. This includes automatically generated pages designed to sell advertising.

This also includes sites that offer catalogs of products that are actually redirecting to other merchants, without providing extra value.
  • Left: ads. Right: keywords.
  • Left: ads, with ratings probably auto-generated. Right: keywords.
  • Left: ads. Right: more links to other pages full of ads.
  • Fake search engine that always displays the same results, no matter what the query is.
  • Top and right: ads. Bottom: outdated "news" copied from news sources.
  • Fake search engine showing only ads for every query.
  • Another fake search engine showing only advertising.
  • Left: automatically generated links. Main: one line of text (probably copied) about the topic, and a page full of advertising.
  • Top navigation, left navigation, and basically all of the contents of this page are advertising.

Pages with unrelated links, or exchanging links with too many different, unrelated partners.
Some of them may also be classified as BORDERLINE if they provide some extra value.
  • Top: keywords and links. Bottom: links.
  • Bottom: links to Web sites made by this company. This may also be considered borderline, depending on the ratio of original content to repetitive linking.
  • Left: links to unrelated sites. Bottom: keywords. This may also be considered borderline if the page also offers useful content.
  • Repetitive linking to the internal pages of the site.
 

Borderline cases (tag as: BORDERLINE)

Pages that are heavily optimized for search engines or for selling advertising, but that also provide some content.
  • Links to several of its partner sites, but all the sites are topically related. Judge based on the ratio of content for humans to content for search engines.
  • Heavily optimized porn site (porn sites should not be considered spam unless they use spamming tricks).
  • One paragraph of text; the rest is only ads. The text is stolen from Wikipedia (if you detect this, mark as SPAM), but as this is not easy to check, you can tag this page as BORDERLINE.
  • One paragraph of text and the rest of the page is only advertising. The paragraph provides little added value and it might be argued that the Web site is spam. Tag this type of site as BORDERLINE if you are suspicious about the content/ads ratio.
 
Pages offering search engine optimization services, random link exchange, affiliate links, etc.
  • Page offering to participate in a link exchange program; it is very likely part of a link farm.
  • Page offering original content, a few links to unrelated sites (this is a mild link farm), and a link to affiliate programs (this is hard to detect, so it can be tagged as NORMAL).
  • Left: links to various unrelated sites, plus links to participate in affiliate programs. This can be NORMAL, BORDERLINE or SPAM depending on the actual content/link ratio.
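
The guidelines leave the content/link ratio to the annotator's judgment. Purely as an illustration, the sketch below estimates what fraction of a page's visible text sits inside links. The class and function names are hypothetical, no threshold is implied by the guidelines, and the measure ignores images, frames, and script-generated content.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count non-whitespace text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.link_chars = 0
        self.other_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # script and style bodies are not visible text
        count = len("".join(data.split()))  # drop all whitespace
        if self.anchor_depth:
            self.link_chars += count
        else:
            self.other_chars += count


def link_text_fraction(html):
    """Return the share of visible text that is link text (0.0 for empty pages)."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    total = counter.link_chars + counter.other_chars
    return counter.link_chars / total if total else 0.0
```

A page consisting only of links gives a fraction of 1.0, while a page with a long article and a few links gives a value near 0. A high value is a reason to look more closely, not a verdict: carefully edited directories are also link-heavy.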
 

Web pages that are not spam (tag as: NORMAL)

Pages that do not use Web spam tricks. Quality is not an issue here: these can be high-quality or low-quality resources. The important thing is that they do not use Web spam techniques.
  • Website with contents. Even when the title has many keywords, they are all related to news.
  • On-line forum. Even when this has lots of pages and little content per page, it provides a service to users.
  • Shopping catalog, as long as the products are sold by the same merchant (like Amazon does), or by different merchants (like eBay does). However, if the page is always acting as a facade/redirect to one specific store, then it is borderline or spam.
  • On-line directory. This has many links, but this is OK if the links are carefully, manually selected by human editors.
 

Web pages that cannot be classified (tag as: DON'T KNOW)

  • Web sites that you cannot access, or that require a password.
  • Web sites that you do not know how to classify.
  • "Parked" domains with a single page (sometimes they show a "this domain is for sale" message with lots of links).
 

What search engines consider as Web spam

Google

[Source] Webmasters should ...

Yahoo!

[Source] Spam pages are ...

MSN Search

[Source] Webmasters should avoid ...

Help for the classification

The interface for the classification is divided into three panels: the left panel is for classifying pages, the center panel displays information about the selected host, and the right panel shows a preview of the site.

Left panel

These are the controls for the left panel:

Center panel

In the center panel, the following information is shown:

The following extra information is also shown. Use it with care, and do not base your decision solely on Google's PageRank or Alexa's TrafficRank, as those can be deceived by the Web spammers we want to detect:

Right panel

On the right panel, a preview of the Web site is shown. We display the home pages of each host as they were seen at crawling time. If you continue navigating from there you'll be taken to the current version of the pages.

Hints

  • Some Web sites are very difficult to classify. In those cases, try checking many different aspects of the Web site before making a decision.
  • Use full-screen mode (F11 in Firefox or Internet Explorer) to have more space for the classification interface.
  • SAVE OFTEN. Many spam sites use tricks for hijacking your browser: opening pop-ups and pop-unders, redirecting, or removing frames. To avoid these problems, we advise temporarily enabling JavaScript only for chato.cl while you classify pages.

For inquiries, contact Carlos Castillo.