Typical Cases of Web Spam (tag as: SPAM)
|
Pages that are full of keywords, even if they include actual content. |
|
|
|
|
Top: keywords. Bottom: keywords and links. |
Just links with many different keywords. |
The archive of a mailing list, copied to produce more keywords. |
Bottom: hidden text (white text on white background, use CTRL-A to select all text, to see this type of spam). |
| | | | |
Pages that contain machine-generated content or too many misspellings, even if they include actual content. |
|
|
|
Machine-generated page with random keyword stuffing. |
"Parked domain" on a misspelling of the name "Crate and Barrel". |
Machine-generated text with several misspelled words. |
| | | | |
Pages that are only advertising, with very little content. This includes automatically generated pages designed to sell advertising. This also includes sites that offer product catalogs but actually redirect to other merchants, without providing extra value.
|
|
|
|
|
Left: ads. Right: keywords. |
Left: ads, with ratings probably auto-generated. Right: keywords. |
Left: ads. Right: more links to other pages full of ads. |
Fake search engine that always displays the same results, no matter what the query is. |
| | | | |
|
|
|
Top and right: ads. Bottom: outdated "news" copied from news sources. |
Fake search engine showing only ads for every query. |
Another fake search engine showing only advertising. |
|
| | | | |
|
|
Left: automatically generated links. Main: one line of text (probably copied) about the topic, and a page full of advertising. |
Top navigation, left navigation, and essentially all of the content of this page are advertising. |
|
| | | | |
Pages that automatically redirect users to an unrelated page, i.e., a page different from what is expected based on the URL, anchor text, and/or search result snippet. |
|
|
This source URL automatically redirects to another URL when scripting is enabled. Page with scripting disabled. |
Page with scripting enabled. |
| | | | |
Pages with unrelated links, or exchanging links with too many different, unrelated partners. Some of them may also be classified as BORDERLINE if they provide some extra value. |
|
|
|
|
Top: keywords and links. Bottom: links. |
Bottom: links to Web sites made by this company. This may also be considered borderline, depending on the ratio of original content vs repetitive linking. |
Left: links to unrelated sites. Bottom: keywords. This may also be considered borderline if the page also offers useful content. |
Repetitive linking to the internal pages of the site. |
| | | | |
Pages with unrelated/spam context, i.e., their in-links all come from editable content (blogs, forums, guestbooks, etc.) and/or their in-link pages and neighboring pages in the same directory are spam. |
... |
... |
| | | | |
Borderline cases (tag as: BORDERLINE)
|
Pages that are heavily optimized for search engines or for selling advertising, but that also provide some content. |
|
|
|
|
Links to several of its partner sites, but all the sites are topically related. Judge based on the (content for humans/content for search engines) ratio. |
Heavily optimized porn site (but not all porn sites should be considered spam, unless they use spamming tricks). |
One paragraph of text; the rest is only ads. The text is stolen from Wikipedia. If you can confirm that the main text is stolen, tag as SPAM. If you suspect this but cannot verify, tag as BORDERLINE. |
One paragraph of text and the rest of the page is only advertising. The paragraph provides little added value and it might be argued that the Web site is spam. Tag this type of site as borderline if you are suspicious about the content/ads ratio. |
| | | | |
Pages offering search engine optimization services, random link exchange, affiliate links, etc.
|
|
|
|
|
Page offering participation in a link exchange program; it is very likely part of a link farm. |
Page offering original content, a few links to unrelated sites (this is a mild link-farm), and a link to affiliate programs (this is hard to detect, can be tagged as normal). |
Left: links to various unrelated sites, plus links to participate in affiliate programs. This can be normal, borderline or spam depending on the actual content/link ratio. |
|
| | | | |
Web pages that are not spam (tag as: NORMAL)
|
Pages that do not use Web spam tricks. Quality is not an issue here: these can be high-quality or low-quality resources. The important thing is that they do not use Web spam techniques. |
|
|
|
|
Web site with content. Even though the title has many keywords, they are all related to news. |
On-line forum. Even though it has many pages and little content per page, it provides a service to users. |
Shopping catalog, as long as the products are sold by the same merchant (like Amazon does), or by different merchants (like eBay does). However, if the page is always acting as a facade/redirect to one specific store, then it is borderline or spam. |
On-line directory. This has many links but this is OK if the links are carefully, manually selected by human editors. |
| | | | |
Web sites that cannot be classified (tag as: DON'T KNOW using the button "???")
|
|
If you cannot classify a Web site in any of the three categories listed above, use "DON'T KNOW" (the ??? button in the interface). This is a "null" assessment that will not be considered for the final label of a Web site. Use this for:
- Web sites that you believe you cannot classify.
- Web sites for which the cached home page and the random pages show only blank pages, or only "40X Not found" messages.
- Web sites that you cannot access, or that require a password.
|
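The effect of a "null" assessment can be illustrated with a minimal sketch. The guidelines only state that DON'T KNOW is not considered for the final label; how the remaining assessments are combined is not described here, so the majority vote below (and the name `final_label`) is purely an assumption made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional

NULL_LABEL = "DON'T KNOW"  # the "???" button in the interface


def final_label(assessments: Iterable[str]) -> Optional[str]:
    """Combine assessor labels for one Web site, skipping null assessments.

    Assumption: a simple majority vote. The source does not say how the
    final label is actually computed.
    """
    votes = Counter(a for a in assessments if a != NULL_LABEL)
    if not votes:
        return None  # every assessment was null: no final label
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top labels: left undecided here
    return ranked[0][0]


print(final_label(["SPAM", "DON'T KNOW", "SPAM", "NORMAL"]))
```

In this sketch a DON'T KNOW never counts for or against any label; it simply reduces the number of votes available for that site.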
Miscellaneous
|
|
- Take your time and look at different aspects: examine the Web sites carefully and check different aspects before making a decision, particularly for sites that are hard to classify.
- If you are in doubt: if you find that a host uses techniques that could perhaps be considered spamming, but you are not sure whether to give the site a SPAM or NORMAL label, choose BORDERLINE.
- If you cannot classify a site: if you find that you cannot give a label to a site, choose DON'T KNOW (button ???).
- We are interested in finding hosts that use spamming techniques, so if you find one such page in a host, it is correct to classify the entire host as SPAM or BORDERLINE.
- In the case of mirrors (copies of a host), give the same rating to the mirror as to the main (target) host.
|
| | | | |
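The host-level and mirror rules under Miscellaneous can be sketched as follows. This is one possible reading, not a prescribed procedure: it assumes an assessor notes a label per examined page and that the host takes the most severe one. The names `SEVERITY`, `host_label`, and `label_with_mirrors`, and the example host names, are hypothetical.

```python
from typing import Dict, Iterable

# Assumed ordering: a single SPAM or BORDERLINE page is enough to
# give that label to the whole host.
SEVERITY = {"NORMAL": 0, "BORDERLINE": 1, "SPAM": 2}


def host_label(page_labels: Iterable[str]) -> str:
    """Return the most severe label found among a host's examined pages."""
    labels = [label for label in page_labels if label in SEVERITY]
    if not labels:
        return "DON'T KNOW"  # nothing could be classified
    return max(labels, key=lambda label: SEVERITY[label])


def label_with_mirrors(host_labels: Dict[str, str],
                       mirror_of: Dict[str, str]) -> Dict[str, str]:
    """Give each mirror the same label as its main (target) host."""
    resolved = dict(host_labels)
    for mirror, target in mirror_of.items():
        if target in host_labels:
            resolved[mirror] = host_labels[target]
    return resolved


labels = {"shop.example": host_label(["NORMAL", "SPAM", "NORMAL"])}
print(label_with_mirrors(labels, {"mirror.example": "shop.example"}))
```

Taking the maximum severity reflects the rule that one page using spamming techniques is sufficient to label the entire host; whether that page makes the host SPAM or BORDERLINE remains the assessor's judgment.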