These per-host feature sets are provided to encourage participation in the Web Spam Challenge 2008. They are also available in Matlab and ARFF (for Weka) formats.
In the data, host IDs are assigned in the same order as in the uk-2007-05.hostnames.txt.gz file. The collection contains 114,529 different hosts, numbered from 0 to 114,528.
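As a rough illustration of how the host IDs line up with the hostnames file, the sketch below builds a list in which index i holds the name of host i. The exact layout of uk-2007-05.hostnames.txt.gz is not described here, so the parser assumes one host per line in ID order, with the hostname as the last whitespace-separated field; the function names are hypothetical.

```python
import gzip
from typing import Iterable, List

# Host count stated for this collection (IDs 0 .. 114,528).
EXPECTED_HOSTS = 114_529


def load_hostnames(lines: Iterable[str]) -> List[str]:
    """Return hostnames so that list index i corresponds to host ID i.

    Assumption: one host per line, in ID order, with the hostname as the
    last whitespace-separated field (so both "host" and "id host" lines work).
    """
    names = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            names.append(fields[-1])
    return names


def load_hostnames_gz(path: str) -> List[str]:
    """Read the gzip-compressed hostnames file from a local path."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return load_hostnames(fh)
```

After loading the real file, comparing `len(names)` against `EXPECTED_HOSTS` is a cheap way to check that the assumed layout matches the data.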
Computed from the graph files. Includes two direct, obvious features: the number of pages in the host and the number of characters in the host name.
Computed from the graph files. Contains link-based features for the hosts, measured in both the home page and the page with the maximum PageRank in each host. Includes in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc. See description.
The list of the url-id of the home page and the page with the maximum PageRank of each host is also available here:
Computed from the graph files. Contains simple numeric transformations of the link-based features for the hosts:
These transformations were found to work better for classification in practice than the raw link-based features. They are mostly ratios between features, such as Indegree/PageRank or TrustRank/PageRank, and log(.) of several features. See description.
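The kind of transformation described above can be sketched as follows. This is only an illustration and not the procedure used to generate the files: the dictionary keys, the output names, and the small constant `eps` used to avoid division by zero and log(0) are all assumptions.

```python
import math
from typing import Dict


def transform_link_features(raw: Dict[str, float], eps: float = 1e-12) -> Dict[str, float]:
    """Derive ratio and log features from raw link-based features.

    Assumptions: `raw` has the hypothetical keys "indegree", "pagerank"
    and "trustrank"; `eps` guards against division by zero and log(0).
    """
    indegree = raw["indegree"]
    pagerank = raw["pagerank"]
    trustrank = raw["trustrank"]
    return {
        "indegree_over_pagerank": indegree / (pagerank + eps),
        "trustrank_over_pagerank": trustrank / (pagerank + eps),
        # log(1 + x) keeps hosts with zero in-links finite.
        "log_indegree": math.log1p(indegree),
        "log_pagerank": math.log(pagerank + eps),
    }
```

Ratios of this form relate one signal to another (for example, how much trust a host receives relative to its overall PageRank), while the logarithm compresses the heavy-tailed distributions typical of degree and PageRank values.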
Computed from the summary version of the contents. These features include number of words in the home page, average word length, average length of the title, etc. for a sample of pages on each host. See description.
Please report any issues you find with these pre-computed features. Remember to subscribe to our mailing list if you use this data. New feature sets and errata about the feature sets are posted to this low-volume, announcements-only mailing list.
For inquiries, contact Carlos Castillo.