These are the features obtained by stacked graphical learning in:

  Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock,
  Fabrizio Silvestri: "Know your Neighbors: Web Spam Detection using
  the Web Topology". To appear in ACM SIGIR 2007.
  (Was DELIS technical report DELIS-TR-0458, 2006.)
  http://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdf

This is the general procedure for obtaining a feature like this:

1. Train a classifier using link-based and content-based features.
   Classifier: bagging over a C4.5 decision tree ("J48" in weka).
2. Obtain a "spamicity" score for every host in the graph.
3. Calculate the average "spamicity" of the neighbors of each node;
   this is average_spamicity_neighbors_PASS1.
4. Add this as a feature to the data.
5. Train the classifier with this extra feature.
6. Use this new classifier to obtain a "spamicity" score for every host
   in the graph.
7. Calculate the average "spamicity" of the neighbors of each node;
   this is average_spamicity_neighbors_PASS2.

What we actually did to obtain this feature is slightly different:

1. We took the '1: direct features', '2b: transformed link-based
   features' and '3a: content-based features'.
2. We partitioned the data into 10 parts of the same size.
3. We repeated the following a-b-c process for i=1..10:
   a) Remove partition i from the training data. This leaves 9/10 of
      the original instances.
   b) Train a classifier using these instances (training data minus
      part i), obtaining a model [this training was done with the
      classifier with bagging and the asymmetric cost matrix we
      describe in the paper].
   c) Obtain spamicity predictions for all the instances in partition i
      using the obtained model.
4. For each instance in the data, we added a feature consisting of the
   average spamicity prediction of its neighbors.

The second stacked graphical learning feature was obtained by repeating
this process a second time.

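
The 10-partition procedure above can be sketched in Python as follows.
This is only an illustration under stated assumptions, not the code used
to build this file: the real classifier (bagging over J48 with an
asymmetric cost matrix) is replaced by a hypothetical `fit` callable,
`fit_base_rate` is a trivial stand-in model, every host is assumed to be
labeled, partitions are taken by stride rather than at random, a host
with no neighbors is given 0.0, and the second pass is assumed to train
with the pass-1 feature appended, as in the general procedure. All
function names are illustrative.

```python
from typing import Callable, List

Row = List[float]
Model = Callable[[Row], float]            # maps a feature row to a spamicity score
Fit = Callable[[List[Row], List[int]], Model]


def fit_base_rate(X_train: List[Row], y_train: List[int]) -> Model:
    """Hypothetical stand-in classifier: always predicts the training spam rate."""
    rate = sum(y_train) / len(y_train)
    return lambda row: rate


def out_of_fold_scores(X: List[Row], y: List[int], fit: Fit, k: int = 10) -> List[float]:
    """Score every instance with a model trained without that instance's partition."""
    n = len(X)
    scores = [0.0] * n
    for i in range(k):
        test_idx = list(range(i, n, k))   # partition i (assumption: stride split)
        held_out = set(test_idx)
        train_idx = [j for j in range(n) if j not in held_out]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        for j in test_idx:
            scores[j] = model(X[j])
    return scores


def average_neighbor_spamicity(scores: List[float], neighbors: List[List[int]]) -> List[float]:
    """Average the scores of each node's neighbors (assumption: 0.0 if it has none)."""
    return [
        sum(scores[v] for v in nbrs) / len(nbrs) if nbrs else 0.0
        for nbrs in neighbors
    ]


def stacked_features(X: List[Row], y: List[int], neighbors: List[List[int]],
                     fit: Fit, passes: int = 2, k: int = 10) -> List[List[float]]:
    """Return one neighbor-average feature per pass (PASS1, PASS2, ...)."""
    rows = [list(r) for r in X]
    result = []
    for _ in range(passes):
        scores = out_of_fold_scores(rows, y, fit, k)
        feature = average_neighbor_spamicity(scores, neighbors)
        # Assumption: the next pass sees the previous pass's feature as an extra column.
        rows = [r + [f] for r, f in zip(rows, feature)]
        result.append(feature)
    return result
```

Here `neighbors[h]` is the list of host indices adjacent to host `h`;
`stacked_features(X, y, neighbors, fit)` returns two lists, one per pass.
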
Note that we follow this procedure so that every spamicity prediction
on which the extra feature is based was obtained while the corresponding
instance was in the test part of the data. If all of your data is
training data, you should recompute this feature to get better
performance.

=======================================================================================

hostid
    Identifier of the host in the hostgraph.

hostname
    Name of the host, including the port number if different from the
    default (80). Note that some hosts have more than one port open.

average_spamicity_neighbors_PASS1
    Pass 1 of stacked graphical learning.

average_spamicity_neighbors_PASS2
    Pass 2 of stacked graphical learning.
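
As a rough illustration of how the four fields above might be loaded,
the sketch below parses them into records keyed by hostid. The on-disk
layout is an assumption: it supposes a delimited text file whose header
row carries exactly the field names listed here, with a comma as the
default delimiter. `StackedRow` and `parse_stacked_features` are
hypothetical names, and the sample hosts in the usage note are made up.

```python
import csv
import io
from typing import Dict, NamedTuple


class StackedRow(NamedTuple):
    hostid: int
    hostname: str      # may carry ":port" when the port is not 80
    pass1: float       # average_spamicity_neighbors_PASS1
    pass2: float       # average_spamicity_neighbors_PASS2


def parse_stacked_features(text: str, delimiter: str = ",") -> Dict[int, StackedRow]:
    """Parse delimited text with a header row into records keyed by hostid.

    Assumption: the header uses the field names from the description above.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows: Dict[int, StackedRow] = {}
    for rec in reader:
        row = StackedRow(
            hostid=int(rec["hostid"]),
            hostname=rec["hostname"],
            pass1=float(rec["average_spamicity_neighbors_PASS1"]),
            pass2=float(rec["average_spamicity_neighbors_PASS2"]),
        )
        rows[row.hostid] = row
    return rows
```

For example, given the made-up text
"hostid,hostname,average_spamicity_neighbors_PASS1,average_spamicity_neighbors_PASS2"
followed by a row "1,example.co.uk:8080,0.75,0.5", the call
`parse_stacked_features(text)[1].hostname` yields the host name with its
port suffix intact.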