These features were obtained by stacked graphical learning, as described in:

Carlos Castillo, Debora Donato, Aristides Gionis,
Vanessa Murdock, Fabrizio Silvestri:
"Know your Neighbors: Web Spam Detection using the Web Topology".
To appear in ACM SIGIR 2007.
(Was DELIS technical report DELIS-TR-0458, 2006.)
http://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdf

The general procedure for obtaining a feature of this kind is:

1.- Train a classifier using link-based and content-based features
    Classifier: bagging over a C4.5 decision tree ("J48" in Weka)
2.- Obtain a "spamicity" score for every host in the graph
3.- Calculate the average "spamicity" of the neighbors of each node;
    this is average_spamicity_neighbors_PASS1
4.- Add this as a feature to the data
5.- Retrain the classifier with this extra feature
6.- Use this new classifier to obtain a "spamicity" score for every host
    in the graph
7.- Calculate the average "spamicity" of the neighbors of each node;
    this is average_spamicity_neighbors_PASS2
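
Steps 3 and 7 above both average a per-host score over each host's
neighbors. The following is a minimal Python sketch of that averaging,
not the code used to build this file. The function name, the dict-based
inputs and the `default` value for hosts without scored neighbors are
assumptions made for illustration; this description does not state
which links define a neighbor or how isolated hosts were handled.

```python
from typing import Dict, Hashable, Iterable, Optional


def average_neighbor_spamicity(
    spamicity: Dict[Hashable, float],
    neighbors: Dict[Hashable, Iterable[Hashable]],
    default: Optional[float] = None,
) -> Dict[Hashable, Optional[float]]:
    """Return, for each host, the mean spamicity of its neighbors.

    spamicity: host id -> predicted spamicity score
    neighbors: host id -> ids of neighboring hosts (assumed given)
    default:   value for hosts with no scored neighbors (an assumption)
    """
    result: Dict[Hashable, Optional[float]] = {}
    for host, adjacent in neighbors.items():
        # Keep only neighbors that actually have a prediction.
        scores = [spamicity[n] for n in adjacent if n in spamicity]
        result[host] = sum(scores) / len(scores) if scores else default
    return result


# Toy graph with three hosts (made-up scores, for illustration only).
toy_scores = {"a": 0.0, "b": 1.0, "c": 0.5}
toy_neighbors = {"a": ["b", "c"], "b": ["a"], "c": []}
toy_feature = average_neighbor_spamicity(toy_scores, toy_neighbors)
```

In the toy graph, host "a" receives the mean of the scores of "b" and
"c", while host "c" has no neighbors and receives the default value.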

What we actually did to obtain this feature was slightly different:

1. We took the '1: direct features', '2b: transformed link-based
features' and '3a: content-based features'.

2. We partitioned the data into 10 parts of the same size.

3. We repeated the following a-b-c process for i=1..10:

a) Remove partition i from the training data, leaving 9/10 of the
original instances.

b) Train a classifier using these instances (training data minus part
i), obtaining a model [this training was done with the classifier with
bagging and the asymmetric cost matrix described in the paper].

c) Obtain spamicity predictions for all the instances in partition i
using the obtained model.

4. For each instance in the data, we added a feature consisting of the
average spamicity prediction of its neighbors.

The second stacked graphical learning feature was obtained by repeating
this process a second time, with the first-pass feature added to the data.
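
Steps 2 and 3 above produce one held-out prediction per instance. The
following is a minimal Python sketch of that loop, not the code used to
build this file. The function name, the `train` and `predict`
callables, the shuffling and the `seed` parameter are assumptions made
for illustration; the classifier itself (bagging with an asymmetric
cost matrix) is left to the caller.

```python
import random
from typing import Any, Callable, List, Sequence


def out_of_fold_predictions(
    X: Sequence[Any],
    y: Sequence[Any],
    train: Callable[[List[Any], List[Any]], Any],
    predict: Callable[[Any, Any], float],
    k: int = 10,
    seed: int = 0,
) -> List[float]:
    """Predict each instance with a model that never saw it in training."""
    n = len(X)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of instances")

    # Shuffle once, then deal the indices into k parts of near-equal size.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    predictions = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        # a) Remove partition i from the training data.
        train_idx = [j for j in indices if j not in held]
        # b) Train a model on the remaining instances.
        model = train([X[j] for j in train_idx], [y[j] for j in train_idx])
        # c) Predict only the instances in the held-out partition.
        for j in held_out:
            predictions[j] = predict(model, X[j])
    return predictions
```

The returned list holds one spamicity prediction per instance, which
would then be averaged over each instance's neighbors as in step 4.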

Note that we did it this way so that every spamicity prediction on which
the extra feature is based was obtained while the corresponding instance
was in the test part of the data. If all of your data is training data,
you should recompute this feature to get better performance.


=======================================================================================

hostid
Identifier of the host in the hostgraph

hostname
Name of the host, including the port number if it differs from the default (80). Note that some hosts have more than one port open.

average_spamicity_neighbors_PASS1
Average predicted spamicity of the host's neighbors after pass 1 of stacked graphical learning.

average_spamicity_neighbors_PASS2
Average predicted spamicity of the host's neighbors after pass 2 of stacked graphical learning.