Fast SpamAssassin Score Learning Tool
Henry Stern
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS Canada
B3H 1W5
henry@stern.ca
January 8, 2004
1. WHAT IS IT?
This program is used to compute scores for SpamAssassin rules. It makes
use of data files generated by the suite of scripts in
spamassassin/masses. The program outputs the generated scores in a file
titled 'perceptron.scores'.
The advantage of this program over that of the genetic algorithm (GA)
implementation in spamassassin/masses/craig_evolve.c is that while the GA
requires several hours to run on high-end machines, the perceptron
requires only about 15 seconds of CPU time on an Athlon XP 1700+ system.
This makes incremental updates and score personalization practical for the
end-user and gives developers a better idea just how useful a new rule is.
2. OPTIONS
There are four options that can be passed to the perceptron program.
-p ham_preference
This increases the number of non-spam messages in the training
set. It does this by adding 1 + (number of tests hit) *
ham_preference instances of non-spam messages to the training set.
This is intended to reduce false positives by encouraging the
training program to look at the harder-to-clasify ham messages
more often. By default, this parameter is 2.0.
-e num_epochs
This parameter sets how many passes the perceptron will make
through the training set before terminating. On each pass, the
training set is shuffled and then iterated through. By default,
it will make 15 passes.
-l learning_rate
This parameter modifies the learning rate of the perceptron. The
error gradient is computed for each instance. This program uses a
logsig activation function y = (1/(1+exp(-x))), so the error
gradient is computed as E(x) = y*(1-y)*(is_spam - y). For
each instance and score hit in the training set, the scores are
modified by adding E(x) / (number of tests hit + 1) *
learning_rate. The default value for this is 2.0, but it can be
whatever you want.
-w weight_decay
To prevent the scores from getting too high (or even forcing them
to if you want), before each epoch, the scores and network bias
are multiplied by the weight decay. This is off by default (1.0),
but it can be useful.
3. HOW DOES IT WORK?
This program implements the "Stochastic Gradient Descent" method of
training a neural network. It uses a single perceptron with a logsig
activation function and maps the weights to SpamAssassin score space.
The perceptron is the simplest form of neural network. It consists of a
transfer function and an activation function. Together, they simulate the
average firing rate of a biological neuron over time.
The transfer function is the sum of the product of weights and inputs.
It simulates the membrane potential of a neuron. When the accumulated
charge on the membrane exceeds a certain threshold, the neuron fires,
sending an electrical impulse down the axon. This implementation uses a
linear transfer function:
[1] f(x) = bias + sum_{i=1}^{n} w_i * x_i
This is quite similar to how the SpamAssasin score system works. If you
set the bias to be 0, w_i to be the score for rule i and x_i to be whether
or not rule i is activated by a given message, the transfer function will
return the score of the message.
The activation function simulates the electical spike travelling down the
axon. It takes the output of the transfer function as input and applies
some sort of transformation to it. This implementation uses a logsig
activation function:
[2] y(x) = 1 / (1 + exp(-f(x)))
This non-linear function constrains the output of the transfer function
between 0 and 1. When plotted, it looks somewhat S-shaped with vertical
asymptotes at 0 as x approaches -infinity and 1 as x approaches infinity.
The slope of the function is greatest at x=0 and tapers off as it
approaches the asymptotes.
Lastly, the performance of the perceptron needs to be measured using an
error function. Two error functions are commonly used: mean squared
error and entropic error. By default, this implementation uses mean
squared error but entropic error may be substituted by adding a compiler
directive.
The most common method of training neural networks is called gradient
descent. It involves iteratively tuning the parameters of the network so
that the mean error rate always decreases. This is done by finding the
direction of steepest descent down the "error gradient," reducing the
value of the error function for the next iteration of the algorithm.
If the transfer function and activation function are both differentiable,
the error gradient of a neural network can be calculated with respect to
the weights and bias. Without getting into calculus, the error gradient
for a perceptron with a linear transfer function, logsig activation
function and mean squared error function is:
[3] E(x) = y(x) * (1-y(x)) * (y_expected - y(x))
The weights are updated using the function:
[4] w_i = w_i + E(x) * x_i * learning_rate
Since the SpamAssassin rule hits are sparse, the basic gradient descent
algorithm is impractical. This implementation uses a variation called
"Stochastic gradient descent." Instead of doing one batch update per
epoch, the training set is randomly walked through, doing incremental
updates. In addition, in this implementation, the learning rate is
modified by the number of rule hits for a given training instance.
Together, these allow good weights to be computed for
infrequently-occurring rules.
Once the perceptron has finished running, the weights are converted to
scores and exported using the familiar file format. Weights are converted
to scores using this function:
[5] score(weight) = -threshold * weight / bias
4. ACKNOWLEDGEMENTS
I would like to thank my PhD supervisor, Michael Shepherd, for not getting
mad at me while I worked on this. I'd also like to thank Thomas
Trappenberg for his invaluable assistance while I was tweaking the
performance of the learning algorithm. I would like to thank Daniel
Quinlan, Justin Mason and Theo Van Dinter for their valuable input and
constructive criticism.
--
hs
8/1/2004