Posted to commits@spamassassin.apache.org by fe...@apache.org on 2004/01/27 20:19:11 UTC

svn commit: rev 6321 - incubator/spamassassin/trunk/masses

Author: felicity
Date: Tue Jan 27 11:19:10 2004
New Revision: 6321

Added:
   incubator/spamassassin/trunk/masses/README.perceptron
   incubator/spamassassin/trunk/masses/perceptron.c
Log:
bug 2910: looks like these weren't "svn add"ed ... oops!


Added: incubator/spamassassin/trunk/masses/README.perceptron
==============================================================================
--- (empty file)
+++ incubator/spamassassin/trunk/masses/README.perceptron	Tue Jan 27 11:19:10 2004
@@ -0,0 +1,154 @@
+Copyright (c)2003 Henry Stern
+
+Fast SpamAssassin Score Learning Tool
+
+Henry Stern
+Faculty of Computer Science
+Dalhousie University
+6050 University Avenue
+Halifax, NS  Canada
+B3H 1W5
+henry@stern.ca
+
+January 8, 2004
+
+1.  WHAT IS IT?
+
+This program is used to compute scores for SpamAssassin rules.  It makes
+use of data files generated by the suite of scripts in
+spamassassin/masses.  The program outputs the generated scores in a file 
+titled 'perceptron.scores'.
+
+The advantage of this program over the genetic algorithm (GA)
+implementation in spamassassin/masses/craig_evolve.c is that while the GA 
+requires several hours to run on high-end machines, the perceptron 
+requires only about 15 seconds of CPU time on an Athlon XP 1700+ system.
+
+This makes incremental updates and score personalization practical for the
+end-user and gives developers a better idea of just how useful a new rule is.
+
+2.  OPTIONS
+
+There are four options that can be passed to the perceptron program.
+
+  -p ham_preference
+
+	This increases the number of non-spam messages in the training
+	set.  It does this by adding 1 + (number of tests hit) *
+	ham_preference instances of each non-spam message to the training
+	set.  This is intended to reduce false positives by encouraging the
+	training program to look at the harder-to-classify ham messages
+	more often (see the sketch at the end of this section).  By default,
+	this parameter is 2.0.
+
+  -e num_epochs
+
+	This parameter sets how many passes the perceptron will make 
+	through the training set before terminating.  On each pass, the 
+	training set is shuffled and then iterated through.  By default, 
+	it will make 15 passes.
+
+  -l learning_rate
+
+	This parameter modifies the learning rate of the perceptron.  The 
+	error gradient is computed for each instance.  This program uses a 
+	logsig activation function y = (1/(1+exp(-x))), so the error 
+	gradient is computed as E(x) = y*(1-y)*(is_spam - y).  For 
+	each instance and score hit in the training set, the scores are 
+	modified by adding E(x) / (number of tests hit + 1) * 
+	learning_rate.  The default value for this is 2.0, but it can be 
+	whatever you want.
+
+  -w weight_decay
+
+	To keep the scores from getting too high (or even to force them
+	downward if you want), before each epoch the scores and network bias
+	are multiplied by the weight decay.  This is off by default (1.0),
+	but it can be useful.
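+
+As a rough illustration of how the -p option shapes the training set
+(a minimal sketch with made-up counts and a hypothetical helper name;
+the real logic lives in init_wheel() in perceptron.c below):
+
+	/* how many copies of a ham message go into the training set */
+	int ham_replicas (int num_tests_hit, double ham_preference) {
+		return 1 + (int)(num_tests_hit * ham_preference);
+	}
+
+	/* e.g. a ham hitting 4 tests with the default -p 2.0 appears
+	 * 1 + (int)(4 * 2.0) = 9 times; spam messages are not replicated. */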
+
+3.  HOW DOES IT WORK?
+
+This program implements the "Stochastic Gradient Descent" method of 
+training a neural network.  It uses a single perceptron with a logsig 
+activation function and maps the weights to SpamAssassin score space.
+
+The perceptron is the simplest form of neural network.  It consists of a 
+transfer function and an activation function.  Together, they simulate the 
+average firing rate of a biological neuron over time.
+
+The transfer function is the sum of the product of weights and inputs.  
+It simulates the membrane potential of a neuron.  When the accumulated
+charge on the membrane exceeds a certain threshold, the neuron fires,
+sending an electrical impulse down the axon. This implementation uses a
+linear transfer function:
+
+[1]	f(x) = bias + sum_{i=1}^{n} w_i * x_i
+
+This is quite similar to how the SpamAssassin score system works.  If you 
+set the bias to be 0, w_i to be the score for rule i and x_i to be whether 
+or not rule i is activated by a given message, the transfer function will 
+return the score of the message.  
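+
+For concreteness, here is a minimal C sketch of [1] (hypothetical names;
+the corresponding code in perceptron.c below is evaluate_test_nogain()):
+
+	/* bias plus the sum of w_i * x_i, where x_i is 1 if rule i hit */
+	double transfer (double bias, const double *w, const int *x, int n) {
+		double sum = bias;
+		int i;
+
+		for (i = 0; i < n; i++)
+			sum += w[i] * x[i];
+
+		return sum;
+	}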
+
+The activation function simulates the electrical spike travelling down the 
+axon.  It takes the output of the transfer function as input and applies 
+some sort of transformation to it.  This implementation uses a logsig 
+activation function:
+
+[2]	y(x) = 1 / (1 + exp(-f(x)))
+
+This non-linear function constrains the output of the transfer function 
+between 0 and 1.  When plotted, it looks somewhat S-shaped, with horizontal
+asymptotes at y=0 as x approaches -infinity and y=1 as x approaches
+infinity.  The slope of the function is greatest at x=0 and tapers off as
+it approaches the asymptotes.
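+
+The same function as a C sketch (hypothetical name; in perceptron.c below
+this is what evaluate_test() computes on top of the transfer function):
+
+	#include <math.h>
+
+	/* logsig activation: squashes the transfer output into (0, 1) */
+	double logsig (double x) {
+		return 1.0 / (1.0 + exp(-x));
+	}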
+
+Lastly, the performance of the perceptron needs to be measured using an 
+error function.  Two error functions are commonly used:  mean squared 
+error and entropic error.  By default, this implementation uses mean 
+squared error but entropic error may be substituted by adding a compiler 
+directive.
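+(The directive is ENTROPIC_ERROR, e.g. compiling with -DENTROPIC_ERROR;
+the default, LEAST_SQUARES_ERROR, selects mean squared error.  See the
+top of perceptron.c below.)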
+
+The most common method of training neural networks is called gradient
+descent.  It involves iteratively tuning the parameters of the network so
+that the mean error rate always decreases.  This is done by finding the
+direction of steepest descent down the "error gradient," reducing the
+value of the error function for the next iteration of the algorithm.
+
+If the transfer function and activation function are both differentiable,
+the error gradient of a neural network can be calculated with respect to
+the weights and bias.  Without getting into calculus, the error gradient 
+for a perceptron with a linear transfer function, logsig activation 
+function and mean squared error function is:
+
+[3]	E(x) = y(x) * (1-y(x)) * (y_expected - y(x))
+
+The weights are updated using the function:
+
+[4]	w_i = w_i + E(x) * x_i * learning_rate
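+
+Putting [3] and [4] together, one update step looks roughly like the
+following (a minimal sketch reusing the hypothetical transfer() and
+logsig() helpers above; the real loop is train() in perceptron.c below,
+which also divides the step by the number of rules hit plus one):
+
+	/* one stochastic gradient descent step for a single message */
+	void update (double *w, double *bias, const int *x, int n,
+	             int is_spam, double learning_rate) {
+		double y = logsig(transfer(*bias, w, x, n));   /* [1], [2] */
+		double e = y * (1.0 - y) * (is_spam - y);      /* [3] */
+		int i;
+
+		*bias += e * learning_rate;     /* the bias input is always 1 */
+		for (i = 0; i < n; i++)
+			w[i] += e * x[i] * learning_rate;      /* [4] */
+	}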
+
+Since the SpamAssassin rule hits are sparse, the basic gradient descent 
+algorithm is impractical.  This implementation uses a variation called 
+"Stochastic gradient descent."  Instead of doing one batch update per 
+epoch, the training set is randomly walked through, doing incremental 
+updates.  In addition, in this implementation, the learning rate is 
+modified by the number of rule hits for a given training instance.  
+Together, these allow good weights to be computed for 
+infrequently-occurring rules.
+
+Once the perceptron has finished running, the weights are converted to 
+scores and exported using the familiar file format.  Weights are converted 
+to scores using this function:
+
+[5]	score(weight) = -threshold * weight / bias
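+
+In perceptron.c this is the weight_to_score() macro, with the threshold
+fixed at 5.  As a minimal sketch:
+
+	/* map a learned weight back into SpamAssassin score space, per [5] */
+	double weight_to_score (double weight, double bias, double threshold) {
+		return -threshold * weight / bias;
+	}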
+
+4.  ACKNOWLEDGEMENTS
+
+I would like to thank my PhD supervisor, Michael Shepherd, for not getting 
+mad at me while I worked on this.  I'd also like to thank Thomas 
+Trappenberg for his invaluable assistance while I was tweaking the 
+performance of the learning algorithm.  I would like to thank Daniel 
+Quinlan, Justin Mason and Theo Van Dinter for their valuable input and 
+constructive criticism.
+
+--
+hs
+8/1/2004

Added: incubator/spamassassin/trunk/masses/perceptron.c
==============================================================================
--- (empty file)
+++ incubator/spamassassin/trunk/masses/perceptron.c	Tue Jan 27 11:19:10 2004
@@ -0,0 +1,425 @@
+/* Copyright (c)2003  Henry Stern
+ *
+ * This program uses stochastic gradient descent to learn a scoreset for
+ * SpamAssassin.  You'll need to run logs-to-c from spamassassin/masses to
+ * generate the stuff in tmp.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <sys/time.h>
+#include <math.h>
+#include <unistd.h>
+
+#include "tmp/scores.h"
+#include "tmp/tests.h"
+
+/* Ensure that multiple error functions have not been chosen. */
+#ifdef ENTROPIC_ERROR
+#ifdef LEAST_SQUARES_ERROR
+#warning Both entropic error and least squares error chosen.  Using least squares.
+#endif
+#endif
+
+/* Choose the least squares error function by default. */
+#ifndef ENTROPIC_ERROR
+#ifndef LEAST_SQUARES_ERROR
+#define LEAST_SQUARES_ERROR
+#endif
+#endif
+
+#define OUTPUT_FILE "perceptron.scores"
+
+void init_wheel ();
+void destroy_wheel ();
+int get_random_test ();
+void init_weights();
+void destroy_weights ();
+void write_weights (FILE * fp);
+double evaluate_test (int test);
+double evaluate_test_nogain (int test);
+void train (int num_epochs, double learning_rate);
+void usage ();
+
+/* Converts a weight to a SpamAssassin score. */
+#define weight_to_score(x) (-5*(x)/bias)
+#define score_to_weight(x) (-(x)*bias/5)
+
+int wheel_size; /* The number of entries in the roulette wheel (NOT THE
+		   SIZE OF THE ROULETTE_WHEEL ARRAY!). */
+int * roulette_wheel; /* Used for roulette wheel selection. */
+double ham_preference = 2.0;
+
+double * weights; /* The weights of the single-layer perceptron. */
+double bias; /* The network bias for the single-layer perceptron. */
+
+int num_epochs = 15;
+double learning_rate = 2.0;
+double weight_decay = 1.0;
+
+/* Initialize the roulette wheel and populate it, replicating harder-to-classify hams. */
+void init_wheel () {
+	int i;
+	int spam = 0, ham = 0;
+
+	/* cache locations on the roulette wheel for fast lookups */
+	roulette_wheel = (int*)calloc(num_nondup, sizeof(int));
+	wheel_size = 0;
+
+	roulette_wheel[0] = 0;
+
+	for (i = 0; i < num_nondup - 1; i++) {
+		int slot_size = 1;
+
+		/* Hams with more tests are rare and harder to classify but are the
+		 * most important to classify correctly.  They are thus replicated in the
+		 * training set proportionally to their difficulty. */
+		if ( ! is_spam[i] ) {
+			slot_size += (int)(num_tests_hit[i] * ham_preference);
+		}
+
+		/* The database is compressed with all instances mapped in the same place. */
+		slot_size *= tests_count[i];
+		wheel_size += slot_size;
+
+		if ( ! is_spam[i] ) {
+			ham += slot_size;
+		} else {
+			spam++;
+		}
+
+		roulette_wheel[i+1] = roulette_wheel[i] + slot_size;
+	}
+
+	printf ("Modified training set statistics: %d spam, %d ham.\n", spam, ham);
+}
+
+/* Free the resources for the roulette wheel selector. */
+void destroy_wheel () {
+	if ( roulette_wheel ) {
+		free (roulette_wheel);
+		roulette_wheel = 0;
+		wheel_size = 0;
+	}
+}
+
+/* Get a random test using roulette wheel selection.  This is not used anymore. */
+int get_random_test () {
+	int r;
+	int bottom, middle, top;
+
+	r = lrand48() % wheel_size;
+
+	bottom = 0;
+	top = num_nondup - 1;
+	middle = top / 2;
+
+	while (1) {
+		if ( r < roulette_wheel[bottom+1] ) {
+			return bottom;
+		} else if ( r >= roulette_wheel[top] ) {
+			return top;
+		} else if ( r >= roulette_wheel[middle] && r < roulette_wheel[middle+1] ) {
+			return middle;
+		} else if ( r < roulette_wheel[middle] ) {
+			top = middle-1;
+			bottom++;
+			middle = bottom + (top-bottom)/2;
+		} else {
+			bottom = middle+1;
+			top--;
+			middle = bottom + (top-bottom)/2;
+		}
+	}
+}
+
+/* Allocate and initialize the weights over the range [-0.5..0.5] */
+void init_weights () {
+	int i;
+
+	weights = (double*)calloc(num_scores, sizeof(double));
+
+	bias = drand48() - 0.5;
+	for (i = 0; i < num_scores; i++) {
+		weights[i] = drand48() - 0.5;
+	}
+}
+
+/* Free the resources allocated for the weights. */
+void destroy_weights () {
+	if ( weights ) {
+		free(weights);
+		weights = 0;
+	}
+}
+
+/* Writes out the weights in SpamAssassin score space. */
+void write_weights (FILE * fp) {
+	int i;
+	int ga_nn, ga_yy, ga_ny, ga_yn;
+	double nnscore, yyscore, nyscore, ynscore;
+	double threshold = 5;
+
+	ga_nn = ga_yy = ga_ny = ga_yn = 0;
+	nnscore = yyscore = nyscore = ynscore = 0;
+
+	/* Run through all of the instances in the training set and tally up
+	 * the scores. */
+	for (i = 0; i < num_nondup; i++) {
+		double score = weight_to_score(evaluate_test_nogain(i)) + 5;
+
+		if ( score >= threshold && is_spam[i] ) {
+			ga_yy += tests_count[i];
+			yyscore += tests_count[i] * score;
+		} else if ( score < threshold && !is_spam[i] ) {
+			ga_nn += tests_count[i]; 
+			nnscore += tests_count[i] * score;
+		} else if ( score >= threshold && !is_spam[i] ) {
+			ga_ny += tests_count[i];
+			nyscore += tests_count[i] * score;
+		} else {
+			ga_yn += tests_count[i];
+			ynscore += tests_count[i] * score;
+		}
+	}
+
+	/* This is copied from the dump() function in craig-evolve.c.  It 
+	 * outputs some nice statistics about the learned classifier. */
+	fprintf (fp,"\n# SUMMARY for threshold %3.1f:\n", threshold);
+	fprintf (fp,
+			"# Correctly non-spam: %6d  %4.2f%%  (%4.2f%% of non-spam corpus)\n",
+			ga_nn,
+			(ga_nn / (float) num_tests) * 100.0,
+			(ga_nn / (float) num_nonspam) * 100.0);
+	fprintf (fp,
+			"# Correctly spam:     %6d  %4.2f%%  (%4.2f%% of spam corpus)\n",
+			ga_yy,
+			(ga_yy / (float) num_tests) * 100.0,
+			(ga_yy / (float) num_spam) * 100.0);
+	fprintf (fp,
+			"# False positives:    %6d  %4.2f%%  (%4.2f%% of nonspam)\n",
+			ga_ny,
+			(ga_ny / (float) num_tests) * 100.0,
+			(ga_ny / (float) num_nonspam) * 100.0);
+	fprintf (fp,
+			"# False negatives:    %6d  %4.2f%%  (%4.2f%% of spam)\n",
+			ga_yn,
+			(ga_yn / (float) num_tests) * 100.0,
+			(ga_yn / (float) num_spam) * 100.0);
+
+	fprintf (fp,"# Average score for spam:  %3.3f    nonspam: %3.1f\n",(ynscore+yyscore)/((double)(ga_yn+ga_yy)),(nyscore+nnscore)/((double)(ga_nn+ga_ny)));
+	fprintf (fp,"# Average for false-pos:   %3.3f  false-neg: %3.1f\n",(nyscore/(double)ga_ny),(ynscore/(double)ga_yn));
+
+	fprintf (fp,"# TOTAL:              %6d  %3.2f%%\n\n", num_tests, 100.0);
+
+
+	for (i = 0; i < num_scores; i++) {
+		if ( is_mutatable[i] )  {
+			fprintf(fp, "score %-30s 0 %2.3f\n", score_names[i], weight_to_score(weights[i]));
+		} else {
+			fprintf(fp, "score %-30s 0 %2.3f # not mutable\n", score_names[i], range_lo[i]);
+		}
+	}
+}
+
+/* Computes the value of the activation function of the perceptron for
+ * a given input. */
+double evaluate_test (int test) {
+#ifdef LEAST_SQUARES_ERROR
+	return 1/(1+exp(-evaluate_test_nogain(test)));
+#else
+#ifdef ENTROPIC_ERROR
+	return tanh(evaluate_test_nogain(test));
+#endif
+#endif
+}
+
+/* Computes the value of the transfer function (in this case, linear) for
+ * an input defined in tests_hit[test]. */
+double evaluate_test_nogain (int test) {
+	double sum;
+	int i;
+
+	sum = bias;
+
+	for (i = 0; i < num_tests_hit[test]; i++) {
+		sum += weights[tests_hit[test][i]];
+	}
+
+	/* Translate the 'unmutatable' scores to weight space. */
+	sum += score_to_weight(scores[test]);
+
+	return sum;
+}
+
+/* Trains the perceptron using stochastic gradient descent. */
+void train (int num_epochs, double learning_rate) {
+	int epoch, random_test;
+	int i, j;
+	int * tests;
+	double y_out, error, delta;
+
+	/* Initialize and populate an array containing indices of training
+	 * instances.  This is shuffled on every epoch and then iterated
+	 * through.  I had originally planned to use roulette wheel selection,
+	 * but shuffled selection seems to work better. */
+	tests = (int*)calloc(wheel_size, sizeof(int));
+
+	for (i = 0, j = 0; i < num_nondup-1; i++) {
+		for (; j < roulette_wheel[i+1]; j++) {
+			tests[j] = i;
+		}
+	}
+	for (; j < wheel_size; j++) {
+		tests[j] = num_nondup-1;
+	}
+
+	for (epoch = 0; epoch < num_epochs; epoch++) {
+		/* decay the weights on every epoch to smooth out statistical
+		 * anomalies */
+		if ( weight_decay != 1.0 ) {
+			bias *= weight_decay;
+			for (i = 0; i < num_mutable; i++) {
+				weights[i] *= weight_decay;
+			}
+		}
+
+		/* shuffle the training instances */
+		for (i = 0; i < wheel_size-1; i++) {
+			int tmp;
+			int r = lrand48 () % (wheel_size - i);
+
+			tmp = tests[i];
+			tests[i] = tests[r+i];
+			tests[r+i] = tmp;
+		}
+
+		for (j = 0; j < wheel_size; j++) {
+
+			/* select a random test (they have been randomized above) */
+			random_test = tests[j];
+
+			/* compute the output of the network */
+			y_out = evaluate_test(random_test);
+	
+/* compute the error gradient for the logsig node with least squares error */
+#ifdef LEAST_SQUARES_ERROR
+			error = is_spam[random_test] - y_out;
+			delta = y_out * (1-y_out) * error / (num_tests_hit[random_test]+1) * learning_rate;
+#else
+/* compute the error gradient for the tanh node with entropic error */
+#ifdef ENTROPIC_ERROR
+			error = (2.0*is_spam[random_test]-1) - y_out;
+			delta = error / (num_tests_hit[random_test]+1) * learning_rate;
+#endif
+#endif
+	
+			/* adjust the weights to descend the steepest part of the error gradient */
+			bias += delta;
+			for (i = 0; i < num_tests_hit[random_test]; i++) {
+				int idx = tests_hit[random_test][i];
+				weights[idx] += delta;
+	
+				/* Constrain the weights so that nice rules are always <= 0 etc. */
+				if ( range_lo[idx] >= 0 && weights[idx] < 0 ) {
+					weights[idx] = 0;
+				} else if ( range_hi[idx] <= 0 && weights[idx] > 0 ) {
+					weights[idx] = 0;
+				}
+
+				/*
+				if ( weights[idx] < score_to_weight(range_lo[idx]) ) {
+					weights[idx] = score_to_weight(range_lo[idx]);
+				} else if ( weights[idx] > score_to_weight(range_hi[idx]) ) {
+					weights[idx] = score_to_weight(range_hi[idx]);
+				}
+				*/
+			}
+		}
+	}
+
+	free(tests);
+}
+
+void usage () {
+	printf ("usage: perceptron [args]\n"
+			"\n"
+			"  -p ham_preference = adds extra ham to training set multiplied by number of\n"
+			"                      tests hit (2.0 default)\n"
+			"  -e num_epochs = number of epochs to train (15 default)\n"
+			"  -l learning_rate = learning rate for gradient descent (2.0 default)\n"
+			"  -w weight_decay = per-epoch decay of learned weight and bias (1.0 default)\n"
+			"\n");
+	exit(30);
+}
+
+int main (int argc, char ** argv) {
+	struct timeval tv, tv_start, tv_end;
+	long long int t_usec;
+	FILE * fp;
+	int arg;
+
+	/* Read the command line options */
+	while ((arg = getopt (argc, argv, "p:e:l:w:?")) != -1) {
+		switch (arg) {
+			case 'p':
+				ham_preference = atof(optarg);
+				break;
+
+			case 'e':
+				num_epochs = atoi(optarg);
+				break;
+
+			case 'l':
+				learning_rate = atof(optarg);
+				break;
+
+			case 'w':
+				weight_decay = atof(optarg);
+				break;
+
+			case '?':
+				usage();
+				break;
+		}
+	}
+
+	/* Seed the PRNG */
+	gettimeofday (&tv, 0);
+	t_usec = tv.tv_sec * 1000000 + tv.tv_usec;
+	srand48 ((int)t_usec);
+
+	/* Load the instances and score constraints generated by logs-to-c. */
+	loadtests();
+	loadscores();
+
+	/* Replicate instances from the training set to bias against false positives. */
+	init_wheel ();
+
+	/* Allocate and initialize the weight vector. */
+	init_weights ();
+
+	/* Train the network using stochastic gradient descent. */
+	gettimeofday(&tv_start, 0);
+	train(num_epochs, learning_rate);
+	gettimeofday(&tv_end, 0);
+
+	t_usec = tv_end.tv_sec * 1000000 + tv_end.tv_usec -
+		(tv_start.tv_sec *1000000 + tv_start.tv_usec);
+	printf ("Training time = %fs.\n", t_usec / 1000000.0f);
+
+	/* Translate the learned weights to SA score space and output to a file. */
+	fp = fopen (OUTPUT_FILE, "w");
+	if ( fp ) {
+		write_weights(fp);
+		fclose(fp);
+	} else {
+		perror (OUTPUT_FILE);
+	}
+
+	/* Free resources */
+	destroy_weights ();
+	destroy_wheel ();
+	return 0;
+}