You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spamassassin.apache.org by hs...@apache.org on 2005/04/09 12:31:28 UTC
svn commit: r160676 - in spamassassin/trunk/masses/evolve_metarule: ./
Makefile README evolve_metarule.c preproc.pl
Author: hstern
Date: Sat Apr 9 03:31:25 2005
New Revision: 160676
URL: http://svn.apache.org/viewcvs?view=rev&rev=160676
Log:
* masses/evolve_metarule
* masses/evolve_metarule/evolve_metarule.c
* masses/evolve_metarule/preproc.pl
* masses/evolve_metarule/Makefile
* masses/evolve_metarule/README
This program is used to optimize phrase-based meta rules such as
ADVANCE_FEE and NIGERIAN for performance by selecting a subset of the
candidate body rules.
See bug 4248.
Added:
spamassassin/trunk/masses/evolve_metarule/
spamassassin/trunk/masses/evolve_metarule/Makefile
spamassassin/trunk/masses/evolve_metarule/README
spamassassin/trunk/masses/evolve_metarule/evolve_metarule.c
spamassassin/trunk/masses/evolve_metarule/preproc.pl
Added: spamassassin/trunk/masses/evolve_metarule/Makefile
URL: http://svn.apache.org/viewcvs/spamassassin/trunk/masses/evolve_metarule/Makefile?view=auto&rev=160676
==============================================================================
--- spamassassin/trunk/masses/evolve_metarule/Makefile (added)
+++ spamassassin/trunk/masses/evolve_metarule/Makefile Sat Apr 9 03:31:25 2005
@@ -0,0 +1,13 @@
+CFLAGS=-g -O2 -Wall
+LDFLAGS=-lgaul -lgaul_util -lm
+
+all: evolve_metarule hits.dat rules.dat
+
+hits.dat rules.dat: preproc.pl ../../rules/*.cf ../ham.log ../spam.log
+ perl preproc.pl ../ham.log ../spam.log
+
+evolve_metarule: evolve_metarule.o
+ ${CC} ${LDFLAGS} -o $@ $^
+
+clean:
+ rm -f *.o evolve_metarule hits.dat rules.dat
Added: spamassassin/trunk/masses/evolve_metarule/README
URL: http://svn.apache.org/viewcvs/spamassassin/trunk/masses/evolve_metarule/README?view=auto&rev=160676
==============================================================================
--- spamassassin/trunk/masses/evolve_metarule/README (added)
+++ spamassassin/trunk/masses/evolve_metarule/README Sat Apr 9 03:31:25 2005
@@ -0,0 +1,86 @@
+
+Genetic Algorithm Optimizer for Meta Rules
+
+Henry Stern
+Anti-Spam Engineering
+McAfee International
+Alton House
+Gatehouse Way
+Aylesbury, Bucks
+HP19 8YD
+
+April 9, 2005
+
+1. WHAT IS IT?
+
+This program is used to optimize phrase-based meta rules such as
+ADVANCE_FEE and NIGERIAN for performance by selecting a subset of the
+candidate body rules.
+
+It selects rule sets based on combined hit rates, false positive rates and
+a desired number of rules.
+
+This program requres the GAUL: Genetic Algorithm Utility Library which can be
+obtained from http://gaul.sourceforge.net/. It is licensed under the GPL.
+
+2. OPTIONS
+
+Config parameters:
+ -h hits_file
+ Path to the compressed matrix containing the rule hits.
+ Default: hits.dat
+
+ -r rules_fule
+ Path to the file containing the rule names corresponding to columns in the compressed matrix.
+ Default: rules.dat
+
+Fitness function parameters:
+ -m maximum_relevant_hits
+ Stop counting hits after seeing this many. The rule analogue of this
+ is:
+ ADVANCE_FEE_1, ADVANCE_FEE_2, ..., ADVANCE_FEE_m
+ Default: 4
+
+ -t target_num_rules
+ How many sub-rules should be used by the meta rule.
+ Default: 50
+
+ -e hits_exponent
+ Parameter to the fitness function, how the importance of high numbers of
+ hits is.
+ Default: 3.0
+
+ -p penalty_exponent
+ If rules hit ham, this exponential penalty is applied based on the
+ number of hits.
+ Default: 9.0
+
+GA parameters:
+ -s population_size
+ How many individuals should be used in the simulation.
+ Default: 100
+
+ -g max_generations
+ How many geenrations that the simulation should run for.
+ Default: 10000
+
+ -x crossover_prob
+ The probability of an allele-mixing cross-over.
+ Default: 1.0
+
+ -u mutation_prob
+ The probability of a one-allele mutation.
+ Default: 0.1
+
+3. HOW DOES IT WORK?
+
+For every generation, "Parents" are selected based on their fitness. Parents
+are "mated" to produce two children. The alleles on each chromosome are
+randomly selected, which isn't very biologically plausible, but it works.
+
+Fitness of an individual is evaluated based on hit rates, false positive rates
+and how close the individual is to the target number of rules.
+
+--
+hs
+9/5/2005
Added: spamassassin/trunk/masses/evolve_metarule/evolve_metarule.c
URL: http://svn.apache.org/viewcvs/spamassassin/trunk/masses/evolve_metarule/evolve_metarule.c?view=auto&rev=160676
==============================================================================
--- spamassassin/trunk/masses/evolve_metarule/evolve_metarule.c (added)
+++ spamassassin/trunk/masses/evolve_metarule/evolve_metarule.c Sat Apr 9 03:31:25 2005
@@ -0,0 +1,337 @@
+/* This program uses a genetic algorithm to optimize a phrase-based rule such as
+ * the NIGERIAN or ADVANCE_FEE rule.
+ *
+ * <@LICENSE>
+ * Copyright 2005 Apache Software Foundation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ * </...@LICENSE>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <time.h>
+#include <unistd.h>
+#include <strings.h>
+
+/* GAUL: Genetic Algorithm Utility Library. http://gaul.sourceforge.net/ */
+#include <gaul.h>
+
+/* Config files */
+char * hits_file = "hits.dat"; /* The data file containing the matrix. */
+char * rules_file = "rules.dat"; /* The data file containing rule names. */
+
+/* Fitness function parameters */
+int maximum_relevant_hits = 4; /* How many hits is the rule going to look for. */
+int target_num_rules = 50; /* How many sub-rules would we like the meta rule to use? */
+
+/* The fitness function is based on:
+ * min(num_hits, maximum_relevant_hits)^hits_exponent * count * ...
+ * (if ham: -num_hits ^ penalty_exponent) */
+double hits_exponent = 3.0;
+double penalty_exponent = 9.0;
+
+/* GA parameters */
+/* The number of individuals in the population */
+int population_size = 100;
+/* The maximal number of generations that the simluation is to run for. */
+int max_generations = 10000;
+/* The probability of a cross-over. */
+double crossover_prob = 1.0;
+/* The probability of one boolean allele being switched. */
+double mutation_prob = 0.1;
+
+int num_rules; /* The number of rules being optimized. */
+int max_hits; /* The maximal number of hits in a unique pattern. */
+int num_patterns; /* The number of unique patterns. */
+int * pattern_data; /* A compressed matrix containing the patterns. */
+int * pattern_size_data; /* The width of each row in the above matrix. */
+int * pattern_count_data; /* The number of occurrences of the pattern. */
+int * class_data; /* The class that each pattern belongs to (ham, spam). */
+
+/* Some #defines for fast access to the compressed matrix. */
+#define pattern(i,j) (pattern_data[(i)*max_hits + (j)])
+#define pattern_size(i) (pattern_size_data[(i)])
+#define pattern_count(i) (pattern_count_data[(i)])
+#define class(i) (class_data[(i)])
+
+#define max(x,y) ((x)>(y)?(x):(y))
+#define min(x,y) ((x)<(y)?(x):(y))
+
+/* Loads the compressed matrix into memory. */
+void load_patterns () {
+ FILE * pfile;
+ int p;
+
+ pfile = fopen(hits_file, "r");
+ if ( ! pfile ) {
+ perror (hits_file);
+ exit(1);
+ }
+
+ if ( fscanf (pfile, "%d %d %d", &num_rules, &max_hits, &num_patterns) != 3 ) {
+ fprintf (stderr, "%s missing header.\n", hits_file);
+ exit (1);
+ }
+
+ assert(pattern_data = (int*)malloc(sizeof(int) * max_hits * num_patterns));
+ assert(pattern_size_data = (int*)malloc(sizeof(int) * num_patterns));
+ assert(pattern_count_data = (int*)malloc(sizeof(int) * num_patterns));
+ assert(class_data = (int*)malloc(sizeof(int) * num_patterns));
+
+ for (p = 0; p < num_patterns; p++) {
+ int i;
+
+ if (fscanf(pfile, "%d %d %d", class_data+p, pattern_count_data+p, pattern_size_data+p) != 3) {
+ fprintf (stderr, "%s truncated (entry %d)\n", hits_file, p);
+ exit (1);
+ }
+ assert(pattern_size_data[p] <= max_hits);
+ for (i = 0; i < pattern_size(p); i++) {
+ if (fscanf(pfile, "%d", pattern_data+p*max_hits+i) != 1) {
+ fprintf (stderr, "%s truncated (entry %d)\n", hits_file, p);
+ exit(1);
+ }
+ }
+ }
+}
+
+/* The fitness function is:
+ * sum_{all individuals}
+ * min(num_hits, maximum_relevant_hits)^hits_exponent * count * ...
+ * (if ham: -num_hits ^ penalty_exponent)
+ * / exp(abs(target_num_rules - num_rules_present) / exp(1))
+ * */
+static boolean pattern_score(population *pop, entity *entity) {
+ int i, j, num_hits, num_rules_present;
+
+ entity->fitness = 0;
+
+ /* Count up the number of rules present in this individual's
+ * chromosome. */
+ num_rules_present = 0;
+ for (i = 0; i < num_rules; i++) {
+ if (((boolean*)entity->chromosome[0])[i]) {
+ num_rules_present++;
+ }
+ }
+
+ /* An individual with no rules present in its chromosome has 0 fitness,
+ * so we take a short cut. */
+ if (num_rules_present == 0) {
+ return true;
+ }
+
+ /* Compute the fitness function as described above. */
+ for (i = 0; i < num_patterns; i++) {
+ num_hits = 0;
+ /* This counts how many rules in the chromosome also hit the
+ * pattern */
+ for (j = 0; j < pattern_size(i); j++) {
+ if (((boolean*)entity->chromosome[0])[pattern(i,j)]) {
+ num_hits++;
+ }
+ }
+
+ /* See above for a description of what this does and why it
+ * does it. */
+ entity->fitness += pow(min(num_hits,maximum_relevant_hits),hits_exponent) * pattern_count(i) * (class(i) ? 1 : -pow(num_hits,penalty_exponent));
+ }
+
+ /* This divisor is bound to 1, to prevent overflow. exp(0) is undefined
+ * so we just skip this part (it's unnecessary). */
+ if ( target_num_rules - num_rules_present ) {
+ entity->fitness /= max(exp(fabs(target_num_rules - num_rules_present) / exp(1)),1);
+ }
+
+ /* Negative fitnesses make roulette wheels go owwie. */
+ if ( entity->fitness < 0 ) {
+ entity->fitness = 0;
+ }
+
+ return true;
+}
+
+/* This is to print out the final result. */
+void print_entity (entity * entity) {
+ FILE * rfile;
+ char buf[BUFSIZ];
+ int i;
+ int count = 0;
+ int histogram[2][maximum_relevant_hits+1], num_hits;
+
+ assert(rfile = fopen (rules_file, "r"));
+
+ /* The rules in rules.dat are supposed to correspond to column
+ * numbers. */
+ for (i = 0; i < num_rules; i++) {
+ if ( ! fgets(buf, BUFSIZ, rfile) ) {
+ perror ("fgets");
+ exit (1);
+ }
+
+ /* Print out the rule name if the corresponding allele is
+ * present on the chromosome. */
+ if ( ((boolean *)entity->chromosome[0])[i] == 1 ) {
+ count++;
+ printf ("%s", buf);
+ }
+ }
+
+ printf ("fitness: %f\n", entity->fitness);
+ printf ("rule count: %d\n", count);
+
+ /* Zero the histogram, just in case the compiler han't done it for us. */
+ bzero (histogram, sizeof(histogram));
+
+ /* Compute the histogram by scanning through the training data. */
+ for (i = 0; i < num_patterns; i++) {
+ int j;
+
+ num_hits = 0;
+ for (j = 0; j < pattern_size(i); j++) {
+ if (((boolean*)entity->chromosome[0])[pattern(i,j)]) {
+ num_hits++;
+ if ( num_hits == maximum_relevant_hits ) {
+ break;
+ }
+ }
+ }
+ for (j = 0; j <= num_hits; j++) {
+ histogram[class(i)][j] += pattern_count(i);
+ }
+ }
+
+ /* Print the histogram. */
+ printf ("\t %8s %8s %8s %8s %8s\n",
+ "HAM",
+ "HAM%",
+ "SPAM",
+ "SPAM%",
+ "S/O");
+ for (i = 0; i <= maximum_relevant_hits; i++) {
+ printf (">=%d hits:%8d %8.4f %8d %8.4f %8.4f\n", i,
+ histogram[0][i],
+ 100.0 * histogram[0][i] / histogram[0][0],
+ histogram[1][i],
+ 100.0 * histogram[1][i] / histogram[1][0],
+ ((double)histogram[1][i] / histogram[1][0]) / ((double)histogram[1][i] / histogram[1][0] + (double)histogram[0][i] / histogram[0][0]));
+ }
+}
+
+void usage () {
+ printf ("usage: evolve_metarule [args]\n"
+ "\n"
+ "Config parameters:\n"
+ " -h hits_file\n"
+ " -r rules_fule\n"
+ "\nFitness function parameters:\n"
+ " -m maximum_relevant_hits\n"
+ " -t target_num_rules\n"
+ " -e hits_exponent\n"
+ " -p penalty_exponent\n"
+ "\nGA parameters:\n"
+ " -s population_size\n"
+ " -g max_generations\n"
+ " -x crossover_prob\n"
+ " -u mutation_prob\n"
+ "\n -? = print this help\n"
+ "\n");
+
+ exit(0);
+}
+
+int main (int argc, char ** argv) {
+ population *pop = 0;
+ char arg;
+
+ while ((arg = getopt (argc, argv, "h:r:m:t:e:p:s:g:x:u:?")) != -1) {
+ switch (arg) {
+ case 'h':
+ hits_file = optarg;
+ break;
+ case 'r':
+ rules_file = optarg;
+ break;
+ case 'm':
+ maximum_relevant_hits = atoi(optarg);
+ break;
+ case 't':
+ target_num_rules = atoi(optarg);
+ break;
+ case 'e':
+ hits_exponent = atof(optarg);
+ break;
+ case 'p':
+ penalty_exponent = atof(optarg);
+ break;
+ case 's':
+ population_size = atoi(optarg);
+ break;
+ case 'g':
+ max_generations = atoi(optarg);
+ break;
+ case 'x':
+ crossover_prob = atof(optarg);
+ break;
+ case 'u':
+ mutation_prob = atof(optarg);
+ break;
+ case '?':
+ usage ();
+ }
+ }
+
+ load_patterns();
+
+ random_init();
+ random_seed(time(0));
+
+ pop = ga_genesis_boolean(
+ population_size, /* const int population_size */
+ 1, /* const int num_chromo */
+ num_rules, /* const int len_chromo */
+ NULL, /* GAgeneration_hook generation_hook */
+ NULL, /* GAiteration_hook iteration_hook */
+ NULL, /* GAdata_destructor data_destructor */
+ NULL, /* GAdata_ref_incrementor data_ref_incrementor */
+ pattern_score, /* GAevaluate evaluate */
+ ga_seed_boolean_random, /* GAseed seed */
+ NULL, /* GAadapt adapt */
+ ga_select_one_roulette, /* GAselect_one select_one */
+ ga_select_two_roulette, /* GAselect_two select_two */
+ ga_mutate_boolean_singlepoint, /* GAmutate mutate */
+ ga_crossover_boolean_allele_mixing, /* GAcrossover crossover */
+ ga_replace_by_fitness, /* GAreplace replace */
+ NULL /* vpointer User data */
+ );
+
+ ga_population_set_parameters(
+ pop, /* population *pop */
+ GA_SCHEME_DARWIN, /* const ga_scheme_type scheme */
+ GA_ELITISM_NULL, /* const ga_elitism_type elitism */
+ crossover_prob, /* double crossover */
+ mutation_prob, /* double mutation */
+ 0.0 /* double migration */
+ );
+
+ ga_evolution_steady_state(
+ pop, /* population *pop */
+ max_generations /* const int max_generations */
+ );
+
+ print_entity(ga_get_entity_from_rank(pop, 0));
+
+ return 0;
+}
Added: spamassassin/trunk/masses/evolve_metarule/preproc.pl
URL: http://svn.apache.org/viewcvs/spamassassin/trunk/masses/evolve_metarule/preproc.pl?view=auto&rev=160676
==============================================================================
--- spamassassin/trunk/masses/evolve_metarule/preproc.pl (added)
+++ spamassassin/trunk/masses/evolve_metarule/preproc.pl Sat Apr 9 03:31:25 2005
@@ -0,0 +1,126 @@
+#!/usr/bin/perl -w
+# <@LICENSE>
+# Copyright 2005 Apache Software Foundation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# </...@LICENSE>
+#
+# Produces an output file containing a sparse matrix to be loaded into
+# evolve_metarules. This particular configuration looks for the __FRAUD_AAA
+# rules, but you can change the regex to be whatever you want.
+#
+# Usage: preproc.pl {ham.log} {spam.log} {rules.dat} {hits.dat}
+#
+# Output file format for rules.dat:
+# rule_name
+# ...
+#
+# Output file format (unsigned ascii integers) for hits.dat:
+# num_rules
+# max_hits
+# num_patterns
+# is_spam pattern_count pattern_size (rule_no){pattern_size}
+# ...
+
+use strict;
+
+# Search for matching rules in the SpamAssassin rules directory.
+my %rules;
+open (RULE_OUT, ">", $ARGV[2] || "rules.dat") || die $!;
+foreach my $file (<../../rules/*.cf>) {
+ open (CONFIG, "<", $file) || die $!;
+ while (<CONFIG>) {
+ if (/^(?:header|body|uri|rawbody|full|meta)\s+(__FRAUD_[A-Z]{3})\s/) {
+ $rules{$1} = (scalar keys %rules);
+ printf RULE_OUT "%s\n", $1;
+ }
+ }
+ close (CONFIG) || die $!;
+}
+close (RULE_OUT) || die $!;
+
+# This is to find the pattern hitting the most rules.
+my $largest_pattern = 0;
+
+# ham_patterns: Hash containing all of the unique ham patterns that we have
+# seen so far and how many of each we have seen.
+# ham_pattern_len: How many entries are in each pattern. This is really only
+# here so that I can be lazy later.
+my (%ham_patterns, %ham_pattern_len);
+open (HAM, "<", $ARGV[0] || "ham.log" ) || die $!;
+while (<HAM>) {
+ # Ignore comments.
+ next if /^#/;
+
+ # Rule hits are in the fourth field.
+ my (undef,undef,undef, $test_str, undef) = split /\s/;
+
+ # Extract the relevant rule hits and sort them by column number.
+ my @hits = sort map { $rules{$_} } grep { exists $rules{$_} } split /,/, $test_str;
+
+ # Count the number of occurrences and size of this pattern.
+ $ham_patterns{join (' ', @hits)}++;
+ $ham_pattern_len{join (' ', @hits)} = scalar(@hits);
+
+ # Keep track of the largest pattern that we have seen thus far.
+ if ( scalar(@hits) > $largest_pattern) {
+ $largest_pattern = scalar(@hits);
+ }
+}
+close (HAM);
+delete $ham_patterns{''};
+
+# spam_patterns: Hash containing all of the unique spam patterns that we have
+# seen so far and how many of each we have seen.
+# spam_pattern_len: How many entries are in each pattern. This is really only
+# here so that I can be lazy later.
+my (%spam_patterns, %spam_pattern_len);
+open (SPAM, "<", $ARGV[1] || "spam.log") || die $!;
+while (<SPAM>) {
+ # Ignore comments.
+ next if /^#/;
+
+ # Rule hits are in the fourth field.
+ my (undef,undef,undef, $test_str, undef) = split /\s/;
+
+ # Extract the relevant rule hits and sort them by column number.
+ my @hits = sort map { $rules{$_} } grep { exists $rules{$_} } split /,/, $test_str;
+
+ # Count the number of occurrences and size of this pattern.
+ $spam_patterns{join (' ', @hits)}++;
+ $spam_pattern_len{join (' ', @hits)} = scalar(@hits);
+
+ # Keep track of the largest pattern that we have seen thus far.
+ if ( scalar(@hits) > $largest_pattern) {
+ $largest_pattern = scalar(@hits);
+ }
+}
+close (SPAM);
+delete $spam_patterns{''};
+
+# Write things out to the data file in the format mentioned above.
+open (DAT, ">", $ARGV[3] || "hits.dat") || die $!;
+
+printf DAT "%d\n", scalar(keys %rules);
+printf DAT "%d\n", $largest_pattern;
+printf DAT "%d\n", scalar(keys %ham_patterns) + scalar(keys %spam_patterns);
+
+foreach my $pattern (keys %ham_patterns) {
+ printf DAT "0 %d %d %s\n", $ham_patterns{$pattern}, $ham_pattern_len{$pattern}, $pattern;
+}
+
+foreach my $pattern (keys %spam_patterns) {
+ printf DAT "1 %d %d %s\n", $spam_patterns{$pattern}, $spam_pattern_len{$pattern}, $pattern;
+}
+
+close (DAT);