Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2007/02/16 19:30:13 UTC

[Spamassassin Wiki] Update of "RunningGa" by JustinMason


The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RunningGa

New page:
= Running the GA to generate scores =

Firstly, check that the rules and logs are both relatively clean
and ready to use.

Copy/link the full source logs to "ham-full.log" and "spam-full.log"
in the masses directory. Then:

{{{
cd masses

# start from a clean slate
make clean
rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log
svn revert ../rules/50_scores.cf

# point ham.log and spam.log at the full source logs
ln -s ham-full.log ham.log
ln -s spam-full.log spam.log

# build the hit-frequency report for score set 3
make freqs SCORESET=3
less freqs
}}}

Go through the HitFrequencies report in freqs and check for the following (a quick grep is sketched after the list):

 * the ALL_TRUSTED hitrate on spam; this rule should hit only in ham.
 * unfamiliar rules with high ham hitrates; they could be easily forgeable. Comment them out or mark them "tflags nopublish".
 * the NO_RECEIVED hitrate in spam.
 * the NO_RELAYS hitrate in spam.
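
The freqs report is plain text, so grep can pull out the lines in question; a minimal sketch using the rule names above:

{{{
grep -E 'ALL_TRUSTED|NO_RECEIVED|NO_RELAYS' freqs
}}}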

Save a copy of freqs, then generate ranges:

{{{
cp freqs freqs.full
make > make.out 2>&1
less tmp/ranges.data
}}}

Examine tmp/ranges.data and check for the following (a grep sketch follows the list):

 * ranges that are "0.000 0.000 0" for no obvious reason;
 * rules named with a "T_" prefix. These can sometimes slip through if they are used in promoted meta rules; fix the rulesrc source file so the rule name does not carry the "T_" prefix. (That should be the only way a T_ rule can appear in the output; "real" sandbox T_ rules should already be gone, since you deleted the sandbox rule file.)
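
A hedged sketch of both checks, assuming each ranges.data line leads with the "lo hi mutable" triple followed by the rule name:

{{{
grep '^0\.000 0\.000 0 ' tmp/ranges.data
grep ' T_' tmp/ranges.data
}}}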

To prepare your environment for running the rescorer:

{{{
# start fresh, then hard-link the full logs into ORIG and
# create per-scoreset symlinks pointing at them
rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log
mkdir ORIG
for CLASS in ham spam ; do
  ln $CLASS-full.log ORIG/$CLASS.log
  for I in 0 1 2 3 ; do
    ln -s $CLASS.log ORIG/$CLASS-set$I.log
  done
done
}}}
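
A quick sanity check that the links landed as expected:

{{{
ls -l ORIG/
}}}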

== Score generation ==

Copy a config file from "config.set0"/"set1"/"set2"/"set3" to "config", and
execute the runGA script. runGA randomly partitions the corpus, using 90% of
it for training and the remaining 10% for testing.
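
For example, to rescore set 3 (matching the "make freqs SCORESET=3" step above):

{{{
cp config.set3 config
}}}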

Make sure an up-to-date version of perl is used; on the zone, that's
/local/perl586.

{{{
export PATH=/local/perl586/bin:$PATH
nohup bash runGA &
tail -f nohup.out
}}}


Monitor progress. Once the GA is compiled and starts running, if the
FP%/FN% rates are too crappy, it may be worth CTRL-C'ing the runGA
process and running a new one "by hand" with different switches:

{{{
./garescorer -b 5.0 -s 100 -t 5.0
}}}

If you do this, though, you will have to cut and paste the post-GA
commands (in the "POST-GA COMMANDS" section of runGA) by hand!

Once the GA run is complete and you're happy with the accuracy, you will find
your results in a directory of the form
"gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE-ga".

Compare the listed FP%/FN% rates in gen-*/test against gen-*/scores;
gen-*/scores is the output from the perceptron, and should match the
gen-*/test output (which is computed on a separate subset of the mail
messages) to within a few tenths of a percent. This checks (a one-liner
for the comparison is sketched after the list):

 * that the mail messages are diverse enough to avoid overfitting (hence the different test and train sets)
 * that the FP%/FN% computations are not losing precision due to C-vs-Perl floating-point bugs, or a differing idea of what rules are promoted vs not promoted between the C and Perl code.
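
One hedged way to eyeball the two reports side by side, assuming the FP%/FN% summary lines contain a literal "%" character:

{{{
grep '%' gen-*/test gen-*/scores
}}}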

Once you're satisfied, check in ../rules/50_scores.cf and add a comment to the rescoring Bugzilla bug, noting:

 * the "gen-*/test" file contents, with FP%/FN% rate
 * the "gen-*" path for later reference