Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2005/07/09 00:16:34 UTC

[Spamassassin Wiki] Update of "RunningPerceptron" by HenryStern

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by HenryStern:
http://wiki.apache.org/spamassassin/RunningPerceptron

------------------------------------------------------------------------------
  = Running the Perceptron to generate scores =
  
- If all goes well, the ["Perceptron"] will take over from the GeneticAlgorithm (GA) as the main way we generate scores.  (This text was copied from RescoreTenFcv and needs editing.)
+ Generating scores is a two-step process:  model validation and score generation.  To prepare your environment for running the perceptron, execute the following after setting CORPUS:
  
- Change these lines:
  {{{
-   make clean >> make.output
-   make >> make.output 2>&1
-   ./evolve
-   pwd; date
+   # collect the mass-check logs for each class into masses/ORIG
+   mkdir -p masses/ORIG
+   for CLASS in ham spam; do
+     cat $CORPUS/submit/$CLASS*.log > masses/ORIG/$CLASS.log
+     # link the per-scoreset names (0-3) to the aggregate log
+     for I in $(seq 0 3); do
+       ln -s $CLASS.log masses/ORIG/$CLASS-set$I.log
+     done
+   done
  }}}
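+ 
+ After this runs, masses/ORIG should hold the two aggregate logs plus eight per-scoreset symlinks.  A quick sanity check (purely illustrative):
+ 
+ {{{
+   # expect ham.log and spam.log, plus ham-set0.log .. spam-set3.log
+   # symlinks pointing at them
+   ls -l masses/ORIG/
+ }}}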
  
- to
+ == Model validation ==
  
+ Before generating the final set of scores, you need to pick a configuration for the training program.  To do this, run a series of "ten-fold cross-validations" and use "Student's t-test" to compare their results; the t-test tells you whether the difference in mean error between two configurations is statistically significant or just noise.  Should you be so inclined, you can also use ANOVA to compare more than two result sets at once.  This is left as an exercise for the reader.
- {{{
-   make clean >> make.output
-   make -C perceptron_c clean >> make.output
-   make tmp/tests.h >> make.output 2>&1
-   rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
-   make -C perceptron_c >> make.output
-   ( cd perceptron_c ; ./perceptron -p 0.75 -e 100 )
-   pwd; date
- }}}
  
- Change
+ In the masses directory, you will find a file called "config".  As its name suggests, this is the configuration for the validate-model (and runGA) scripts.  It consists of 5 fields:  SCORESET, HAM_PREFERENCE, THRESHOLD, EPOCHS and NOTE.
+ 
+  * SCORESET is an integer between 0 and 3:  set 0 is the ruleset with bayes and network tests disabled, set 1 enables network tests only, set 2 enables bayes only, and set 3 enables both network tests and bayes.
+  * HAM_PREFERENCE, THRESHOLD and EPOCHS correspond to options passed to perceptron; see its documentation.
+  * NOTE is appended to the name of the directory containing the result sets.
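+ 
+ For illustration, a config matching the example result-set name used below might contain values like these (the field syntax shown here is an assumption; check the comments in the config file itself for the authoritative format):
+ 
+ {{{
+   SCORESET=0
+   HAM_PREFERENCE=2.0
+   THRESHOLD=5.0
+   EPOCHS=100
+   NOTE=before
+ }}}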
  
+ To refine your parameters, iterate:  edit the config file, then run validate-model.  To compare the results of two runs using Student's t-test, use the "compare-models" script.  Each result set is stored in a directory of the form "vm-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE", which holds a file called "validate" containing the aggregated results from the cross-validation.  Run compare-models like so:
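+ 
+ {{{
+   ./compare-models vm-set0-2.0-5.0-100-before/validate vm-set0-2.0-5.0-100-after/validate
+ }}}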
- {{{
-   cp craig-evolve.scores [output]
- }}}
  
- to
+ To speed things up, validate-model caches most of the compiled files in the vm-cache directory.  If you change your logs or any of the scripts used as part of compilation, you will need to rm -rf vm-cache.
  
+ == Score generation ==
- {{{
-   perl -pe 's/^(score\s+\S+\s+)0\s+/$1/gs;' \
-       < perceptron_c/perceptron.scores \
-       > [output]
- }}}
  
- (required to work around an extra digit output by the perceptron app).
+ When you are happy with your configuration, set it in your config file and execute the runGA script.  You will find your results in a directory of the form "gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE".  runGA splits the corpus randomly, using 90% for training and 10% for testing.
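+ 
+ Continuing the hypothetical config values from above, a run might look like this (the gen- directory name simply follows from the config fields):
+ 
+ {{{
+   cd masses
+   ./runGA
+   # results land in the corresponding gen- directory
+   ls gen-set0-2.0-5.0-100-before/
+ }}}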