Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/07/09 21:31:08 UTC
[Bug 3584] New: Improvements to score learning system

http://bugzilla.spamassassin.org/show_bug.cgi?id=3584

           Summary: Improvements to score learning system
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: GA
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: henry@stern.ca


I have made some enhancements to the score learning system to get us ready for
the 3.0 release:

* masses/perceptron.c
  - fixed bug where ranges were ignored (an #ifdef was backwards)
  - stopped the network bias from changing during the last iteration, which
    ensures that all scores stay within their ranges
  - scale score ranges when the -t option is used
* masses/runGA
  - enhanced the config file
  - removed the extra svn revert that would cause mk-baseline-results to use
    whichever scoreset was in HEAD
  - all log files now go in their own directory, which keeps things clean when
    you are generating multiple score sets with different parameters
  - changed all instances of "validate" to "test" to reinforce the notion that
    you are using this script to create your final model and using the
    validate-model script for parameter selection
  - using random buckets instead of the simple 10-bucket split algorithm: the
    10-bucket split is also used for the cross validation, so we have in
    effect been tuning the parameters on the test set (we can be more sure
    of our results this way)
  - took out the threshold rotation code
* masses/mk-baseline-results
  - changed validate to test
* masses/validate-model
  - similar to the runGA script, this is for running 10-fold cross validations
  - you control the parameters (ham preference, threshold, score set and number
    of epochs) using masses/config
  - results from the 10 runs are aggregated and stored as
    vm-set-preference-threshold-epochs/validate so that you can use the
    compare-models script to do a statistical analysis of the output
* masses/logs-to-c
  - added --fplog and --fnlog options which allow you to extract the false
    positives and false negatives for later analysis (corpus cleaning, etc.)
* masses/config
  - added HAM_PREFERENCE, THRESHOLD, EPOCHS and NOTE options
* masses/compare-models
  - statistical analysis of the output of the validate-model script
  - used to compare the output of two runs
  - generates mean FP%, FN% and TCR
  - uses a paired-sample t-test at a user-chosen confidence level to test
    whether there is a statistically significant difference between the two
    runs in any of the above-mentioned measures
  - requires Statistics::Distributions
* masses/extract-results
  - extracts the FP, FN, TP and TN numbers from the output of logs-to-c
  - to be used by validate-model to aggregate results
* masses/tenpass/split-log-into-buckets-random
  - randomised version of split-log-into-buckets
* masses/generate-corpus
  - new script that creates your corpus files from the mass-check dumps on the
    rsync server.  It also clears the cache from validate-model (everything
    will need to be recompiled)
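
For illustration, the random split could look roughly like the Python sketch
below.  This is not the Perl in masses/tenpass/split-log-into-buckets-random;
the function name, the seed parameter and the shuffle-then-deal strategy are
assumptions made for the example.

```python
import random


def split_into_buckets_random(lines, n_buckets=10, seed=None):
    """Shuffle the log lines, then deal them round-robin into n_buckets lists.

    Hypothetical helper: the real script may assign lines to buckets
    differently (for example by drawing a random bucket per line).
    """
    rng = random.Random(seed)      # private RNG so a seed gives a repeatable split
    shuffled = list(lines)         # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)

    buckets = [[] for _ in range(n_buckets)]
    for i, line in enumerate(shuffled):
        buckets[i % n_buckets].append(line)
    return buckets
```

Dealing after a shuffle keeps the bucket sizes within one line of each other,
while the membership of each bucket no longer depends on the order of the log.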

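The FP%, FN% and TCR figures can be derived from the four counts that
extract-results pulls out.  The sketch below is an assumption about the
arithmetic, not a copy of the scripts: it takes FP% relative to the ham count
and FN% relative to the spam count (the scripts may normalise differently), and
uses the usual cost-weighted definition TCR = nspam / (lambda * FP + FN), with
lambda being the ham preference.  The function name is made up.

```python
def error_rates(fp, fn, tp, tn, ham_preference=1.0):
    """Return (FP%, FN%, TCR) from raw classification counts.

    Assumed definitions, for illustration only:
      FP% = 100 * FP / (FP + TN)        share of ham flagged as spam
      FN% = 100 * FN / (FN + TP)        share of spam that got through
      TCR = nspam / (lambda * FP + FN)  lambda = ham_preference
    """
    n_ham = fp + tn
    n_spam = fn + tp
    fp_pct = 100.0 * fp / n_ham
    fn_pct = 100.0 * fn / n_spam

    weighted_errors = ham_preference * fp + fn
    # With no errors at all the ratio is unbounded.
    tcr = n_spam / weighted_errors if weighted_errors else float("inf")
    return fp_pct, fn_pct, tcr
```

A higher ham preference makes each false positive count for more, so the same
confusion counts give a lower TCR.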

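The paired-sample t-test in compare-models can be sketched as follows.  This is
a Python illustration under assumptions, not the Perl script: the function
names are hypothetical, and the critical value is passed in by the caller
rather than looked up from a distribution table (the script relies on
Statistics::Distributions for that part).

```python
import math
import statistics


def paired_t_statistic(run_a, run_b):
    """Return (t, degrees_of_freedom) for paired per-fold measurements."""
    if len(run_a) != len(run_b) or len(run_a) < 2:
        raise ValueError("need two equally long runs with at least 2 folds")

    # The test works on the per-fold differences, not on the raw values.
    diffs = [a - b for a, b in zip(run_a, run_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def significantly_different(run_a, run_b, critical_value):
    """Two-sided test: True if |t| exceeds the supplied critical value."""
    t, _ = paired_t_statistic(run_a, run_b)
    return abs(t) > critical_value
```

For 10 folds there are 9 degrees of freedom, and the two-sided critical value
at the 95% confidence level is about 2.262, so a call might look like
significantly_different(fp_run1, fp_run2, 2.262), where fp_run1 and fp_run2
hold the per-fold FP% of the two runs.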
