Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/07/09 21:31:08 UTC
[Bug 3584] New: Improvements to score learning system
http://bugzilla.spamassassin.org/show_bug.cgi?id=3584
Summary: Improvements to score learning system
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P5
Component: GA
AssignedTo: spamassassin-dev@incubator.apache.org
ReportedBy: henry@stern.ca
I have made some enhancements to the score learning system to get us ready for
the 3.0 release:
* masses/perceptron.c
- fixed bug where ranges were ignored (an #ifdef was backwards)
- made it so that the network bias does not change during the last iteration,
ensuring that all scores stay within their ranges
- scale score ranges when the -t option is used
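The two range-related changes above can be illustrated with a short Python sketch. It is not taken from perceptron.c: the function names are hypothetical, and the scaling rule (multiplying both ends of a range by the ratio of the chosen threshold to the stock threshold of 5.0) is an assumption made only for illustration.

```python
DEFAULT_THRESHOLD = 5.0  # SpamAssassin's stock required score


def scale_range(low, high, threshold, default=DEFAULT_THRESHOLD):
    """Scale both ends of a score range by threshold / default.

    Assumption: this is one plausible reading of "scale score ranges
    when the -t option is used", not the rule perceptron.c applies.
    """
    factor = threshold / default
    return (low * factor, high * factor)


def clamp_score(score, low, high):
    """Force a learned score back inside its permitted range."""
    return max(low, min(high, score))


# Hypothetical rule with range [-1.0, 4.0], trained with a threshold of 2.5.
low, high = scale_range(-1.0, 4.0, threshold=2.5)
print(low, high, clamp_score(5.3, low, high))
```

Clamping only guarantees in-range scores if nothing rescales them afterwards, which is presumably why the bias is held fixed during the last iteration.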
* masses/runGA
- enhanced the config file
- removed the extra svn revert that would cause mk-baseline-results to use
whichever scoreset was in HEAD
- all log files now go in their own directory, which keeps things clean when
you are generating multiple score sets with different parameters
- changed all instances of "validate" to "test" to reinforce the notion that
runGA is for creating your final model, while the validate-model script is
for parameter selection
- now uses random buckets instead of the simple 10-bucket split algorithm.
The 10-bucket split is already used for the cross validation, so reusing it
here would mean the parameters had been tuned on the test set (we can be
more sure of our results this way)
- took out the threshold rotation stuff
* masses/mk-baseline-results
- changed validate to test
* masses/validate-model
- similar to the runGA script, this is for running 10-fold cross validations
- you control the parameters (ham preference, threshold, set and number of
epochs) using masses/config
- results from the 10 runs are aggregated and stored as
vm-set-preference-threshold-epochs/validate with the intent that you will
use the compare-models script to do a statistical analysis of the output
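A 10-fold cross validation of the kind validate-model runs can be sketched roughly as follows. This is an illustrative Python outline, not the script itself: `train` and `evaluate` are hypothetical callbacks standing in for the perceptron run and the logs-to-c evaluation, and the count keys are assumed names.

```python
def cross_validate(buckets, train, evaluate):
    """Hold out each bucket in turn, train on the rest, collect the counts.

    train(items) -> model and evaluate(model, items) -> dict with the keys
    "fp", "fn", "tp", "tn" are hypothetical callbacks supplied by the caller.
    """
    totals = {"fp": 0, "fn": 0, "tp": 0, "tn": 0}
    per_fold = []
    for i, held_out in enumerate(buckets):
        # Training data is every bucket except the one being held out.
        training = [item for j, bucket in enumerate(buckets)
                    if j != i for item in bucket]
        counts = evaluate(train(training), held_out)
        per_fold.append(counts)
        for key in totals:
            totals[key] += counts[key]
    return per_fold, totals
```

Keeping the per-fold counts, rather than only the totals, is what makes a paired comparison between two parameter settings possible later.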
* masses/logs-to-c
- added --fplog and --fnlog options which allow you to extract the false
positives and false negatives for later analysis (corpus cleaning, etc.)
* masses/config
- added HAM_PREFERENCE, THRESHOLD, EPOCHS and NOTE options
* masses/compare-models
- statistical analysis of the output of the validate-model script
- used to compare the output of two runs
- generates mean FP%, FN% and TCR
- uses paired-sample t-test with arbitrary confidence level to test whether
there is a statistically significant difference between the output of the
runs in any of the above-mentioned attributes
- requires Statistics::Distributions
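The statistics named above can be written out in a few lines. The following Python sketch is not the compare-models code: it assumes the usual total cost ratio definition, TCR = nspam / (lambda * FP + FN), with the weight `lam` left as a parameter, and it takes the critical t value as an argument instead of looking it up in a distribution table.

```python
import math
from statistics import mean, stdev


def tcr(n_spam, fp, fn, lam=1.0):
    """Total cost ratio; lam weights a false positive against a false negative."""
    cost = lam * fp + fn
    return math.inf if cost == 0 else n_spam / cost


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b.

    Raises ZeroDivisionError if every paired difference is identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def significantly_different(a, b, critical_t):
    """Two-sided test: is |t| beyond the supplied critical value?"""
    t, _ = paired_t(a, b)
    return abs(t) > critical_t
```

For ten folds there are 9 degrees of freedom, and the two-sided critical value at the 95% confidence level is about 2.262. Each of the ten per-fold FP%, FN% or TCR values from one run is paired with the value from the same fold of the other run.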
* masses/extract-results
- extracts the FP, FN, TP and TN numbers from the output of logs-to-c
- to be used by validate-model to aggregate results
* masses/tenpass/split-log-into-buckets-random
- randomised version of split-log-into-buckets
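A randomised split can be as simple as shuffling the log lines and dealing them out round-robin. The Python sketch below only illustrates that idea and is an assumption about the approach, not a description of how split-log-into-buckets-random is implemented; the function name and the `seed` parameter are hypothetical.

```python
import random


def split_into_random_buckets(lines, n_buckets=10, seed=None):
    """Shuffle the lines, then deal them round-robin into n_buckets lists."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(lines)     # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [shuffled[i::n_buckets] for i in range(n_buckets)]
```

Dealing round-robin after the shuffle keeps the bucket sizes within one line of each other, and every line lands in exactly one bucket.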
* masses/generate-corpus
- new script that creates your corpus files from the mass-check dumps on the
rsync server. It also clears the validate-model cache (everything will
need to be recompiled)