Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/03/12 14:48:10 UTC

Re: Better score generation tool

hi Duncan --

that *is* good news ;)   can you give a rough idea of what algorithm
it uses?

I'm keen to see results once the "rules" are taken into account, btw.
in my experience it's otherwise quite easy for machine-learning systems
to overfit against our training data, and/or to produce exploitable
"holes" by offering negative scores for easily-forged rules.

still, very cool!

--j.

Duncan Findlay writes:
> Good news, everyone!
> 
> As part of our 4th year Math & Engineering Design Project, Steven Birk
> and I have been working to develop a better scoring algorithm for
> SpamAssassin.
> 
> We've come across an algorithm that shows some great promise:
> 
> Using the 3.2.0 logs:
> 
> scoreset 0:
> 
> # SUMMARY for threshold 5.0:
> # Correctly non-spam:  67528  99.97%
> # Correctly spam:     100519  84.41%
> # False positives:        22  0.03%
> # False negatives:     18564  15.59%
> # TCR(l=50): 6.055889  SpamRecall: 84.411%  SpamPrec: 99.978%
> 
> # SUMMARY for threshold 3.5:
> # Correctly non-spam:  67446  99.85%
> # Correctly spam:     108479  91.10%
> # False positives:       104  0.15%
> # False negatives:     10604  8.90%
> # TCR(l=50): 7.534991  SpamRecall: 91.095%  SpamPrec: 99.904%
> 
> scoreset 1:
> 
> # SUMMARY for threshold 5.0:
> # Correctly non-spam:  67498  99.92%
> # Correctly spam:     112670  94.61%
> # False positives:        52  0.08%
> # False negatives:      6413  5.39%
> # TCR(l=50): 13.212360  SpamRecall: 94.615%  SpamPrec: 99.954%
> 
> scoreset 2:
> 
> # SUMMARY for threshold 5.0:
> # Correctly non-spam:  67517  99.95%
> # Correctly spam:     115916  97.34%
> # False positives:        33  0.05%
> # False negatives:      3167  2.66%
> # TCR(l=50): 24.721403  SpamRecall: 97.341%  SpamPrec: 99.972%
> 
> scoreset 3:
> 
> # SUMMARY for threshold 5.0:
> # Correctly non-spam:  67518  99.95%
> # Correctly spam:     117809  98.93%
> # False positives:        32  0.05%
> # False negatives:      1274  1.07%
> # TCR(l=50): 41.434586  SpamRecall: 98.930%  SpamPrec: 99.973%
> 
> # SUMMARY for threshold 5.2:
> # Correctly non-spam:  67521  99.96%
> # Correctly spam:     117727  98.86%
> # False positives:        29  0.04%
> # False negatives:      1356  1.14%
> # TCR(l=50): 42.438703  SpamRecall: 98.861%  SpamPrec: 99.975%
> 
> These are using the same training and validation sets as bug 5270. The
> run time is roughly of the same order of magnitude as the
> perceptron. (The slow bit is the analog of the logs-to-c script.)
> 
> Clearly from the set 0 results (TCR 7.53 at threshold 3.5 versus 6.06
> at 5.0), we need to tune the algorithm some more to get the threshold
> of 5.0 to be optimal.
> 
> At this point, the algorithm breaks a number of our current score
> generation "rules", so there is room for improvement. (We're working
> on it).
> 
>  - Our handling of immutable rules is pretty much broken at this
> point. (We assume all rules are mutable, evaluate the optimal
> threshold value and scale our scores appropriately, and then only
> update the mutable scores for evaluating against the validation
> set. For our purposes, we also assumed BAYES_* is mutable.) I'm not
> sure how hard this will be to fix, or if it's worth it.
> 
>  - We have no concept of max/min scores or score ranges. Many tests
> get small negative scores and should simply be set to 0. We haven't
> yet figured out what effect this has on the TCR. Also, some scores get
> set really high -- e.g. BAYES_99 is scored 6.1 in scoreset 3. I'm not
> sure people are comfortable with that. There are at least two ways we
> can fix this: adapting the algorithm to take into account min/max
> scores (hard), or simply capping the scores after they are generated
> (easy). A quick look through the scores and score-ranges-from-freqs
> output suggests that capping will not hurt our performance all that
> much.
> 
> Our project is due in a few weeks, and with any luck we'll have a
> complete new score generation system for SpamAssassin.
> 
> -- 
> Duncan Findlay
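
For reference, the TCR, SpamRecall and SpamPrec figures in the quoted
summaries can be recomputed from the four raw counts. This is a minimal
sketch, assuming the usual definition of total cost ratio (number of
spams divided by lambda times false positives plus false negatives).
The function name `summarize` is illustrative and is not part of
SpamAssassin's tooling.

```python
def summarize(ham_ok, spam_ok, false_pos, false_neg, lam=50):
    """Recompute the summary metrics from the four raw counts.

    ham_ok is accepted for completeness; none of the three metrics use it.
    """
    n_spam = spam_ok + false_neg
    recall = spam_ok / n_spam
    precision = spam_ok / (spam_ok + false_pos)
    # TCR: cost with no filter (every spam delivered) divided by the cost
    # with the filter, where one false positive counts as `lam` missed spams.
    tcr = n_spam / (lam * false_pos + false_neg)
    return tcr, recall, precision


# Counts from the quoted scoreset 0 summary at threshold 5.0.
tcr, recall, precision = summarize(67528, 100519, 22, 18564)
print(f"TCR(l=50): {tcr:.6f}  SpamRecall: {recall:.3%}  SpamPrec: {precision:.3%}")
```

With those counts the sketch gives a TCR of about 6.0559, recall of
84.411% and precision of 99.978%, in line with the quoted summary.
With l=50, a single false positive costs as much as fifty missed spams.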

Re: Better score generation tool

Posted by Duncan Findlay <du...@debian.org>.
On Mon, Mar 12, 2007 at 01:48:10PM +0000, Justin Mason wrote:
> that *is* good news ;)   can you give a rough idea of what algorithm
> it uses?

It's basically a logistic regression algorithm, but optimized for
binary data. It's called Truncated Regularized Iteratively Reweighted
Least Squares (TR-IRLS).
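
As a rough illustration of the idea (not the project's actual scripts),
regularized IRLS for logistic regression repeatedly solves a
ridge-penalized weighted least-squares system, and the "truncated" part
solves that system only approximately, with a capped number of
conjugate-gradient steps. The sketch below assumes each message is
given as the list of rule indices that hit, so only nonzero entries are
touched. All names (`tr_irls`, `conjugate_gradient`, `ridge`, the toy
data) are assumptions made for illustration.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def conjugate_gradient(mul, b, x0, max_iters, tol):
    """Approximately solve A x = b, where mul(v) returns A v (A is SPD)."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, mul(x))]
    d = list(r)
    rr = sum(ri * ri for ri in r)
    for _ in range(max_iters):          # the "truncated" part: capped steps
        if rr < tol:
            break
        ad = mul(d)
        alpha = rr / sum(di * adi for di, adi in zip(d, ad))
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, ad)]
        rr_new = sum(ri * ri for ri in r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return x


def tr_irls(rows, labels, n_rules, ridge=1.0, irls_iters=15, cg_iters=30):
    """rows[i] lists the rule indices that hit on message i; labels are 0/1."""
    n = n_rules + 1                     # last coefficient is the intercept
    rows = [list(r) + [n_rules] for r in rows]
    beta = [0.0] * n
    for _ in range(irls_iters):
        p = [sigmoid(sum(beta[j] for j in r)) for r in rows]
        w = [pi * (1.0 - pi) for pi in p]

        def hess_mul(v):
            # (X^T W X + ridge * I) v; the ridge also covers the intercept
            # here, purely to keep the sketch short.
            out = [ridge * vj for vj in v]
            for r, wi in zip(rows, w):
                s = wi * sum(v[j] for j in r)
                for j in r:
                    out[j] += s
            return out

        # Right-hand side X^T (W X beta + (y - p)) of the Newton/IRLS system.
        rhs = [0.0] * n
        for r, wi, pi, yi in zip(rows, w, p, labels):
            s = wi * sum(beta[j] for j in r) + (yi - pi)
            for j in r:
                rhs[j] += s

        new_beta = conjugate_gradient(hess_mul, rhs, beta, cg_iters, 1e-12)
        step = max(abs(a - b) for a, b in zip(new_beta, beta))
        beta = new_beta
        if step < 1e-6:
            break
    return beta


# Toy data: rule 0 hits only on spam (label 1), rule 1 only on ham (label 0).
rows = [[0]] * 4 + [[1]] * 4
labels = [1] * 4 + [0] * 4
beta = tr_irls(rows, labels, n_rules=2)
print(beta)
```

On the toy data, the coefficient for rule 0 comes out positive, the one
for rule 1 negative, and the intercept close to zero. Turning such
coefficients into SpamAssassin scores would still need the threshold
scaling described in the quoted message, which is not shown here.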

I'll see if I can get some spare time to at least provide valid scores
that I've optimized (once I work out the min/max bits), even if I
can't commit my scripts yet.
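
The "easy" fix mentioned for the min/max problem, capping the scores
after they are generated, might look something like the sketch below.
It is a sketch only: the rule names other than BAYES_99, all of the
ranges, and the function `cap_scores` are made up for illustration and
are not taken from the actual score-ranges-from-freqs output.

```python
def cap_scores(scores, ranges, default_range=(0.0, 5.0)):
    """Clamp each generated score into its allowed (low, high) range."""
    capped = {}
    for rule, score in scores.items():
        low, high = ranges.get(rule, default_range)
        capped[rule] = min(max(score, low), high)
    return capped


# BAYES_99 at 6.1 is the value quoted for scoreset 3; every other rule
# name and every range below is hypothetical.
generated = {
    "BAYES_99": 6.1,
    "HYPOTHETICAL_SPAM_RULE": -0.2,   # small negative score on a spam rule
    "HYPOTHETICAL_NICE_RULE": -1.3,   # rule that is allowed to be negative
}
ranges = {
    "BAYES_99": (0.0, 4.5),
    "HYPOTHETICAL_SPAM_RULE": (0.0, 3.0),
    "HYPOTHETICAL_NICE_RULE": (-2.0, 0.0),
}
capped = cap_scores(generated, ranges)
print(capped)
```

A lower bound of 0 on a spam rule also takes care of the small negative
scores that should simply be set to 0. Since capped scores are no
longer the ones the optimizer produced, the TCR would presumably need
to be re-measured on the validation set afterwards.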

-- 
Duncan Findlay