Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/03/12 14:48:10 UTC
Re: Better score generation tool
hi Duncan --
that *is* good news ;) can you give a rough idea of what algorithm
it uses?
I'm keen to see results once the "rules" are taken into account, btw.
In my experience it's otherwise quite easy for machine-learning systems
to overfit our training data, and/or to produce exploitable "holes" by
giving negative scores to easily-forged rules.
still, very cool!
--j.
Duncan Findlay writes:
> Good news, everyone!
>
> As part of our 4th year Math & Engineering Design Project, Steven Birk
> and I have been working to develop a better scoring algorithm for
> SpamAssassin.
>
> We've come across an algorithm that shows some great promise:
>
> Using the 3.2.0 logs:
>
> scoreset 0:
>
> # SUMMARY for threshold 5.0:
> # Correctly non-spam: 67528 99.97%
> # Correctly spam: 100519 84.41%
> # False positives: 22 0.03%
> # False negatives: 18564 15.59%
> # TCR(l=50): 6.055889 SpamRecall: 84.411% SpamPrec: 99.978%
>
> # SUMMARY for threshold 3.5:
> # Correctly non-spam: 67446 99.85%
> # Correctly spam: 108479 91.10%
> # False positives: 104 0.15%
> # False negatives: 10604 8.90%
> # TCR(l=50): 7.534991 SpamRecall: 91.095% SpamPrec: 99.904%
>
> scoreset 1:
>
> # SUMMARY for threshold 5.0:
> # Correctly non-spam: 67498 99.92%
> # Correctly spam: 112670 94.61%
> # False positives: 52 0.08%
> # False negatives: 6413 5.39%
> # TCR(l=50): 13.212360 SpamRecall: 94.615% SpamPrec: 99.954%
>
> scoreset 2:
>
> # SUMMARY for threshold 5.0:
> # Correctly non-spam: 67517 99.95%
> # Correctly spam: 115916 97.34%
> # False positives: 33 0.05%
> # False negatives: 3167 2.66%
> # TCR(l=50): 24.721403 SpamRecall: 97.341% SpamPrec: 99.972%
>
> scoreset 3:
>
> # SUMMARY for threshold 5.0:
> # Correctly non-spam: 67518 99.95%
> # Correctly spam: 117809 98.93%
> # False positives: 32 0.05%
> # False negatives: 1274 1.07%
> # TCR(l=50): 41.434586 SpamRecall: 98.930% SpamPrec: 99.973%
>
> # SUMMARY for threshold 5.2:
> # Correctly non-spam: 67521 99.96%
> # Correctly spam: 117727 98.86%
> # False positives: 29 0.04%
> # False negatives: 1356 1.14%
> # TCR(l=50): 42.438703 SpamRecall: 98.861% SpamPrec: 99.975%
>
> These use the same training and validation sets as bug 5270. The
> run time is of the same order of magnitude as the perceptron's.
> (The slow bit is the analog of the logs-to-c script.)
>
> Clearly from the set 0 results, we need to tune the algorithm some
> more to get the threshold of 5.0 to be optimal.
>
> At this point, the algorithm breaks a number of our current score
> generation "rules", so there is room for improvement. (We're working
> on it.)
>
> - Our handling of immutable rules is pretty much broken at this
> point. (We assume all rules are mutable, evaluate the optimal
> threshold value and scale our scores appropriately, and then only
> update the mutable scores for evaluating against the validation
> set. For our purposes, we also assumed BAYES_* is mutable.) I'm not
> sure how hard this will be to fix, or if it's worth it.
>
> - We have no concept of max/min scores or score ranges. Many tests
> get small negative scores and should simply be set to 0. We haven't
> yet figured out what effect this has on the TCR. Also, some scores get
> set really high -- e.g. BAYES_99 is scored 6.1 in scoreset 3. I'm not
> sure people are comfortable with that. There are at least two ways we
> can fix this -- adapting the algorithm to take min/max scores into
> account (hard), or simply capping the scores after they are generated
> (easy). A quick look through the scores and score-ranges-from-freqs
> output suggests that capping will not hurt our performance all that
> much.
>
> Our project is due in a few weeks, and with any luck we'll have a
> complete new score generation system for SpamAssassin.
>
> --
> Duncan Findlay
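For readers unfamiliar with the summary lines quoted above, the TCR, SpamRecall and SpamPrec figures can be recomputed from the four counts. The sketch below is not the script that produced the logs; the function name `summarize` is made up for illustration, and the TCR formula (spam count divided by lambda times false positives plus false negatives) is used here because it reproduces the quoted scoreset 0 numbers.

```python
def summarize(ham_ok, spam_ok, false_pos, false_neg, lam=50):
    """Recompute the summary metrics from raw counts.

    lam is the cost weight of a false positive relative to a
    false negative (the "l=50" in the TCR label).
    """
    n_spam = spam_ok + false_neg          # all spam in the validation set
    n_ham = ham_ok + false_pos            # all non-spam in the validation set
    recall = spam_ok / n_spam             # SpamRecall
    precision = spam_ok / (spam_ok + false_pos)  # SpamPrec
    fp_rate = false_pos / n_ham           # false positive percentage
    tcr = n_spam / (lam * false_pos + false_neg)  # total cost ratio
    return tcr, recall, precision, fp_rate


# Counts from the "scoreset 0, threshold 5.0" summary.
tcr, recall, precision, fp_rate = summarize(67528, 100519, 22, 18564)
print(f"TCR(l=50): {tcr:.6f}  SpamRecall: {recall:.3%}  SpamPrec: {precision:.3%}")
```

With lam=50 a single false positive costs as much as fifty false negatives, which is why the threshold 5.2 run for scoreset 3 gets a higher TCR than threshold 5.0 despite catching slightly less spam.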
Re: Better score generation tool
Posted by Duncan Findlay <du...@debian.org>.
On Mon, Mar 12, 2007 at 01:48:10PM +0000, Justin Mason wrote:
> that *is* good news ;) can you give a rough idea of what algorithm
> it uses?
It's basically a logistic regression algorithm, but optimized for
binary data. It's called Truncated Regularized Iteratively Reweighted
Least Squares (TR-IRLS).
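As a rough illustration of the idea (not the project's code, which has not been posted), TR-IRLS can be sketched as ridge-regularized logistic regression fitted by IRLS, where each weighted least-squares step is solved only approximately by a capped number of conjugate-gradient iterations. Everything below is an assumption made for the example: the function names, the sparse "list of rule indices per message" input format, the default parameters, and the choice to apply the ridge penalty to the bias term as well.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def conjugate_gradient(matvec, b, x0, max_iter, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(x))]
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):          # "truncated": stop after max_iter steps
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def fit_tr_irls(rows, labels, n_features, ridge=1.0, irls_iters=10, cg_iters=20):
    """rows[i] lists the rule indices that hit message i (binary features);
    labels[i] is 1 for spam, 0 for ham. Returns (bias, weights)."""
    beta = [0.0] * (n_features + 1)    # beta[0] is the bias

    def dot(row, b):
        # Binary features: the dot product is just a sum over the hits.
        return b[0] + sum(b[j + 1] for j in row)

    for _ in range(irls_iters):
        mu = [sigmoid(dot(row, beta)) for row in rows]
        w = [max(m * (1.0 - m), 1e-10) for m in mu]
        # Right-hand side X^T (W X beta + (y - mu)) of the IRLS step.
        rhs = [0.0] * (n_features + 1)
        for row, wi, mi, yi in zip(rows, w, mu, labels):
            v = wi * dot(row, beta) + (yi - mi)
            rhs[0] += v
            for j in row:
                rhs[j + 1] += v

        def matvec(b):
            # Computes (X^T W X + ridge * I) b without forming the matrix.
            out = [ridge * bj for bj in b]
            for row, wi in zip(rows, w):
                v = wi * dot(row, b)
                out[0] += v
                for j in row:
                    out[j + 1] += v
            return out

        beta = conjugate_gradient(matvec, rhs, beta, cg_iters)
    return beta[0], beta[1:]


# Toy data: hypothetical rule 0 hits only spam, rule 1 hits only ham.
rows = [[0], [0], [0], [1], [1], [1]]
labels = [1, 1, 1, 0, 0, 0]
bias, weights = fit_tr_irls(rows, labels, n_features=2)
print(bias, weights)
```

Because the features are binary, the matrix-vector products only touch the rules that actually hit each message, which is where the speed on sparse data comes from. The fitted weights are log-odds contributions, so they would still need rescaling before being used as scores against a fixed threshold.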
I'll see if I can get some spare time to at least provide valid scores
that I've optimized (once I work out the min/max bits), even if I
can't commit my scripts yet.
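The "easy" min/max option, capping scores after they are generated, might look something like the sketch below. It is only an illustration: `cap_scores`, the rule name HYPOTHETICAL_RULE, and all range values are invented for the example (only the 6.1 for BAYES_99 comes from this thread), and real ranges would come from the score-ranges-from-freqs output.

```python
def cap_scores(scores, ranges, default_range=(0.0, 5.0)):
    """Clamp each generated score into its allowed [low, high] range.

    scores: dict of rule name -> generated score
    ranges: dict of rule name -> (low, high); rules without an entry
            fall back to default_range (an assumed value, not a
            SpamAssassin default).
    """
    capped = {}
    for rule, score in scores.items():
        low, high = ranges.get(rule, default_range)
        capped[rule] = min(max(score, low), high)
    return capped


# BAYES_99 at 6.1 is the value mentioned for scoreset 3; the small
# negative score and both ranges are made up.
scores = {"BAYES_99": 6.1, "HYPOTHETICAL_RULE": -0.03}
ranges = {"BAYES_99": (0.0, 4.5)}
capped = cap_scores(scores, ranges)
print(capped)
```

Capping after the fact means the remaining scores are no longer optimal for the capped ones, so the TCR would need to be re-evaluated on the validation set afterwards.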
--
Duncan Findlay