Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/09/10 03:18:00 UTC

Re: BAYES_* scores - non-monotonic?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hi Alan!

Alan Schwartz writes:
> Lately, running SA 3.0.0 with no rule or score tuning, I have been
> noticing that my false negatives tend to have BAYES_99 matched.
> 
> The scores file lists the following scores for Bayes:
> 
> 50_scores.cf:score BAYES_00 0 0 -1.665 -2.599
> 50_scores.cf:score BAYES_05 0 0 -0.925 -0.413
> 50_scores.cf:score BAYES_20 0 0 -0.730 -1.951
> 50_scores.cf:score BAYES_40 0 0 -0.276 -1.096
> 50_scores.cf:score BAYES_50 0 0 1.567 0.001
> 50_scores.cf:score BAYES_60 0 0 3.515 0.372
> 50_scores.cf:score BAYES_80 0 0 3.608 2.087
> 50_scores.cf:score BAYES_95 0 0 3.514 2.063
> 50_scores.cf:score BAYES_99 0 0 4.070 1.886
> 
> I realize that these scores come out of the automated algorithm,
> but they are not sensible on their face, and suggest a potential
> problem with the Bayesian classifier's operation or the mass
> check.
> 
> Note that even without network tests, BAYES_95 < BAYES_80 and BAYES_60.
> With network tests, BAYES_05 > BAYES_20 and BAYES_40, and
> BAYES_99 < BAYES_95 < BAYES_80.
> 
> It would not be unreasonable to constrain the BAYES_* scores
> so that they are always monotonic in the predicted probability of
> spam. This constraint would likely cause the scores associated with
> other rules to change slightly, but might not reduce the overall
> accuracy of SA in the mass check corpus (perhaps you're in some
> kind of local minimum?)

Yeah, we've noticed that. If I recall correctly, constraining them
generally *doesn't* seem to work out better, possibly because the
BAYES_99 spam is already hitting many other rules. The score
generation tries to minimise rule scores without losing hits, to
keep false positives from having major effects.

I think we tried locked-down BAYES scores, and found *lower* overall
accuracy figures.

I'm not certain, though...
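
For reference, the out-of-order pairs in the quoted table can be listed
mechanically. This is only a minimal sketch in Python and not part of
SpamAssassin: the dict BAYES_SCORES and the helper monotonicity_violations
are hypothetical names, and the values are just the third and fourth
columns copied from the quoted 50_scores.cf lines (Bayes without network
tests, Bayes with network tests).

```python
# Scores copied from the quoted 50_scores.cf lines.
# Each tuple is (Bayes without network tests, Bayes with network tests),
# i.e. the third and fourth score columns. Names here are illustrative only.
BAYES_SCORES = {
    "BAYES_00": (-1.665, -2.599),
    "BAYES_05": (-0.925, -0.413),
    "BAYES_20": (-0.730, -1.951),
    "BAYES_40": (-0.276, -1.096),
    "BAYES_50": (1.567, 0.001),
    "BAYES_60": (3.515, 0.372),
    "BAYES_80": (3.608, 2.087),
    "BAYES_95": (3.514, 2.063),
    "BAYES_99": (4.070, 1.886),
}


def monotonicity_violations(scores, column):
    """Return (lower_rule, higher_rule) pairs where the rule for the
    higher spam probability has the smaller score in the given column."""
    # BAYES_00 .. BAYES_99 sort correctly as plain strings.
    names = sorted(scores)
    return [
        (lower, higher)
        for i, lower in enumerate(names)
        for higher in names[i + 1:]
        if scores[higher][column] < scores[lower][column]
    ]


if __name__ == "__main__":
    for label, column in (("without network tests", 0),
                          ("with network tests", 1)):
        print(label)
        for lower, higher in monotonicity_violations(BAYES_SCORES, column):
            print(f"  {higher} < {lower}")
```

This only reports where the ordering breaks; it says nothing about
whether forcing the order would help or hurt accuracy on the mass-check
corpus.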

- --j.

> I hope this makes sense. I'd be very interested in hearing about
> other experiences with this.
> 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>                        Alan Schwartz <al...@uic.edu>
> Author/Co-author of: "Managing Mailing Lists", "SpamAssassin", 
> "Stopping Spam", and  "Practical Unix & Internet Security, 3rd Ed"
>            Published by O'Reilly Media, Inc. (http://www.oreilly.com)
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBQQDIQTcbUG5Y7woRAit3AKDqtZpmU+8sOJOM7if0uBpqcR3eZgCfTJhN
hwCJk16py5hr7wNEsL1U6OI=
=kcP1
-----END PGP SIGNATURE-----