
Posted to dev@spamassassin.apache.org by Alan Schwartz <al...@uic.edu> on 2004/09/09 16:45:06 UTC

BAYES_* scores - non-monotonic?

Lately, running SA 3.0.0 with no rule or score tuning, I have been
noticing that my false negatives tend to have BAYES_99 matched.

The scores file lists the following scores for Bayes:

50_scores.cf:score BAYES_00 0 0 -1.665 -2.599
50_scores.cf:score BAYES_05 0 0 -0.925 -0.413
50_scores.cf:score BAYES_20 0 0 -0.730 -1.951
50_scores.cf:score BAYES_40 0 0 -0.276 -1.096
50_scores.cf:score BAYES_50 0 0 1.567 0.001
50_scores.cf:score BAYES_60 0 0 3.515 0.372
50_scores.cf:score BAYES_80 0 0 3.608 2.087
50_scores.cf:score BAYES_95 0 0 3.514 2.063
50_scores.cf:score BAYES_99 0 0 4.070 1.886

I realize that these scores come out of the automated algorithm,
but they are not sensible on their face, and suggest a potential
problem with the Bayesian classifier's operation or the mass
check.

Note that even without network tests, BAYES_95 < BAYES_80 and
BAYES_95 < BAYES_60. With network tests, BAYES_05 > BAYES_20 and
BAYES_40, and BAYES_99 < BAYES_95 < BAYES_80.
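
For anyone who wants to check the ordering mechanically, here is a
minimal Python sketch. It assumes, as the comparison above implies, that
the third value on each score line is the Bayes score without network
tests and the fourth is the Bayes score with network tests. The
dictionary and the function name `inversions` are my own, made up for
illustration; nothing here is part of SpamAssassin.

```python
# Scores copied from the 50_scores.cf listing above.
# Tuple layout (an assumption): (Bayes without network tests,
#                                Bayes with network tests).
BAYES_SCORES = {
    "BAYES_00": (-1.665, -2.599),
    "BAYES_05": (-0.925, -0.413),
    "BAYES_20": (-0.730, -1.951),
    "BAYES_40": (-0.276, -1.096),
    "BAYES_50": (1.567, 0.001),
    "BAYES_60": (3.515, 0.372),
    "BAYES_80": (3.608, 2.087),
    "BAYES_95": (3.514, 2.063),
    "BAYES_99": (4.070, 1.886),
}


def inversions(column):
    """Return (lower_rule, higher_rule) pairs in which the rule for the
    lower probability bucket carries the strictly larger score."""
    names = list(BAYES_SCORES)  # insertion order = increasing probability
    found = []
    for i, lower in enumerate(names):
        for higher in names[i + 1:]:
            if BAYES_SCORES[lower][column] > BAYES_SCORES[higher][column]:
                found.append((lower, higher))
    return found


print("without network tests:", inversions(0))
print("with network tests:   ", inversions(1))
```

Any pair it reports is a place where a higher Bayes probability earns
fewer points than a lower one.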

It would not be unreasonable to constrain the BAYES_* scores
so that they are always monotonic in the predicted probability of
spam. This constraint would likely cause the scores associated with
other rules to change slightly, but might not reduce the overall
accuracy of SA in the mass check corpus (perhaps you're in some
kind of local minimum?)
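
To give a rough picture of what a monotonic set of scores might look
like, here is a Python sketch that applies pool-adjacent-violators
(unweighted isotonic regression) to the with-network column. To be
clear, this is NOT the constraint proposed above: it only smooths the
published numbers after the fact, whereas a constraint inside the score
optimizer would let the other rules' scores readjust. The function name
is hypothetical.

```python
def pool_adjacent_violators(values):
    """Return the non-decreasing sequence closest to `values` in the
    least-squares sense (unweighted isotonic regression)."""
    blocks = []  # each block is [sum_of_values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds
        # the mean of the block just added or merged.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    smoothed = []
    for total, count in blocks:
        smoothed.extend([total / count] * count)
    return smoothed


# With-network scores for BAYES_00 .. BAYES_99, from the listing above.
network_scores = [-2.599, -0.413, -1.951, -1.096, 0.001,
                  0.372, 2.087, 2.063, 1.886]

for original, fitted in zip(network_scores,
                            pool_adjacent_violators(network_scores)):
    print(f"{original:7.3f} -> {fitted:7.3f}")
```

Buckets that violate the ordering are pooled to their common mean, so
tied scores in the output mark where the original list was out of order.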

I hope this makes sense. I'd be very interested in hearing about
other experiences with this.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
                       Alan Schwartz <al...@uic.edu>
Author/Co-author of: "Managing Mailing Lists", "SpamAssassin", 
"Stopping Spam", and  "Practical Unix & Internet Security, 3rd Ed"
           Published by O'Reilly Media, Inc. (http://www.oreilly.com)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Re: BAYES_* scores - non-monotonic?

Posted by Alan Schwartz <al...@uic.edu>.

Quoting Henry Stern (henry@stern.ca):
> I've been thinking a bit about this problem and what we'd have to do to 
> the score learning system to accommodate monotonically increasing weights 
> for discrete representations of continuous-valued attributes.  Rather 
> than increasing the complexity of the learner, a much better solution 
> (from my perspective) is the following:
> 
> * Allow BAYES_00 to have a largeish negative value.
> * Constrain BAYES_05..99 to be >=0.
> * Instead of triggering only BAYES_20 when the output of Bayes.pm is 
> 0.2-0.4, trigger BAYES_00, BAYES_05 and BAYES_20.
> 
> Comments?

Very reasonable and clever approach to have the BAYES_* rules
represent cumulative additional probabilities. I suspect, however,
that they've already tried doing the mass check with 
a modified weak ordering constraint (00 <= 05 <= 20 <= ... <= 99, 
along with BAYES_50 = 0 points), and found that it didn't work
out as expected. I think that's equivalent to your approach.
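
A small Python sketch of why I think the two are equivalent: if a
message in a given bucket triggers every rule from BAYES_00 up to that
bucket, the total Bayes contribution is a running sum of the per-rule
values, and non-negative values for BAYES_05..99 force that running sum
to be weakly ordered. The increments below are invented purely for
illustration, and the function names are hypothetical; this is not how
Bayes.pm or the score learner is actually written.

```python
from itertools import accumulate

BUCKETS = ["BAYES_00", "BAYES_05", "BAYES_20", "BAYES_40", "BAYES_50",
           "BAYES_60", "BAYES_80", "BAYES_95", "BAYES_99"]


def cumulative_totals(increments):
    """Total Bayes contribution per bucket when a message in bucket k
    triggers every rule from BAYES_00 up to and including bucket k."""
    if any(d < 0 for d in increments[1:]):
        raise ValueError("BAYES_05..99 values must be >= 0")
    return list(accumulate(increments))


def increments_from_totals(totals):
    """Inverse mapping: recover per-rule values from per-bucket totals."""
    return [totals[0]] + [b - a for a, b in zip(totals, totals[1:])]


# Hypothetical per-rule values: a largeish negative BAYES_00,
# everything else >= 0. These numbers are made up.
example = [-2.5, 0.5, 0.5, 0.5, 1.0, 0.5, 1.0, 0.5, 0.5]
totals = cumulative_totals(example)

for name, total in zip(BUCKETS, totals):
    print(f"{name}: {total:+.1f}")
```

Going the other way, any weakly ordered set of per-bucket scores can be
turned back into such per-rule values by differencing, which is what
`increments_from_totals` does.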

(I also haven't looked at the code enough to know whether BAYES_50 is
supposed to mean 'there is a 50% chance the message is spam' or '50%
of the evidence from tokens in this message says spam and 50% says ham'
- that is, there's a likelihood ratio of 1. If it's the former, BAYES_50
shouldn't be 0 points, of course, because its score should depend on the
overall base rate of spam vs. ham which varies by recipient. But I sort
of suspect it's the latter.)

