Posted to users@spamassassin.apache.org by Matt Linzbach <ML...@Merchant-Gould.com> on 2004/12/29 19:04:49 UTC

Bayesian scoring in 3.0

Can someone point me to a thread that would discuss the scoring of the Bayesian rules in 3.0? Specifically, why would BAYES_99 score less than BAYES_95 for bayes+net tests?

TIA

Matt



Re: Bayesian scoring in 3.0

Posted by Jim Maul <jm...@elih.org>.
Matt Linzbach wrote:
> Can someone point me to a thread that would discuss the scoring of the Bayesian rules in 3.0.  Specifically why BAYES_99 would score less than BAYES_95 for bayes+net tests?
> 
> TIA
> 


http://article.gmane.org/gmane.mail.spam.spamassassin.general/61212/match=

just for 1 :)

You might also want to check

http://wiki.apache.org/spamassassin/HowScoresAreAssigned

-Jim

Re: Bayesian scoring in 3.0

Posted by Matt Kettler <mk...@evi-inc.com>.
At 01:04 PM 12/29/2004, Matt Linzbach wrote:
>Can someone point me to a thread that would discuss the scoring of the 
>Bayesian rules in 3.0.  Specifically why BAYES_99 would score less than 
>BAYES_95 for bayes+net tests?

Why would you expect it to be higher? It's a common human perception that 
everything is simple and linear. Unfortunately, nothing about SA scoring is 
simple or linear.

The big thing to keep in mind is that rules are NOT independent entities. They 
are NOT scored based on their individual performance.

The score that BAYES_99 gets is not a function of its own performance, but of 
the performance of every other rule in the ruleset. Furthermore, scores are 
most heavily biased by the rules that match the same messages in the corpus 
test. They are still affected by other rules that don't match any of the 
same messages, just because those rules affect other rules, which 
eventually get around to affecting BAYES_99. (This relationship works a 
bit like a Kevin Bacon number.)

As an example, consider the case where another rule that also performs well is 
added to the system, with similar false positive problems and similar 
spam matches. In this situation the perceptron (or the GA in older versions of 
SA) is going to have to trade off the score between the two rules. 
Generally, it's going to bias heavily towards the rule with the fewest FPs, 
but only because it's trying to tune the scores to get the lowest value 
of (FP+(FN/100)) it can.
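A toy sketch of that trade-off (hypothetical corpus and numbers, and a brute-force grid search standing in for SA's actual perceptron): two rules that mostly hit the same spam, but one of them also fires on some ham. Minimizing FP+(FN/100) pushes nearly all the score onto the cleaner rule.

```python
# Toy illustration (NOT SpamAssassin's real optimizer): two overlapping
# rules; grid-search their scores to minimize FP + FN/100 against a
# required-score threshold. All names and numbers here are made up.

THRESHOLD = 5.0  # score at or above which a message is tagged as spam

# Each message: (hits_rule_a, hits_rule_b, is_spam).
# rule_a and rule_b catch mostly the same spam, but rule_a also
# fires on a few ham messages (false positives).
corpus = (
    [(1, 1, True)] * 90      # spam that both rules catch
    + [(1, 0, True)] * 5     # spam only rule_a catches
    + [(0, 1, True)] * 5     # spam only rule_b catches
    + [(0, 0, False)] * 95   # clean ham neither rule hits
    + [(1, 0, False)] * 5    # ham that rule_a wrongly hits
)

def cost(score_a, score_b):
    """The objective from the post: false positives + false negatives / 100."""
    fp = fn = 0
    for a, b, is_spam in corpus:
        flagged = a * score_a + b * score_b >= THRESHOLD
        if flagged and not is_spam:
            fp += 1
        elif not flagged and is_spam:
            fn += 1
    return fp + fn / 100

# Brute-force search over candidate scores 0.0 .. 6.0 in 0.5 steps.
candidates = [i * 0.5 for i in range(13)]
best = min((cost(sa, sb), sa, sb) for sa in candidates for sb in candidates)
print("best (cost, score_a, score_b):", best)
```

With these made-up numbers the optimizer drives the FP-prone rule_a down to a near-zero score and lets rule_b carry the weight, even though rule_a catches plenty of spam: the few FPs cost 100x more than the missed spam they would prevent.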

In this case, I suspect BAYES was heavily biased by the URIBL rules. Some 
of those have VERY high hit rates and VERY low FP rates, much lower than 
the theoretical 0.5% that BAYES_99 has.