Posted to users@spamassassin.apache.org by Nigel Wilkinson <ni...@waspz.co.uk> on 2005/03/06 01:08:10 UTC

I don't understand the Bayes scoring logic

Hi folks

Can anyone explain the logic behind this?
Various spam gets tagged by the Bayes check, but scored as follows:

	*  0.4 BAYES_60 BODY: Bayesian spam probability is 60 to 80%
	*      [score: 0.6343]

	*  2.1 BAYES_80 BODY: Bayesian spam probability is 80 to 95%
	*      [score: 0.8695]

	*  1.9 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
	*      [score: 1.0000]

So, a 60-80% probability scores 0.4
an 80-95% probability scores 2.1
and a 99-100% probability scores 1.9

Why does a 99-100% probability score less than an 80-95% probability???

Answers on a postcard please to..............

Cheers
Nigel

Re: I don't understand the Bayes scoring logic

Posted by Matt Kettler <mk...@comcast.net>.
At 07:08 PM 3/5/2005, Nigel Wilkinson wrote:

>Why does a 99-100% probability score less than an 80-95% probability???


This is more-or-less a FAQ in SA now.

Rule scores in SA are not in any way linear.

The scores are not assigned based on each rule's own performance; they come
from tuning the scores of ALL of the rules together so as to minimize the
weighted total of FPs and FNs, with one FP counted as heavily as 100 FNs
(i.e. find the lowest 100*FP + FN).

Because of this, a rule's score reflects not its individual performance
but its interactions with every other rule in the ruleset.

In the case of BAYES_99, it would appear that most spam messages that hit
it also hit a lot of other rules, so SA's score optimizer could sacrifice
some of its score to reduce FPs without introducing a significant
number of FNs. The story may be different for BAYES_80: there the
spams are likely to be more evasive, and might need a higher score from
this rule to avoid large numbers of FNs.

The other off-chance possibility is that there are some misplaced spams in
the corpus the devs used. Actually, there are almost certainly one or two in
the lot, but if there's a decent number of them they can really screw up
the scores.
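
To make the joint-tuning argument concrete, here is a minimal Python sketch.
It is not SA's actual score optimizer. The five-message corpus, RULE_A,
RULE_B and their scores are made up for illustration, the 5.0 threshold is
assumed to be the default, and only the 2.1 and 1.9 values come from the
report quoted at the top of the thread. It simply counts FPs and FNs for a
few candidate score sets.

```python
THRESHOLD = 5.0  # assumed default spam threshold

# Hypothetical hand-labelled corpus: (is_spam, rules the message hit).
CORPUS = [
    (True,  ["BAYES_99", "RULE_A", "RULE_B"]),  # obvious spam, hits a lot
    (True,  ["BAYES_99", "RULE_A"]),
    (True,  ["BAYES_80", "RULE_A"]),            # more evasive spam
    (False, ["BAYES_99", "RULE_B"]),            # ham that Bayes got wrong
    (False, []),
]

def count_errors(scores, corpus=CORPUS, threshold=THRESHOLD):
    """Return (false_positives, false_negatives) for one set of rule scores."""
    fp = fn = 0
    for is_spam, hits in corpus:
        flagged = sum(scores[rule] for rule in hits) >= threshold
        if flagged and not is_spam:
            fp += 1
        elif not flagged and is_spam:
            fn += 1
    return fp, fn

# RULE_A and RULE_B scores are invented; they stay fixed across candidates.
base = {"RULE_A": 3.2, "RULE_B": 1.5}

high_99 = dict(base, BAYES_80=2.1, BAYES_99=3.6)  # "intuitive" high score
low_99  = dict(base, BAYES_80=2.1, BAYES_99=1.9)  # score seen in the report
low_80  = dict(base, BAYES_80=0.4, BAYES_99=1.9)  # BAYES_80 lowered as well

for name, scores in [("high_99", high_99), ("low_99", low_99), ("low_80", low_80)]:
    print(name, count_errors(scores))
```

In this toy corpus the lower BAYES_99 score removes the one FP without
adding any FN, because the spams that hit BAYES_99 are carried over the
threshold by other rules. Lowering BAYES_80 as well loses the evasive spam,
which is why that rule can end up with the larger score.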



Re: I don't understand the Bayes scoring logic

Posted by Bob Proulx <bo...@proulx.com>.
Nigel Wilkinson wrote:
> Why does a 99-100% probability score less than an 80-95% probability???

Because the Bayes engine is not the only factor in classifying a
message as spam.  All of the other rules are factored in too.  A
message with a 99-100% probability is likely to trigger many of the
other SA rules, and the total is enough to push the message over the
5 point threshold.  The scoring program therefore did not need to make
the BAYES_99 score any higher than it did.  I also believe the SA
development team holds that no single rule's score should be too
large, since that can lead to false positives.  It is better to be
conservative and avoid false positives for the masses.

However, *I* don't like seeing the same spam again and again.  With
the default values I would see a spam, train for it, and still see the
same spam again and again because it would only score BAYES_99 and be
below the threshold.  Often this is before it is reported and before
network tests such as RBLs and SURBL can tag the sender.  So I
increase the BAYES_95 and BAYES_99 points to 4.0 and 5.0 for my own
personal use.  That way if the same spam comes through again, as I
know it will, it will get tagged.  But I can't say with any authority
that this won't generate false positives.  I can only say that I have
only myself to blame in that case and also that since I know what it
is doing I won't be surprised by it.

Bob
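
The effect of raising those two scores can be sketched in a few lines of
Python. This is only an illustration of the arithmetic, not SA code: the
helper name is hypothetical, the 5.0 threshold and the 1.9 default for
BAYES_99 are the values mentioned earlier in the thread, and the default
for BAYES_95 is left out because the thread does not give it.

```python
THRESHOLD = 5.0  # the 5 point threshold mentioned above

default_scores = {"BAYES_99": 1.9}                  # from the quoted report
local_scores = {"BAYES_95": 4.0, "BAYES_99": 5.0}   # the raised values

def is_tagged(hits, scores, threshold=THRESHOLD):
    """True if the summed scores of the rules hit reach the threshold."""
    return sum(scores.get(rule, 0.0) for rule in hits) >= threshold

# A retrained spam that hits nothing except BAYES_99:
print(is_tagged(["BAYES_99"], default_scores))  # 1.9 is below 5.0
print(is_tagged(["BAYES_99"], local_scores))    # 5.0 reaches the threshold

# BAYES_95 at 4.0 still needs at least one more point from another rule:
print(is_tagged(["BAYES_95"], local_scores))
```

With the raised score, BAYES_99 alone is enough to tag a message, which is
exactly why a Bayes mistake on a legitimate message would then become a
false positive with nothing else to offset it.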