Posted to users@spamassassin.apache.org by Nigel Wilkinson <ni...@waspz.co.uk> on 2005/03/06 01:08:10 UTC
I don't understand the Bayes scoring logic
Hi folks
Can anyone explain the logic behind this?
Various spam gets tagged with the Bayes check, but scored as follows:
* 0.4 BAYES_60 BODY: Bayesian spam probability is 60 to 80%
* [score: 0.6343]
* 2.1 BAYES_80 BODY: Bayesian spam probability is 80 to 95%
* [score: 0.8695]
* 1.9 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
* [score: 1.0000]
So, a 60-80% probability scores 0.4,
an 80-95% probability scores 2.1,
and a 99-100% probability scores 1.9.
Why does a 99-100% probability score less than an 80-95% probability???
Answers on a postcard please to..............
Cheers
Nigel
Re: I don't understand the Bayes scoring logic
Posted by Matt Kettler <mk...@comcast.net>.
At 07:08 PM 3/5/2005, Nigel Wilkinson wrote:
>Why does a 99-100% probability score less than an 80-95% probability???
This is more-or-less a FAQ in SA now.
Rule scores in SA are not in any way linear.
The scores are not assigned based on performance. They come from tuning
the scores of ALL of the rules together so as to minimize the total of
FPs and FNs, with FPs weighted about 100:1 (i.e. find the lowest 100*FP + FN).
Because of this, rule scores are not assigned based on the performance of
one individual rule, but on its interactions with every other rule in the
ruleset.
In the case of BAYES_99, it would appear that most spam messages that hit
it also hit a lot of other rules, so SA's score optimizer could sacrifice
some of its score to reduce FPs without introducing a significant
number of FNs. However, the story may be different for BAYES_80. Here the
spams are likely to be more evasive, and might need a higher score from
this rule to avoid large numbers of FNs.
The other, less likely possibility is that there are some misplaced spams in
the corpus the devs used. Actually, there are almost certainly one or two in
the lot, but if there's a decent number of them they can really screw up
the scores.
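To make the idea of tuning all the scores together a bit more concrete, here is a toy sketch in Python. It is not SA's real optimizer or corpus: the messages, the RULE_A/RULE_B/RULE_C names and their scores are all made up, and the weight of 100 on false positives is just an assumption taken from the ratio mentioned above. It only shows how a "non-linear" set of Bayes scores can give a lower weighted error than a set that looks more intuitive.

```python
# Toy illustration only: NOT SpamAssassin's actual optimizer or corpus.
THRESHOLD = 5.0   # spam threshold
FP_WEIGHT = 100   # assumed cost of one false positive relative to one false negative

# Each message is (is_spam, rules it hit). RULE_A/B/C are hypothetical rules.
CORPUS = [
    (True,  {"BAYES_99", "RULE_A", "RULE_B"}),
    (True,  {"BAYES_99", "RULE_A"}),
    (True,  {"BAYES_80", "RULE_C"}),   # evasive spam: few other hits
    (False, {"BAYES_99", "RULE_B"}),   # ham that Bayes got wrong
    (False, {"RULE_B"}),
]

def weighted_error(scores, corpus=CORPUS, threshold=THRESHOLD, fp_weight=FP_WEIGHT):
    """Return fp_weight * FP + FN for one set of rule scores."""
    fp = fn = 0
    for is_spam, hits in corpus:
        total = sum(scores.get(rule, 0.0) for rule in hits)
        flagged = total >= threshold
        if flagged and not is_spam:
            fp += 1
        elif not flagged and is_spam:
            fn += 1
    return fp_weight * fp + fn

# Scores shaped like the defaults in this thread (BAYES_80 above BAYES_99).
TUNED = {"BAYES_80": 2.1, "BAYES_99": 1.9,
         "RULE_A": 3.2, "RULE_B": 1.5, "RULE_C": 3.0}

# "Intuitive" scores where a higher Bayes probability always scores higher.
LINEAR = dict(TUNED, BAYES_80=1.0, BAYES_99=4.0)

print(weighted_error(TUNED))
print(weighted_error(LINEAR))
```

With this made-up corpus the intuitive scores flag the one ham message that hit BAYES_99 and miss the evasive BAYES_80 spam, so they come out worse than the odd-looking ones.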
Re: I don't understand the Bayes scoring logic
Posted by Bob Proulx <bo...@proulx.com>.
Nigel Wilkinson wrote:
> Why does a 99-100% probability score less than an 80-95% probability???
Because the Bayes engine is not the only factor in classifying a
message as spam. All of the other rules are factored in too. A
message with a 99-100% probability will usually trigger many of the
other SA rules, and the total is enough to push the message over the
5 point threshold. The scoring program therefore did not need to make
the BAYES_99 score any higher than it did. I also believe the SA
development team holds that no single rule's score should be too
large, since that can lead to false positives. It is better to be
conservative and avoid false positives for the masses.
However, *I* don't like seeing the same spam again and again. With
the default values I would see a spam, train for it, and still see the
same spam again and again because it would only score BAYES_99 and be
below the threshold. Often this is before it is reported and before
network tests such as RBLs and SURBL can tag the sender. So I
increase the BAYES_95 and BAYES_99 scores to 4.0 and 5.0 for my own
personal use. That way if the same spam comes through again, as I
know it will, it will get tagged. But I can't say with any authority
that this won't generate false positives. I can only say that I have
only myself to blame in that case and also that since I know what it
is doing I won't be surprised by it.
Bob
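The arithmetic behind that choice can be sketched in a few lines of Python. This is only an illustration, assuming a repeat spam that hits BAYES_99 and nothing else; the 1.9 default is the value from the report quoted at the top of the thread, and the function name is made up.

```python
THRESHOLD = 5.0  # the 5 point threshold mentioned above

# Default BAYES_99 score as shown in the quoted report.
DEFAULT_SCORES = {"BAYES_99": 1.9}
# The raised personal values described in the post above.
RAISED_SCORES = {"BAYES_95": 4.0, "BAYES_99": 5.0}

def is_tagged(hits, scores, threshold=THRESHOLD):
    """True if the summed scores of the rules hit reach the threshold."""
    return sum(scores.get(rule, 0.0) for rule in hits) >= threshold

repeat_spam = ["BAYES_99"]  # hypothetical message that hits only BAYES_99
print(is_tagged(repeat_spam, DEFAULT_SCORES))  # False: 1.9 is below 5.0
print(is_tagged(repeat_spam, RAISED_SCORES))   # True: 5.0 reaches 5.0
```

The same arithmetic is why the raised values carry a false positive risk: any ham that Bayes puts in the 99-100% bucket would be tagged on that one rule alone.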