Posted to users@spamassassin.apache.org by Bart Schaefer <ba...@gmail.com> on 2004/09/16 22:33:12 UTC

Speak to me of Bayes and scoring in SA 3.0

Pointers to archived discussion - or, better, some kind of rationale
in the SA documentation - would be fine.  I haven't been able to
follow the developers list closely for quite some time.

What I'm curious about is why the BAYES_* rules are fed through the
scoring algorithm along with everything else.  (I understand why the
scores end up being what they are after the algorithm finishes.)

Feeding the Bayes rules through the scoring algorithm seems to imply a
lack of trust in the accuracy of the classifier.  Perhaps this is a
side-effect of having very few negative-scoring rules that can lower
the score by looking for ham (the lack of which, I understand, is meant
to prevent deliberate spoofing).  Still, it would have made sense to me
to fix the score for, e.g., BAYES_99 at 4.95, BAYES_95 at 4.75,
BAYES_90 at 4.50, and so on, and then let the scoring algorithm fit the
rest of the rules around those.  Perhaps not so linear a mapping, but
you get the idea.
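
As a rough illustration of "fix the Bayes scores, then fit the rest
around them", here is a minimal sketch of a perceptron-style fit in
which some scores are pinned.  It is a toy, not SpamAssassin's actual
scoring code: the function fit_scores, the rule RULE_X, the learning
rate, and the two sample messages are all assumptions made up for the
example.  Only the 4.95 for BAYES_99 comes from the proposal above.

```python
# Sketch only: a toy perceptron-style fit, not SpamAssassin's real scorer.
THRESHOLD = 5.0  # a message is called spam at or above this total


def fit_scores(samples, scores, pinned, rate=0.1, epochs=50):
    """Adjust unpinned rule scores until the samples classify correctly.

    samples: list of (rules_hit, is_spam) pairs
    scores:  dict of rule name -> starting score
    pinned:  set of rule names whose scores must never change
    """
    scores = dict(scores)  # work on a copy, leave the caller's dict alone
    for _ in range(epochs):
        mistakes = 0
        for rules_hit, is_spam in samples:
            total = sum(scores[r] for r in rules_hit)
            if (total >= THRESHOLD) == is_spam:
                continue  # already classified correctly
            mistakes += 1
            step = rate if is_spam else -rate
            for r in rules_hit:
                if r not in pinned:  # pinned scores are left untouched
                    scores[r] += step
        if mistakes == 0:
            break
    return scores


# BAYES_99 is pinned at the proposed value; RULE_X is hypothetical.
start = {"BAYES_99": 4.95, "RULE_X": 0.0}
samples = [(["BAYES_99", "RULE_X"], True), (["RULE_X"], False)]
fitted = fit_scores(samples, start, pinned={"BAYES_99"})
```

In this toy run only RULE_X moves; BAYES_99 stays at 4.95 and the other
rule has to make up the difference to reach the threshold.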

So, why not?

Re: Speak to me of Bayes and scoring in SA 3.0

Posted by Daniel Quinlan <qu...@pathname.com>.
Bart Schaefer <ba...@gmail.com> writes:

> Rather than divide the score sets by with/without Bayes, have multiple
> score sets and use the Bayes probability to choose which score set to
> apply.  (I.e., there is no direct score for Bayes itself.)  A Bayes
> probability of, say, 0.45 - 0.55 would use the same score set as
> "without Bayes," on the assumption that in that range Bayes is unable
> to contribute to the decision.

That's not a bad idea to try!  Can you submit a bug for it?

Daniel

-- 
Daniel Quinlan                     ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)

Re: Speak to me of Bayes and scoring in SA 3.0

Posted by Bart Schaefer <ba...@gmail.com>.
On 16 Sep 2004 13:39:30 -0700, Daniel Quinlan <qu...@pathname.com> wrote:
> 
> I think we could use a better way to merge Bayesian results into the
> SpamAssassin score, though.

Hm.

An idea that just occurred to me, that would have been prohibitively
expensive with the GA but maybe isn't with the perceptron model:

Rather than divide the score sets by with/without Bayes, have multiple
score sets and use the Bayes probability to choose which score set to
apply.  (I.e., there is no direct score for Bayes itself.)  A Bayes
probability of, say, 0.45 - 0.55 would use the same score set as
"without Bayes," on the assumption that in that range Bayes is unable
to contribute to the decision.

My intuition, which may be wrong, would be that such an arrangement
would cause a big increase in the score values of a small number of
rules in the score sets for near-zero and near-one probability, though
not the same rules in each set.
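
The selection step described above could be sketched roughly as
follows.  This is only an illustration of the idea, not anything that
exists in SpamAssassin: the three score sets, the rule names RULE_A and
RULE_B, and every score value are invented, and a real version might
well use more than two non-neutral sets.  Only the 0.45 - 0.55 neutral
band comes from the text.

```python
# Sketch only: hypothetical score sets chosen by Bayes probability.
# Rule names and values are invented for illustration.
NO_BAYES = {"RULE_A": 1.5, "RULE_B": 2.0}  # also used in the neutral band
HAMMY = {"RULE_A": 0.5, "RULE_B": 3.5}     # Bayes says "probably ham"
SPAMMY = {"RULE_A": 3.0, "RULE_B": 2.5}    # Bayes says "probably spam"

NEUTRAL_LOW, NEUTRAL_HIGH = 0.45, 0.55


def choose_score_set(bayes_prob):
    """Pick a score set; Bayes itself contributes no score of its own."""
    if bayes_prob is None or NEUTRAL_LOW <= bayes_prob <= NEUTRAL_HIGH:
        return NO_BAYES
    return HAMMY if bayes_prob < NEUTRAL_LOW else SPAMMY


def message_score(rules_hit, bayes_prob=None):
    """Sum the scores of the rules that hit, using the chosen set."""
    scores = choose_score_set(bayes_prob)
    return sum(scores[rule] for rule in rules_hit)
```

Each set would have to be fitted separately on the messages whose Bayes
probability falls in its range, which is where the extra training cost
comes from.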

Re: Speak to me of Bayes and scoring in SA 3.0

Posted by Daniel Quinlan <qu...@pathname.com>.
snowjack@fastmail.fm writes:

> I thought so too...

Well, I don't think the scores are the problem -- they are pretty much
as good as they can get given the training data.  What I mean is the
entire method of putting Bayes probabilities into ranges and scoring
those ranges.


Re: Speak to me of Bayes and scoring in SA 3.0

Posted by sn...@fastmail.fm.
On 16 Sep 2004 13:39:30 -0700, "Daniel Quinlan" <qu...@pathname.com>
said:
> Bart Schaefer <ba...@gmail.com> writes:
> 
> > Feeding the Bayes rules through the scoring algorithm seems to imply a
> > lack of trust in the accuracy of the classifier.
> 
> Mostly not.  It's needed to map from the 0 to 1.0 "probability" to the
> SpamAssassin threshold-based scoring method.  Even in more pure Bayesian
> systems, users still have to figure out where to put stuff into the spam
> bucket and it's often not at 0.50.  Our technique avoids the problem of
> people having two different calibrations.  Plus, there's the lack of
> trust thing, but that's a lesser factor.
> 
> I think we could use a better way to merge Bayesian results into the
> SpamAssassin score, though.

I thought so too... I added the following to my local.cf based on Bayes
scores of spam we receive. Spammers are really trying hard to make their
spams look hammy, but regular users are (hopefully) not trying to make
their hams look spammy. So I weighted the scores in that direction since
my Bayes engine seems much more likely to give my ham a very low score
than to give my spam a very high score. Spammers can fairly easily get
their Bayes scores down to about 50% probability, but it's much more
difficult to get them down below 40% probability since they would have
to know your particular organization's 'hammy' tokens (which would not
remain hammy for long if you're training regularly).

score BAYES_00 -4.9
score BAYES_01 -2.1
score BAYES_10 -1.5
score BAYES_20 -1.0
score BAYES_30 -0.5
score BAYES_40 0.1
score BAYES_44 0.7
score BAYES_50 1.0
score BAYES_56 1.5
score BAYES_60 2.1
score BAYES_70 3.1
score BAYES_80 4.2
score BAYES_90 4.9
score BAYES_99 5.4
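
To see what the lines above do to a given Bayes probability, here is a
small sketch that looks up the matching rule and its score.  The scores
are copied from the list above.  The range boundaries are an
assumption: each rule's lower bound is simply read off the two digits
in its name (BAYES_44 starts at 0.44, and so on), which may not match
the exact ranges in SpamAssassin's own rule files.

```python
from bisect import bisect_right

# Scores copied from the local.cf lines above.
SCORES = {
    "BAYES_00": -4.9, "BAYES_01": -2.1, "BAYES_10": -1.5, "BAYES_20": -1.0,
    "BAYES_30": -0.5, "BAYES_40": 0.1, "BAYES_44": 0.7, "BAYES_50": 1.0,
    "BAYES_56": 1.5, "BAYES_60": 2.1, "BAYES_70": 3.1, "BAYES_80": 4.2,
    "BAYES_90": 4.9, "BAYES_99": 5.4,
}

# ASSUMPTION: lower bound of each range = the digits in the rule name / 100.
BUCKETS = sorted((int(name[-2:]) / 100, name) for name in SCORES)
LOWER_BOUNDS = [low for low, _ in BUCKETS]


def bayes_rule(prob):
    """Return the name of the BAYES_* rule whose range contains prob."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return BUCKETS[bisect_right(LOWER_BOUNDS, prob) - 1][1]


def bayes_score(prob):
    """Return the score the matching rule adds to the message total."""
    return SCORES[bayes_rule(prob)]
```

For example, bayes_rule(0.45) picks BAYES_44, so that message gets 0.7
added to its total from Bayes.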

-- 
  
  snowjack@fastmail.fm


Re: Speak to me of Bayes and scoring in SA 3.0

Posted by Daniel Quinlan <qu...@pathname.com>.
Bart Schaefer <ba...@gmail.com> writes:

> Feeding the Bayes rules through the scoring algorithm seems to imply a
> lack of trust in the accuracy of the classifier.

Mostly not.  It's needed to map from the 0 to 1.0 "probability" to the
SpamAssassin threshold-based scoring method.  Even in more pure Bayesian
systems, users still have to figure out where to put stuff into the spam
bucket and it's often not at 0.50.  Our technique avoids the problem of
people having two different calibrations.  Plus, there's the lack of
trust thing, but that's a lesser factor.

I think we could use a better way to merge Bayesian results into the
SpamAssassin score, though.

Daniel
