Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/04/16 00:52:41 UTC

Re: Another Bayes tweak

Sidney Markowitz writes:
> I want to see what people think of this before I put something in Bugzilla.
> 
> There are two Bugzilla entries having to do with rule order and 
> short-circuiting, http://bugzilla.spamassassin.org/show_bug.cgi?id=2912 
> which is closed and 
> http://bugzilla.spamassassin.org/show_bug.cgi?id=3109 which is still open.
> 
> They led me to think about Bayes processing as a special case because it 
> is so expensive. I base that statement on sonic.net's experience of 
> having difficulty deploying the latest SpamAssassin because of the I/O 
> requirements of Bayes processing. The recent optimizations help, but I'm 
> not sure if they are enough.
> 
> If Bayes were done last, as per bug #2912, or we had a short-circuit 
> mechanism as in 3109, Bayes calculations could be skipped whenever the 
> score exceeded some positive or negative threshold.
> 
> A very conservative approach would be to make the threshold limits be 
> (required_score - BAYES_99) to (required_score - BAYES_00) which means 
> skip Bayes processing whenever it cannot possibly make a difference. If 
> there is enough high scoring spam and low scoring ham in the mail 
> stream, then this would save a lot of processing load.
> 
> Since the Bayes score is not used in deciding when something should be 
> autolearned, the problems of short-circuiting and autolearning are not a 
> factor. Does that mean that we should use a special mechanism for Bayes 
> that is simpler than whatever we eventually do for short-circuiting?
> 
> Does this make sense to people, or should we just dedicate ourselves to 
> making sure that Bayes processing is so efficient that there will be no 
> need to treat it as a special case?

Hmm.  The question is, would it have a big effect?

There are very few negative rules -- Bayes is pretty much the only big
hitter there -- so for ham, it would have no effect.  It may reduce load
if the mail stream is mostly spam, though.
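
A minimal sketch of the conservative window described in the quoted
proposal; this is not SpamAssassin code. It assumes a message counts as
spam when its total score is >= required_score, and that Bayes can add at
most the BAYES_99 score and at least the BAYES_00 score. The function
name and the example scores (5.0, 5.5, -5.0) are illustrative
assumptions, not values taken from any ruleset.

```python
def bayes_can_matter(score_without_bayes, required_score, bayes_99, bayes_00):
    """Return True if some Bayes result could still change the spam/ham verdict.

    Hypothetical helper: bayes_99 is the (positive) score of the strongest
    spam-side Bayes rule, bayes_00 the (negative) score of the strongest
    ham-side one.
    """
    # Below this, even BAYES_99 cannot lift the total to required_score.
    lower = required_score - bayes_99
    # At or above this, even BAYES_00 cannot pull the total under required_score.
    upper = required_score - bayes_00
    return lower <= score_without_bayes < upper


# Illustrative values only (assumed, not from the thread's rulesets).
REQUIRED, B99, B00 = 5.0, 5.5, -5.0

for score in (-2.0, 1.0, 7.5, 12.0):
    verdict = "run Bayes" if bayes_can_matter(score, REQUIRED, B99, B00) else "skip Bayes"
    print(f"{score:5.1f}: {verdict}")
```

With few negative rules, a ham message rarely falls below the lower
bound before Bayes runs, so in practice almost all of the skipping would
happen on the high-scoring spam side.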

--j.


Re: Another Bayes tweak

Posted by Sidney Markowitz <si...@sidney.com>.
Justin Mason wrote:
> It may reduce load
> if the mail stream is mostly spam, though.

Here's a data point I got from sonic.net:

 > from a sample of 445278 messages, 94760 of them fit your >= 10
 > || <= -0.5 criteria. So it looks like we'd get ~25% speedup from
 > that rule alone.
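
For reference, the arithmetic behind that data point can be checked with
a short sketch. The two counts come from the quoted sample; the
bayes_share parameter (the fraction of per-message cost spent in Bayes)
is a hypothetical input that the thread does not give.

```python
def skip_fraction(skippable, total):
    """Fraction of messages for which Bayes could be skipped."""
    return skippable / total


def cost_saved(fraction_skipped, bayes_share):
    """Fraction of total scan cost saved, assuming Bayes accounts for
    bayes_share of the per-message cost (hypothetical, not from the thread)."""
    return fraction_skipped * bayes_share


frac = skip_fraction(94760, 445278)
print(f"messages that could skip Bayes: {frac:.1%}")

# Upper bound on the saving: the case where Bayes were the entire cost.
print(f"saving if Bayes were 100% of the cost: {cost_saved(frac, 1.0):.1%}")
```

The sample works out to a little over a fifth of the messages, so the
overall saving would be at most that fraction of total scan cost and
smaller in proportion to whatever share of the work is not Bayes.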

  -- sidney


Re: Another Bayes tweak

Posted by Sidney Markowitz <si...@sidney.com>.
Justin Mason wrote:
> for ham, it would have no effect

I agree. I was just being complete by specifying a lower threshold as 
well as an upper one. There is no negative effect even if it doesn't 
help by a significant amount.

> It may reduce load if the
> mail stream is mostly spam, though.

Unfortunately, that's most mail streams nowadays :-(

I asked Kelsey if he could estimate what percentage of sonic.net's mail 
stream would score 10 or higher without Bayes, and whether that 
percentage is high enough to make the difference in being able to 
deploy Bayes.

  --sidney