Posted to dev@spamassassin.apache.org by Sidney Markowitz <si...@sidney.com> on 2004/04/16 00:31:30 UTC

Another Bayes tweak

I want to see what people think of this before I put something in Bugzilla.

There are two Bugzilla entries having to do with rule order and 
short-circuiting, http://bugzilla.spamassassin.org/show_bug.cgi?id=2912 
which is closed and 
http://bugzilla.spamassassin.org/show_bug.cgi?id=3109 which is still open.

They led me to think about Bayes processing as a special case because it 
is so expensive. I base that statement on sonic.net's experience of 
having difficulty deploying the latest SpamAssassin because of the I/O 
requirements of Bayes processing. The recent optimizations help, but I'm 
not sure if they are enough.

If Bayes were done last, as per bug #2912, or we had a short-circuit 
mechanism as in bug #3109, Bayes calculations could be skipped whenever the 
score from the other rules went beyond some positive or negative threshold.

A very conservative approach would be to set the threshold limits at 
(required_score - BAYES_99) and (required_score - BAYES_00), which means 
skipping Bayes processing whenever the score from the other rules falls 
outside that range and Bayes cannot possibly make a difference. If there 
is enough high scoring spam and low scoring ham in the mail stream, then 
this would save a lot of processing load.
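
A minimal sketch of that conservative check, in Python for illustration only. The function name can_skip_bayes and the numbers in the usage note are hypothetical, not SpamAssassin's API or its real scores. It assumes every non-Bayes rule has already run, that BAYES_00 is the most negative Bayes score and BAYES_99 the most positive, and that a message counts as spam when its total is at least required_score.

```python
def can_skip_bayes(score_without_bayes, required_score, bayes_00, bayes_99):
    """Return True when no possible Bayes score could change the verdict.

    bayes_00 is assumed to be the lowest (negative) Bayes rule score and
    bayes_99 the highest (positive) one; both names follow the thread.
    """
    # Below this, even BAYES_99 cannot lift the total up to required_score.
    lower = required_score - bayes_99
    # At or above this, even BAYES_00 cannot pull the total below required_score.
    upper = required_score - bayes_00
    return score_without_bayes < lower or score_without_bayes >= upper
```

With a required score of 5.0 and illustrative Bayes scores of -2.5 and 3.5, a message scoring 0.5 or 8.0 on the other rules would skip the Bayes lookup, while one scoring 3.0 would still need it.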

Since the Bayes score is not used in deciding when something should be 
autolearned, the problems of short-circuiting and autolearning are not a 
factor. Does that mean that we should use a special mechanism for Bayes 
that is simpler than whatever we eventually do for short-circuiting?

Does this make sense to people, or should we just dedicate ourselves to 
making sure that Bayes processing is so efficient that there will be no 
need to treat it as a special case?

  --sidney


Re: Another Bayes tweak

Posted by Daniel Quinlan <qu...@pathname.com>.
Eric Kolve <ek...@classmates.com> writes:

> What do you think of associating a cost and benefit score to each rule
> and then you would just iterate over all the rules in order of
> greatest benefit for least cost until you hit the spam threshold?
> This may be a bit extreme since you would have to do quite a bit of work
> tagging all the rules, but should provide a nice optimization.

It's possible and might be reasonable if we did the sorting/ordering at
start-time instead of per-message.  Negative rules would still have to
go first and always be run, and you'd want it to be pre-Bayes and
pre-network-result-harvesting for performance reasons.  A decision tree
would share a lot of the same (predicted) benefits.
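
A rough sketch of that ordering, in Python for illustration only. The Rule class, the order_rules and evaluate functions, and the per-rule cost field are all hypothetical; nothing here is SpamAssassin code. It assumes each rule has been tagged with a positive cost, sorts once at start-time, and always runs the negative rules before any early exit is allowed. The pre-Bayes and pre-network staging is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    name: str
    score: float                  # points added when the rule hits
    cost: float                   # assumed relative cost, must be > 0
    check: Callable[[str], bool]  # returns True when the rule hits


def order_rules(rules: List[Rule]) -> List[Rule]:
    """Run once at start-time: negative rules first, then best benefit per cost."""
    negative = [r for r in rules if r.score < 0]
    positive = sorted(
        (r for r in rules if r.score >= 0),
        key=lambda r: r.score / r.cost,
        reverse=True,
    )
    return negative + positive


def evaluate(message: str, ordered: List[Rule],
             required_score: float) -> Tuple[float, List[str]]:
    """Run rules in order, stopping once the spam threshold is reached."""
    total = 0.0
    ran = []
    for rule in ordered:
        # Only positive rules remain after the negative block, so the
        # total can no longer drop and it is safe to stop here.
        if rule.score >= 0 and total >= required_score:
            break
        ran.append(rule.name)
        if rule.check(message):
            total += rule.score
    return total, ran
```

Because every negative rule has already run by the time the threshold test applies, stopping early can skip only rules that would have raised the score further.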

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

Re: Another Bayes tweak

Posted by Eric Kolve <ek...@classmates.com>.
On Thu, Apr 15, 2004 at 04:26:30PM -0700, Daniel Quinlan wrote:
> Sidney Markowitz <si...@sidney.com> writes:
> 
> > Does this make sense to people, or should we just dedicate ourselves
> > to making sure that Bayes processing is so efficient that there will
> > be no need to treat it as a special case?
> 
> There are other slow rules.  Language guessing, for example.
> 
> I'd rather devote time to:
> 
>  - making code generally more efficient
>  - ways to make message checks more efficient in general (early exit is
>    one option if it actually speeds things up)
> 
> I just had an interesting idea of how to make checks much faster.  What
> if we did a decision tree, but only to determine whether or not all rules
> would be evaluated?
> 
>   [DECISION TREE] -> definitely spam OR maybe spam
> 
>     (there is no "maybe ham" or "ham" output from the tree, so no free
>     pass if a spammer figures out a safe path through the tree)
> 
>   if maybe spam, then
> 
>     [PERCEPTRON] -> spam or ham
> 
>   if definitely spam, then
> 
>     no more work to do
> 

What do you think of associating a cost and benefit score
to each rule and then you would just iterate over all the rules
in order of greatest benefit for least cost until you hit the
spam threshold?  This may be a bit extreme since you would have to
do quite a bit of work tagging all the rules, but should provide a nice
optimization.

--eric



> Daniel
> 
> -- 
> Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
> http://www.pathname.com/~quinlan/    and open source consulting

Re: Another Bayes tweak

Posted by Kelsey Cummings <kg...@sonic.net>.
On Fri, Apr 16, 2004 at 12:05:32PM +1200, Sidney Markowitz wrote:
> Daniel Quinlan wrote:
> >There are other slow rules. Language guessing, for example.
> 
> Yes, language guessing is slow and I've yet to find a way to speed it up 
> after the first optimization I did on it. It's a Bayesian calculation 
> much like the Bayes classifier. But are there any others? Also, do we 
> get good enough results from the language classifier for it to be 
> worthwhile?
> 
> I'm under the impression that Bayes is in a class by itself in terms of 
> how slow and how useful it is.

It's the only rule that requires significant IO.  That certainly puts it in
a different class.  Saving CPU cycles is worth the effort but we'll see
a lot more return from profiled IO.  In terms of Bayes SQL we'll keep on
working with Michael and Sidney on performance enhancements.  The latest
SQL code is a huge improvement over the last but I think there is still
room for improvement.

For what it's worth, tuned InnoDB tables seem to be faster than MyISAM.

-- 
Kelsey Cummings - kgc@sonic.net           sonic.net, inc.
System Administrator                      2260 Apollo Way
707.522.1000 (Voice)                      Santa Rosa, CA 95407
707.547.2199 (Fax)                        http://www.sonic.net/
Fingerprint = D5F9 667F 5D32 7347 0B79  8DB7 2B42 86B6 4E2C 3896

Re: Another Bayes tweak

Posted by Sidney Markowitz <si...@sidney.com>.
Daniel Quinlan wrote:
> There are other slow rules. Language guessing, for example.

Yes, language guessing is slow and I've yet to find a way to speed it up 
after the first optimization I did on it. It's a Bayesian calculation 
much like the Bayes classifier. But are there any others? Also, do we 
get good enough results from the language classifier for it to be 
worthwhile?

I'm under the impression that Bayes is in a class by itself in terms of 
how slow and how useful it is.

> ways to make message checks more efficient in general (early exit is
> one option if it actually speeds things up)

Well, if we get early exit, that would take care of this, if Bayes were 
scheduled last.

That does bring to mind a complication with early exit and Bayes: 
Doesn't early exit imply running all negative rules first? But Bayes has 
both positive and negative rules coming out of the calculation. So 
something would have to be done to treat Bayes as a special case.

> if we did a decision tree, but only to determine whether or not
> all rules would be evaluated?

I don't understand how this would work. Rules mostly don't depend on 
each other, so how do you form the decision tree other than evaluating 
the highest score rules first and bailing out when subsequent scores 
can't make a difference in results, i.e., short-circuiting?

  -- sidney


Re: Another Bayes tweak

Posted by Daniel Quinlan <qu...@pathname.com>.
Sidney Markowitz <si...@sidney.com> writes:

> Does this make sense to people, or should we just dedicate ourselves
> to making sure that Bayes processing is so efficient that there will
> be no need to treat it as a special case?

There are other slow rules.  Language guessing, for example.

I'd rather devote time to:

 - making code generally more efficient
 - ways to make message checks more efficient in general (early exit is
   one option if it actually speeds things up)

I just had an interesting idea of how to make checks much faster.  What
if we did a decision tree, but only to determine whether or not all rules
would be evaluated?

  [DECISION TREE] -> definitely spam OR maybe spam

    (there is no "maybe ham" or "ham" output from the tree, so no free
    pass if a spammer figures out a safe path through the tree)

  if maybe spam, then

    [PERCEPTRON] -> spam or ham

  if definitely spam, then

    no more work to do
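
A minimal sketch of that two-stage flow, in Python for illustration only. The function and parameter names are hypothetical. The tree is modelled as any cheap callable that can answer only "definitely spam" or "maybe spam", and the perceptron stage is assumed to be a weighted sum of rule hits compared against required_score.

```python
from typing import Callable, Dict


def classify(message: str,
             tree_says_definitely_spam: Callable[[str], bool],
             rule_hits: Callable[[str], Dict[str, bool]],
             weights: Dict[str, float],
             required_score: float) -> str:
    # Stage 1: the tree may only short-circuit toward spam; there is no
    # "ham" exit, so a safe path through the tree earns no free pass.
    if tree_says_definitely_spam(message):
        return "spam"
    # Stage 2 ("maybe spam"): evaluate all rules and sum the weights of
    # the ones that hit, as a stand-in for the perceptron-scored rules.
    hits = rule_hits(message)
    total = sum(weights[name] for name, hit in hits.items() if hit)
    return "spam" if total >= required_score else "ham"
```

Only messages the tree cannot settle pay for the full rule evaluation; everything the tree flags as definitely spam skips it.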

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting