Posted to dev@spamassassin.apache.org by Dale Luck <dl...@barracudanetworks.com> on 2005/08/30 15:29:20 UTC

body rule speed

In my study of where SA spends most of its time, it quickly became apparent that do_body_tests is by far the largest CPU hog.
Indeed, I've seen a single rules file (sare_fraud) use up half of the CPU cycles for every spam scan.
 
I was wondering whether anyone has investigated turning the algorithm used to apply the rules to the body inside out.
 
Instead of the present process:
 
loop on all rules
{
     loop on all lines in msg body
     {
           apply regex.
           if hit, skip to next rule.
     }
}
 
In the above process, the use of Perl's 'study' command, as well as the case-insensitive regexes, causes memory to be allocated over and over again.
 
What if the process was changed to:
 
using_rule{all rules} = rule_score; # initializes decision point for list of rules
 
loop on all lines in msg body
{
     study line
     loop on all rules
     {
           if using_rule{rule} apply regex;
           if hit, using_rule{rule} = 0;
     }
}
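
The two loop orders above can be sketched side by side. This is a minimal illustration in Python rather than SpamAssassin's Perl; the rule names, patterns, and body lines are hypothetical, and the sketch only checks that both orderings report the same hits. It does not measure speed.

```python
import re


def rules_outer(rules, lines):
    """Present process: loop over rules, then lines; stop a rule at its first hit."""
    hits = set()
    for name, pattern in rules.items():
        for line in lines:
            if pattern.search(line):
                hits.add(name)
                break  # skip to the next rule
    return hits


def lines_outer(rules, lines):
    """Proposed process: loop over lines, then over the rules still pending."""
    pending = dict(rules)  # plays the role of using_rule{...}
    hits = set()
    for line in lines:
        for name in list(pending):  # copy the keys so we can delete while looping
            if pending[name].search(line):
                hits.add(name)
                del pending[name]  # same effect as using_rule{rule} = 0
    return hits


# Hypothetical rules and message body, for illustration only.
RULES = {
    "MENTIONS_LOTTERY": re.compile(r"lottery", re.IGNORECASE),
    "MENTIONS_TRANSFER": re.compile(r"wire transfer"),
    "MENTIONS_PRINCE": re.compile(r"\bprince\b"),
}
BODY = [
    "Dear friend,",
    "You have won the LOTTERY.",
    "Please arrange a wire transfer today.",
]

if __name__ == "__main__":
    print(sorted(rules_outer(RULES, BODY)))
    print(sorted(lines_outer(RULES, BODY)))
```

Any real comparison would need timing against an actual rule set such as sare_fraud, since the cost of study and of per-rule bookkeeping is not modeled here.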
 
The issue of speeding up /i regexes could be handled by identifying them in the parse phase and giving them a separate case_insensitive_body_test, which would remove the /i and lc the rule before compiling it. Then, prior to applying the regexes in the inner loop, it would lc the body text. Since the lc of the body text would be done only once for all the rules (instead of once for each rule), this would also speed up processing.
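
A rough sketch of that split, again in Python rather than Perl. The helper names (lowercase_pattern, compile_rules, scan) and the sample rules are assumptions made for illustration, not existing SpamAssassin functions. One caveat the sketch tries to respect: blindly lowercasing a pattern would turn escapes such as \S into \s, so the character after a backslash is left alone. Other constructs (hex escapes, Unicode case folding) are not handled.

```python
import re


def lowercase_pattern(pattern):
    # Lowercase literal characters only. The character following a backslash
    # is copied unchanged so that escapes such as \S, \W or \D keep their meaning.
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(ch.lower())
            i += 1
    return "".join(out)


def compile_rules(rules):
    # rules maps a rule name to (pattern, case_insensitive_flag).
    sensitive, insensitive = {}, {}
    for name, (pattern, case_insensitive) in rules.items():
        if case_insensitive:
            # Drop the /i behaviour: compile a lowercased pattern instead.
            insensitive[name] = re.compile(lowercase_pattern(pattern))
        else:
            sensitive[name] = re.compile(pattern)
    return sensitive, insensitive


def scan(rules, lines):
    sensitive, insensitive = compile_rules(rules)
    lowered = [line.lower() for line in lines]  # done once for all rules
    hits = set()
    for name, rx in sensitive.items():
        if any(rx.search(line) for line in lines):
            hits.add(name)
    for name, rx in insensitive.items():
        if any(rx.search(line) for line in lowered):
            hits.add(name)
    return hits


# Hypothetical rules: one that ignores case, one that depends on capitals.
SAMPLE_RULES = {
    "LOTTERY_ANYCASE": (r"Lottery\s+Winner", True),
    "SHOUTING_FREE": (r"FREE", False),
}

if __name__ == "__main__":
    print(scan(SAMPLE_RULES, ["You are a LOTTERY   winner", "get it free"]))
```

Rules that depend on capitalization stay in their own group and still see the original text, so only the rules that were marked /i run against the lowercased copy.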
 
Before I code this up and try it out, I was wondering whether anyone else has already gone down this path and determined it was fruitless.
 
dale luck

Re: body rule speed

Posted by Duncan Findlay <du...@debian.org>.
On Tue, Aug 30, 2005 at 06:29:20AM -0700, Dale Luck wrote:

> In my study of where SA is spending most of its time, it became
> quickly apparent the do_body_tests is by far the largest cpu hog.
> Indeed i've seen just a single file (sare_fraud) can use up half of
> the cpu cycles for every spam scan.
>  
> I was wondering if anyone investigating flipping inside out the
> algorithm used to apply the rules to the body.

I believe I tried to look at this one time, but it got pretty messy to
hack that in and I didn't have enough time to spend on it. Any speedup
seemed to be minimal, but it might be worth looking into in greater
detail. Also, I'm not convinced study helps a whole lot. Having said
that, some of our regular expressions could probably be tuned better
so that study helps more.

The case insensitive thing can be a very large speedup; however, we do
have many tests that rely on capitalization. We'd need a way of
splitting them up or something, since we definitely need some
case-sensitive rules.

-- 
Duncan Findlay

Re: body rule speed

Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Aug 30, 2005 at 06:29:20AM -0700, Dale Luck wrote:
> Instead of the present process:
>  
> loop on all rules
> {
>      loop on all lines in msg body
>      {
>            apply regex.
>            if hit, skip to next rule.
>      }
> }
>  
> loop on all lines in msg body
> {
>      study line
>      loop on all rules
>      {
>            if using_rule{rule} apply regex;
>            if hit, using_rule{rule} = 0;
>      }
> }

Ah, you mean the original way. ;)   In short, it was found to be faster
to do the first one, since we don't have to run the same rule over and
over again.  We exit out on the first hit per rule which was found to
be a huge speed increase.
