Posted to users@spamassassin.apache.org by Rocky Olsen <ro...@mindphone.org> on 2005/03/31 05:34:05 UTC

Rule Design Benchmark/Resource Question

Before i pull my hair out doing bench/resource test, i was wondering if
anyone out there knew if there was much of a speed/resource usage
difference between the following ways of writing the same rule.


Method A:
body	rule_a		/(?:feh|meh|bleh)/i

vs.

Method B:

bod		__rule_a	/(?:feh)/i
body	__rule_b	/(?:meh)/i
body	__rule_c	/(?:bleh)/i

meta	rule_d		(__rule_a || __rule_b || __rule_c)


There probably isn't much difference using just 3 rules, but i'm thinking
more along the lines of large(500+) lists and it isn't limited to just body
stuff.  So if anyone has some realworld benching/experience with what is
preferred or if the developers know which is faster for SA, i would love
the input.

-Rocky
-- 
______________________________________________________________________


what's with today, today?

Email:	rocky@mindphone.org
PGP:	http://rocky.mindphone.org/rocky_mindphone.org.gpg

Re: Rule Design Benchmark/Resource Question

Posted by Rocky Olsen <ro...@mindphone.org>.

Thanks

On Thu, Mar 31, 2005 at 05:16:25PM -0500, Matt Kettler wrote:
> Rocky Olsen wrote:
> 
> >Before i pull my hair out doing bench/resource test, i was wondering if
> >anyone out there knew if there was much of a speed/resource usage
> >difference between the following ways of writing the same rule.
> >
> >
> >Method A:
> >body	rule_a		/(?:feh|meh|bleh)/i
> >
> >vs.
> >
> >Method B:
> >
> >bod		__rule_a	/(?:feh)/i
> >body	__rule_b	/(?:meh)/i
> >body	__rule_c	/(?:bleh)/i
> >
> >meta	rule_d		(__rule_a || __rule_b || __rule_c)
> >
> >
> >There probably isn't much difference using just 3 rules, but i'm thinking
> >more along the lines of large(500+) lists and it isn't limited to just body
> >stuff.  So if anyone has some realworld benching/experience with what is
> >preferred or if the developers know which is faster for SA, i would love
> >the input.
> >  
> >
> 
> To start with, use perl's regex debugger as your friend:
> 
> $ perl -Mre=debug -e  "/(?:feh|meh|bleh)/i"
> size 11 Got 92 bytes for offset annotations.
> 
> $ perl -Mre=debug -e  "/(?:feh)/i"
> Freeing REx: `","'
> Compiling REx `(?:feh)'
> size 3 Got 28 bytes for offset annotations.
> 
> (repeat 2 times)
> 
> However, this only deals with part of the story: the cost of the regex
> itself. It does not deal with the per-rule overhead in SA.
> 
> In general I'd favor the combined approach, unless for some reason your
> combined rule is considerably larger than the sum of its parts. Bigevil
> ran much better once Chris S did some combining and common subexpression
> elimination.
> 
> 
> 
> 
> Also, I'd suggest eliminating the (?:) for the single-text-matches. It
> does nothing of use, and doesn't change the evaluation of the regex any
> for a simple single text match. All it does is waste 4 bytes of disk
> space per rule.
> 
> body __RULE_A   /feh/i
> 
> instead of:
> body __RULE_A   /(?:feh)/i
> 
> I leave comparing the two using re=debug as an exercise for the student.
> Also compare to /(feh)/i and /(feh)\1/i to see how capturing and backreferences work.
> 
> 
> 
> 
> 
> 
> 

-- 
______________________________________________________________________


what's with today, today?

Email:	rocky@mindphone.org
PGP:	http://rocky.mindphone.org/rocky_mindphone.org.gpg

Re: Rule Design Benchmark/Resource Question

Posted by Matt Kettler <mk...@evi-inc.com>.

Rocky Olsen wrote:

>Before i pull my hair out doing bench/resource test, i was wondering if
>anyone out there knew if there was much of a speed/resource usage
>difference between the following ways of writing the same rule.
>
>
>Method A:
>body	rule_a		/(?:feh|meh|bleh)/i
>
>vs.
>
>Method B:
>
>bod		__rule_a	/(?:feh)/i
>body	__rule_b	/(?:meh)/i
>body	__rule_c	/(?:bleh)/i
>
>meta	rule_d		(__rule_a || __rule_b || __rule_c)
>
>
>There probably isn't much difference using just 3 rules, but i'm thinking
>more along the lines of large(500+) lists and it isn't limited to just body
>stuff.  So if anyone has some realworld benching/experience with what is
>preferred or if the developers know which is faster for SA, i would love
>the input.
>  
>

To start with, use perl's regex debugger as your friend:

$ perl -Mre=debug -e  "/(?:feh|meh|bleh)/i"
size 11 Got 92 bytes for offset annotations.

$ perl -Mre=debug -e  "/(?:feh)/i"
Freeing REx: `","'
Compiling REx `(?:feh)'
size 3 Got 28 bytes for offset annotations.

(repeat 2 times)

However, this only deals with part of the story: the cost of the regex
itself. It does not deal with the per-rule overhead in SA.

In general I'd favor the combined approach, unless for some reason your
combined rule is considerably larger than the sum of its parts. Bigevil
ran much better once Chris S did some combining and common subexpression
elimination.
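
For lists of the size Rocky mentions, both layouts can be generated from
one word list, so the same data can be fed to SA either way and timed
against real mail. What follows is a minimal sketch in Python, not anything
SA ships: the function names (combined_rule, split_rules, escape_word) and
the rule name RULE_D are made up for illustration, and Python's re.escape is
assumed to be good enough for plain-word lists (it targets Python's regex
syntax, not Perl's).

```python
import re


def escape_word(word):
    """Escape a literal word for use between /.../ delimiters.

    Assumption: re.escape output is acceptable to Perl for simple words.
    The '/' delimiter is escaped separately because re.escape leaves it alone.
    """
    return re.escape(word).replace("/", "\\/")


def combined_rule(name, words):
    """Method A: a single body rule holding one alternation."""
    alternation = "|".join(escape_word(w) for w in words)
    return ["body\t" + name + "\t/(?:" + alternation + ")/i"]


def split_rules(name, words):
    """Method B: one sub-rule per word, ORed together by a meta rule."""
    lines = []
    sub_names = []
    for index, word in enumerate(words):
        sub_name = "__" + name + "_" + str(index)
        sub_names.append(sub_name)
        lines.append("body\t" + sub_name + "\t/" + escape_word(word) + "/i")
    lines.append("meta\t" + name + "\t(" + " || ".join(sub_names) + ")")
    return lines


if __name__ == "__main__":
    words = ["feh", "meh", "bleh"]
    print("\n".join(combined_rule("RULE_D", words)))
    print()
    print("\n".join(split_rules("RULE_D", words)))
```

Writing each variant to its own .cf file and running the same corpus
through SA with one file loaded at a time gives a comparison that includes
the per-rule overhead, which re=debug alone cannot show.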

Also, I'd suggest eliminating the (?:) for the single-text-matches. It
does nothing of use, and doesn't change the evaluation of the regex any
for a simple single text match. All it does is waste 4 bytes of disk
space per rule.

body __RULE_A   /feh/i

instead of:
body __RULE_A   /(?:feh)/i

I leave comparing the two using re=debug as an exercise for the student.
Also compare to /(feh)/i and /(feh)\1/i to see how capturing and backreferences work.

Re: Rule Design Benchmark/Resource Question

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Rocky,

Wednesday, March 30, 2005, 7:34:05 PM, you wrote:

RO> Before i pull my hair out doing bench/resource test, i was wondering if
RO> anyone out there knew if there was much of a speed/resource usage
RO> difference between the following ways of writing the same rule.

RO> Method A:
RO> body	rule_a		/(?:feh|meh|bleh)/i
RO> vs.
RO> Method B:
RO> bod		__rule_a	/(?:feh)/i
RO> body	__rule_b	/(?:meh)/i
RO> body	__rule_c	/(?:bleh)/i
RO> meta	rule_d		(__rule_a || __rule_b || __rule_c)

Well, the bod rule won't work very well...    :-)

RO> There probably isn't much difference using just 3 rules, but i'm thinking
RO> more along the lines of large(500+) lists and it isn't limited to just body
RO> stuff.  So if anyone has some realworld benching/experience with what is
RO> preferred or if the developers know which is faster for SA, i would love
RO> the input.

SARE's experience with BigEvil and EvilNumbers is that for something as
small as feh|meh|bleh the two methods won't differ much, but if you end
up with dozens/hundreds of alternatives, you'll definitely want to use
combined regexes, along the lines of

body rule_ab /ab(?:ort|racadabra|raham)/
body rule_ac /ac(?:tion|cess|limate)/
body rule_x  /x(?:ylophone|avier)/

Pulling out the leading character(s) from long strings will have a
significant impact on the speed of the regex. The more you can do
that, the more you'll benefit.
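
Doing this factoring by hand gets tedious for long lists. One way to
automate it is to load the words into a trie and emit a regex from it, so
every shared leading run is written only once. This is a rough sketch in
Python under stated assumptions: build_trie, trie_to_regex and
factored_regex are hypothetical names, the words are assumed to be plain
literals, and the output nests more deeply than the hand-written rules
above (it factors "abra" out of abracadabra/abraham as well as "ab").

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie


def trie_to_regex(node):
    """Emit a regex fragment matching every word stored below this node."""
    if list(node) == [""]:
        # Only the end marker is left: nothing more to match.
        return ""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(key for key in node if key != "")
    ]
    if len(branches) == 1 and not ends_here:
        # A single continuation needs no grouping.
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a shorter word ends here, the rest is optional.
    return group + "?" if ends_here else group


def factored_regex(words):
    """Return one regex with common leading characters pulled out."""
    return trie_to_regex(build_trie(words))


if __name__ == "__main__":
    print(factored_regex(["abort", "abracadabra", "abraham"]))
    # ab(?:ort|ra(?:cadabra|ham))
    print(factored_regex(["xylophone", "xavier"]))
    # x(?:avier|ylophone)
```

The resulting string can be dropped between the slashes of a body rule.
Whether the deeper nesting helps or hurts in SA is something to measure
rather than assume.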

But 500+ strings will be expensive no matter what you do.

Bob Menschel