You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/07/11 14:02:31 UTC

body speedups using new features in perl 5.9.x

There's an interesting discussion on my weblog at
http://taint.org/2006/07/07/184022a.html , regarding the
recently-contributed "trie" and Aho-Corasick regexp optimizations, which
have been added to perl 5.9.x.

These could provide a way to get *serious* speedups in SpamAssassin.

If anyone is interested in working on speedups, this would be a really
great place to do it; body rules are by far the biggest CPU time sink in
our code.

Be sure to read the comments -- the author of the code goes into a lot of
detail there!

--j.

Re: body speedups using new features in perl 5.9.x

Posted by David Landgren <da...@landgren.net>.
Justin Mason wrote:
> There's an interesting discussion on my weblog at
> http://taint.org/2006/07/07/184022a.html , regarding the
> recently-contributed "trie" and Aho-Corasick regexp optimizations, which
> have been added to perl 5.9.x.
> 
> These could provide a way to get *serious* speedups in SpamAssassin.

Note that these work really well on fixed text. Once you start adding 
STAR, PLUS and CURLY the gain diminishes.

> If anyone is interested in working on speedups, this would be a really
> great place to do it; body rules are by far the biggest CPU time sink in
> our code.

One source of slowness is that a lot of SA pattern writers use 
(?=[abcd]) zwla's in the belief, I think, to supply more information to 
the engine, but even in older perls this was always a loss. You have to 
have a something like 95-99% of strings not matching /[abcd]/ for the 
benefit to kick in, at least that has been my experience.

I think the biggest problem is the fact that the scanning algorithm 
consists of applying one pattern after the other to see what it triggers.

I wrote Regexp::Assemble to allow one to take a large slab of patterns 
and concoct a single expression (also structured as a trie, 
incidentally) as a result. The benefit is that all the common paths are 
shared, so all the expensive curlies and stars are visited only one.

For instance, with /a+b+c+/ and /a+c+e+c+/ the result is 
/a+(c+e+|b+)c+/, so if you are give aaaafc, you don't have to try the 
first pattern, fail and then the second pattern and fail. As soon as the 
alternation in the middle of the resulting pattern above fails, there's 
no backtracking and so you stop. When you have three thousand patterns 
instead of two, this starts to become very interesting.

Regexp::Assemble also has a mode whereby you can build up a hash of 
pattern => coderef pairs, assemble all the patterns into a single 
pattern, and then feed that the body text.

This then means that your entire scan of the body just becomes a single 
//g pattern match, and each time something matches, you can go back to 
the dispatch table and call out the action according to what just 
matched. The more patterns you have, the more efficient it becomes.

I'd be happy to explain this in more detail if this sounds like 
something you want to explore.

David
-- 
Much of the propaganda that passes for news in our own society is given 
to immobilising and pacifying people and diverting them from the idea 
that they can confront power. -- John Pilger