You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/07/12 15:43:18 UTC

Re: body speedups using new features in perl 5.9.x

hi David -- 

While I doubt it'd have quite the performance gains that A-C can offer,
Regexp::Assemble certainly sounds like something worth trying...
the coderef trick, in particular, is very nifty.

--j.

David Landgren writes:
> Justin Mason wrote:
> > There's an interesting discussion on my weblog at
> > http://taint.org/2006/07/07/184022a.html , regarding the
> > recently-contributed "trie" and Aho-Corasick regexp optimizations, which
> > have been added to perl 5.9.x.
> > 
> > These could provide a way to get *serious* speedups in SpamAssassin.
> 
> Note that these work really well on fixed text. Once you start adding 
> STAR, PLUS and CURLY the gain diminishes.
> 
> > If anyone is interested in working on speedups, this would be a really
> > great place to do it; body rules are by far the biggest CPU time sink in
> > our code.
> 
> One source of slowness is that a lot of SA pattern writers use 
> (?=[abcd]) zwla's in the belief, I think, to supply more information to 
> the engine, but even in older perls this was always a loss. You have to 
> have a something like 95-99% of strings not matching /[abcd]/ for the 
> benefit to kick in, at least that has been my experience.
> 
> I think the biggest problem is the fact that the scanning algorithm 
> consists of applying one pattern after the other to see what it triggers.
> 
> I wrote Regexp::Assemble to allow one to take a large slab of patterns 
> and concoct a single expression (also structured as a trie, 
> incidentally) as a result. The benefit is that all the common paths are 
> shared, so all the expensive curlies and stars are visited only one.
> 
> For instance, with /a+b+c+/ and /a+c+e+c+/ the result is 
> /a+(c+e+|b+)c+/, so if you are give aaaafc, you don't have to try the 
> first pattern, fail and then the second pattern and fail. As soon as the 
> alternation in the middle of the resulting pattern above fails, there's 
> no backtracking and so you stop. When you have three thousand patterns 
> instead of two, this starts to become very interesting.
> 
> Regexp::Assemble also has a mode whereby you can build up a hash of 
> pattern => coderef pairs, assemble all the patterns into a single 
> pattern, and then feed that the body text.
> 
> This then means that your entire scan of the body just becomes a single 
> //g pattern match, and each time something matches, you can go back to 
> the dispatch table and call out the action according to what just 
> matched. The more patterns you have, the more efficient it becomes.
> 
> I'd be happy to explain this in more detail if this sounds like 
> something you want to explore.
> 
> David
> -- 
> Much of the propaganda that passes for news in our own society is given 
> to immobilising and pacifying people and diverting them from the idea 
> that they can confront power. -- John Pilger