You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/07/12 15:43:18 UTC
Re: body speedups using new features in perl 5.9.x
hi David --
While I doubt it'd have quite the performance gains that A-C can offer,
Regexp::Assemble certainly sounds like something worth trying...
the coderef trick, in particular, is very nifty.
--j.
David Landgren writes:
> Justin Mason wrote:
> > There's an interesting discussion on my weblog at
> > http://taint.org/2006/07/07/184022a.html , regarding the
> > recently-contributed "trie" and Aho-Corasick regexp optimizations, which
> > have been added to perl 5.9.x.
> >
> > These could provide a way to get *serious* speedups in SpamAssassin.
>
> Note that these work really well on fixed text. Once you start adding
> STAR, PLUS and CURLY the gain diminishes.
>
> > If anyone is interested in working on speedups, this would be a really
> > great place to do it; body rules are by far the biggest CPU time sink in
> > our code.
>
> One source of slowness is that a lot of SA pattern writers use
> (?=[abcd]) zwla's in the belief, I think, to supply more information to
> the engine, but even in older perls this was always a loss. You have to
> have a something like 95-99% of strings not matching /[abcd]/ for the
> benefit to kick in, at least that has been my experience.
>
> I think the biggest problem is the fact that the scanning algorithm
> consists of applying one pattern after the other to see what it triggers.
>
> I wrote Regexp::Assemble to allow one to take a large slab of patterns
> and concoct a single expression (also structured as a trie,
> incidentally) as a result. The benefit is that all the common paths are
> shared, so all the expensive curlies and stars are visited only one.
>
> For instance, with /a+b+c+/ and /a+c+e+c+/ the result is
> /a+(c+e+|b+)c+/, so if you are give aaaafc, you don't have to try the
> first pattern, fail and then the second pattern and fail. As soon as the
> alternation in the middle of the resulting pattern above fails, there's
> no backtracking and so you stop. When you have three thousand patterns
> instead of two, this starts to become very interesting.
>
> Regexp::Assemble also has a mode whereby you can build up a hash of
> pattern => coderef pairs, assemble all the patterns into a single
> pattern, and then feed that the body text.
>
> This then means that your entire scan of the body just becomes a single
> //g pattern match, and each time something matches, you can go back to
> the dispatch table and call out the action according to what just
> matched. The more patterns you have, the more efficient it becomes.
>
> I'd be happy to explain this in more detail if this sounds like
> something you want to explore.
>
> David
> --
> Much of the propaganda that passes for news in our own society is given
> to immobilising and pacifying people and diverting them from the idea
> that they can confront power. -- John Pilger