You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/11/18 21:17:21 UTC

Re: would SA benefit from port to Java

Giampaolo Tomassoni writes:
> From: Matt Kettler [mailto:mkettler_sa@verizon.net]
> > 1) perl has a substantial base of text parsing and utility libraries
> > that no other language can match.. Java does have native regex
> > support, so it has a leg up over the others,
> 
> Right, but both langs are not that much suited for scoring a message:
> they apply all the rules to the very same piece of text.
> 
> It would be interesting, instead, to "invert" this approach by designing
> a finite state machine which is basicly a pre-compiled version of the
> whole rule body. You feed once the message in, and you get the results
> (i.e.: fired rules and/or message score).
> 
> I believe that this approach would reduce memory consumption as well as
> execution time a lot.
> 
> It would not be suitable for custom plugins, however. But all the
> standard rules (even the "expensive" ones in terms of computational
> power and memory footprint) would probably perform better this way.
> 
> The basic idea in the FSM model is that the pre-compiler is going to run
> just sometimes, maybe when a rule gets changed, added or deleted to the
> rule body. The pre-compiler could eventually even optimize the resulting
> FSM, perhaps by "merging" together paths shared by different rules. The
> .cf files syntax would not even need to be changed and this method could
> even allow for injecting a new, pre-compiled rule body version into an
> alive spamassassin.
> 
> Optionally, the FSM approach could be implemented the well-appreciated,
> actual perl by use of an external perl module.
> 
> Did anybody heard or thought of something like this?
> 
> Do you believe that an FSM would really improve SA performances?

Recently in the perl "blead" code, one of the perl hackers has added a
trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently
match multiple regular expressions in parallel, to the perl core regexp
matching code.  That's pretty much what you're describing, and I'm looking
into rewriting bits of SpamAssassin to take advantage of that (in the
"jm_re2c_hacks" branch).

Hopefully it will run faster than the current regexp matching system,
which is actually quite fast as it stands!  (The perl regular expression
matching engine is _very_ efficient.)

There's also an re2c-based version, which already outperforms basic
SpamAssassin by 15-20%, btw.

They almost definitely will not reduce memory usage, though. ;)

--j.

RE: would SA benefit from port to Java

Posted by Giampaolo Tomassoni <g....@libero.it>.
From: jm@jmason.org [mailto:jm@jmason.org]
>
> ...omissis
>
> Recently in the perl "blead" code, one of the perl hackers has added a
> trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently
> match multiple regular expressions in parallel, to the perl core regexp
> matching code.  That's pretty much what you're describing,

Yes, I think so too. I didn't know the name of such a beast. Aho-Corasick. It should definitely work. How could something named Aho-Corasick not to work? :)

Thank you for naming it.


> and I'm looking
> into rewriting bits of SpamAssassin to take advantage of that (in the
> "jm_re2c_hacks" branch).
> 
> Hopefully it will run faster than the current regexp matching system,
> which is actually quite fast as it stands!  (The perl regular expression
> matching engine is _very_ efficient.)
> 
> There's also an re2c-based version, which already outperforms basic
> SpamAssassin by 15-20%, btw.
> 
> They almost definitely will not reduce memory usage, though. ;)

Mmmmh, I had the impression that all that strings being created, cloned, used, merged and the like in spite of being fed to the regexes would be one of the reasons of big memory usage. So, I'm wrong in this...

What's the memory-hungry piece of code, then?

giampaolo


> 
> --j.