Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/11/10 18:05:43 UTC
Re: 3.2.0?

Kevin A. McGrail writes:
> > hey -- anyone think we should consider getting 3.2.0 out before January? I
> > think it may be doable.
> >
> > The one major feature I want to get in is the re2c/sa-compile speedup code
> > in the side branch -- it provides about a 20% speedup of scanning by
> > compiling parts of the ruleset into native code, which is nice. ;)
> 
> I would like to see it be released before January.  The 20% speedup sounds 
> amazing especially because I see more and more rules each day.  Is there any 
> reduced RAM usage as well?  I assume there is 20% less CPU usage just 
> because it finishes quicker.

Yep, CPU time goes down by a similar amount (SpamAssassin is generally
CPU-bound).

However, I don't think it really helps RAM usage; it probably increases it
a little, unfortunately.  I agree reducing RAM usage is important though,
especially now that RAM-to-CPU bandwidth is becoming even more of a
bottleneck than CPU time... I need to look into this more.

Here are some timings, btw.  I tested it on a couple of weeks of my corpus
-- 3395 hams and 15795 spams -- using perl 5.8.8, mass-check, and the
latest SVN trunk ruleset including sandbox rules.

Without rule2xs active:

real   avg=2037.131s min=2032.047s max=2045.501s count=3
user   avg=1884.417s min=1881.802s max=1887.930s count=3
sys    avg=29.990s min=28.354s max=31.446s count=3

that's (19190 / 2037.131) = 9.42 messages/sec.

With the compiled ruleset:

real   avg=1781.106s min=1769.190s max=1797.974s count=4
user   avg=1637.173s min=1633.754s max=1640.727s count=4
sys    avg=27.706s min=22.197s max=31.578s count=4

that's (19190 / 1781.106) = 10.77 messages/sec, about a 14% speedup.  (It
varies depending on what rules are loaded and what mail is scanned, btw,
hence 14 != 20.)
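
For anyone who wants to re-check the arithmetic, here is a minimal sketch in Python. The figures are copied from the timings above; the variable names are just illustrative, and the user-CPU comparison is an extra calculation on the same numbers.

```python
# Figures copied from the mass-check timings above.
hams, spams = 3395, 15795
messages = hams + spams  # total messages scanned in each run

real_uncompiled = 2037.131  # avg wall-clock seconds, rule2xs not active
real_compiled = 1781.106    # avg wall-clock seconds, compiled ruleset
user_uncompiled = 1884.417  # avg user CPU seconds, rule2xs not active
user_compiled = 1637.173    # avg user CPU seconds, compiled ruleset

# Throughput in messages per second of wall-clock time.
rate_uncompiled = messages / real_uncompiled
rate_compiled = messages / real_compiled

# Relative throughput gain, and relative reduction in user CPU time.
speedup_pct = (rate_compiled / rate_uncompiled - 1) * 100
cpu_saving_pct = (1 - user_compiled / user_uncompiled) * 100

print(f"{rate_uncompiled:.2f} msgs/sec without, {rate_compiled:.2f} with")
print(f"throughput speedup: {speedup_pct:.1f}%")
print(f"user CPU time saved: {cpu_saving_pct:.1f}%")
```

The user-time reduction comes out close to the wall-clock one, which is what you'd expect for a CPU-bound workload.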

> On a similar topic, perhaps, I have been contemplating if the compilation to 
> native code could do something to not require ?: on every () regexp.   I 
> find that A) I'm lazy on adding them and B) they can get insane on trying to 
> read and debug some of the more complex rules.

Yeah -- it'll do this automatically.  However it's an optional plugin,
and most people will probably not be using it -- so we can't count on
it being loaded :(

For what it's worth, we should extend --lint to warn about these;
that would make it pretty clear when a rule needs fixing, I think.
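
To make the idea concrete, here is a rough sketch of such a check, written in Python purely for illustration. It is not SpamAssassin code; the function name lint_captures is hypothetical, and the scanner is deliberately simplistic. It treats every "(?" construct as non-capturing and does not handle a literal "]" at the start of a character class.

```python
def lint_captures(pattern: str) -> list[int]:
    """Return offsets of capturing '(' that could probably become '(?:'.

    Hypothetical lint helper: if the pattern contains a backreference
    (\\1 .. \\9) the captures may be needed, so nothing is reported.
    """
    offsets = []
    has_backref = False
    in_class = False  # True while inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            nxt = pattern[i + 1:i + 2]
            # \1..\9 outside a character class is taken to be a backreference.
            if not in_class and nxt.isdigit() and nxt != "0":
                has_backref = True
            i += 2  # skip the escaped character as well
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(" and not pattern.startswith("(?", i):
            offsets.append(i)  # plain capturing group
        i += 1
    return [] if has_backref else offsets


for rule in (r"(foo|bar)baz", r"(?:foo|bar)baz", r"(a)\1"):
    hits = lint_captures(rule)
    status = f"capturing group(s) at offset(s) {hits}" if hits else "ok"
    print(f"{rule!r}: {status}")
```

A real check would have to work on perl's regexp syntax rather than this approximation, but the shape would be similar: find the capturing parens, and only complain when nothing refers back to them.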

> I've been talking with Mark Damrose about this and since you have to use \\1 
> \\2, for the replacements, could the "re2c/sa-compile" be changed to 
> additionally automatically add ?: to regexp without \\1, etc.?  This should 
> save a little on RAM and overhead, though I'm not sure how much really.

Hmm, unfortunately \1 and so on are too advanced for the rule2xs compiler;
it'll leave those rules as non-compiled body rules. re2c isn't up to the
full perl regexp vocabulary -- despite the sterling work that Matt Sergeant
has done in writing the compiler code to translate much of it, there's
still a lot of flexibility in perl's regexps that doesn't translate to the
re2c model (something to do with DFAs vs NFAs, I think ;)

(Oh yeah -- credit where due -- Matt is the guy who wrote much of this,
especially the rule2xs code which translates perl regexps into re2c in
the form of a perl XS module.  My hacking is mostly glue ;)

--j.