Posted to users@spamassassin.apache.org by Adam Katz <an...@khopis.com> on 2011/11/29 23:22:50 UTC

Martin Gregorie's portmanteau rule building script

On 11/25/2011 10:13 AM, Martin Gregorie wrote:
> Subject: [Fwd: Re: How long a rule can be?]

My main answers to the original thread were posted there (today). I
guess I'm too accustomed to orderly threads: between my threaded view
in Thunderbird and the big pile of mail unread since before the holiday,
I missed this thread when responding to the original.

If you want to fork the thread into a tangent, please change the subject
so other responses to it don't follow you.  Also, don't respond to the
parts of the thread you are not forking; those belong in another message
in the original thread.

</rant>


> If you're finding your rule is starting to get difficult to maintain,
> take a look at my rule assembly tool, which is designed to allow such
> rules to be defined in an easily edited file for each rule that are
> used to create a single .cf file. See: 
> http://www.libelle-systems.com/free/portmanteau/portmanteau.tgz
> 
> I was thinking of using a server plus plugin to do this but was 
> convinced that this 'portmanteau rule' approach was better: it 
> certainly works well for me.

You might want to consider Regexp::Assemble for your tool, though that
would require using perl.  This would cause your man page's example rule
to result in something like this:

       body     __AU0 /(?i-xsm:\balt[123]\b)/

rather than your script's *much* slower:

       body     __AU0 /\b(alt1|alt2|alt3)\b/i
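
To illustrate the kind of folding involved, here is a minimal sketch in
Python (not Perl, and not Regexp::Assemble's actual algorithm). It builds
a trie from literal alternatives and emits a pattern with shared prefixes
factored out. The names assemble() and _emit() are hypothetical, and the
sketch handles literal strings only, not arbitrary sub-patterns.

```python
import re


def assemble(words):
    """Fold literal alternatives into one pattern by factoring shared prefixes."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks "a word ends here"
    return _emit(trie)


def _emit(node):
    """Turn one trie node into a regex fragment (hypothetical helper)."""
    ends_here = "" in node
    singles, longer = [], []
    for ch, child in sorted(node.items()):
        if ch == "":
            continue
        tail = _emit(child)
        if tail:
            longer.append(re.escape(ch) + tail)
        else:
            singles.append(re.escape(ch))  # leaf: a single final character
    if len(singles) > 1:
        # Several one-character endings collapse into a character class.
        alts = ["[" + "".join(singles) + "]"] + longer
    else:
        alts = singles + longer
    if not alts:
        return ""
    if len(alts) == 1 and not ends_here:
        return alts[0]
    grouped = "(?:" + "|".join(alts) + ")"
    # If a shorter word ends at this node, the rest must be optional.
    return grouped + "?" if ends_here else grouped


if __name__ == "__main__":
    print(assemble(["alt1", "alt2", "alt3"]))  # alt[123]
```

Wrapping the result in \b...\b gives the body of the first rule above.
The real module does considerably more than this (it accepts regex
fragments, not just literals), so treat the sketch as an explanation of
the idea rather than a replacement for it.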


Re: Martin Gregorie's portmanteau rule building script

Posted by Adam Katz <an...@khopis.com>.
On 11/30/2011 03:59 AM, Martin Gregorie wrote:
> On Tue, 2011-11-29 at 14:22 -0800, Adam Katz wrote:
>> You might want to consider Regexp::Assemble for your tool, though
>> that would require using perl. This would cause your man page's
>> example rule to result in something like this:
>> 
>>        body     __AU0 /(?i-xsm:\balt[123]\b)/
>>
>> rather than your script's *much* slower:
>>
>>        body     __AU0 /\b(alt1|alt2|alt3)\b/i
>>
> Interesting idea. Currently my system's performance seems 'adequate',
> considering I'm running SA on an 866 MHz P3 box with 512 MB RAM:
>                 Min                     Avg      Max
> Scan times:     0.9 (   3401 bytes)     4.0    128.3 (  72858 bytes)
> Msg sizes:     2258 (    1.8 secs )   10474   507533 (    6.2 secs )
> Messages:      2032
> 
> What sort of speed-up would Regexp::Assemble provide? 
> How would that compare with compiling the portmanteau.cf file?

Great question.  I do not have an answer.

How much optimization does re2c provide?  I am under the impression all
it does is convert text-based PCREs to C/C++ code of some sort, which
fully(?) mimics the original regexp's logic, implying that optimization
before compilation matters a lot.

I popped into irc://freenode.net#regex to ask, but this is apparently
too arcane a question.  Maybe somebody will have an answer in time.  (I
am not motivated enough to create an impromptu benchmark suite myself.)
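
For anyone who does want to try, a rough starting point might look like
the sketch below. It is written in Python and uses Python's re engine,
which is neither Perl's regex engine nor re2c-compiled code, so any
numbers it prints say nothing definite about SpamAssassin itself. The
sample text, the repeat counts, and the names count_hits() and bench()
are all assumptions made for illustration.

```python
import re
import timeit

# The two rule bodies from the thread, compiled case-insensitively.
FACTORED = re.compile(r"\balt[123]\b", re.IGNORECASE)
ALTERNATION = re.compile(r"\b(alt1|alt2|alt3)\b", re.IGNORECASE)

# Synthetic "message body": mostly near-misses, two real hits per repetition.
SAMPLE = "lorem ipsum alternate altitude alt2 dolor sit amet ALT3 " * 2000


def count_hits(pattern, text=SAMPLE):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


def bench(pattern, repeat=5, number=10):
    """Best-of-`repeat` wall time, in seconds, for `number` full scans."""
    return min(timeit.repeat(lambda: count_hits(pattern),
                             repeat=repeat, number=number))


if __name__ == "__main__":
    # Both patterns must agree before the timings mean anything.
    assert count_hits(FACTORED) == count_hits(ALTERNATION)
    print("factored:    %.4f s" % bench(FACTORED))
    print("alternation: %.4f s" % bench(ALTERNATION))
```

A fair test for SA would need the same comparison run in Perl against
real mail, and again after compiling the rules, since each engine may
optimize simple alternations differently.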


Re: Martin Gregorie's portmanteau rule building script

Posted by Martin Gregorie <ma...@gregorie.org>.
On Tue, 2011-11-29 at 14:22 -0800, Adam Katz wrote:

> If you want to fork the thread into a tangent, please change the subject
> so other responses to it don't follow you.  Also, don't respond to the
> parts of the thread you are not forking; those belong in another message
> in the original thread.
> 
That wasn't my intention. I *thought* I was merely adding an aside to
say "if you really want rules with lots of alternates, here's a tool
that can help" because I think we've all struggled with rules that
straggle off the right edge of the page in many editors. I know vi/vim
will wrap those lines, but a lot of people dislike vi.

> You might want to consider Regexp::Assemble for your tool, though that
> would require using perl.  This would cause your man page's example rule
> to result in something like this:
> 
>        body     __AU0 /(?i-xsm:\balt[123]\b)/
> 
> rather than your script's *much* slower:
> 
>        body     __AU0 /\b(alt1|alt2|alt3)\b/i
> 
Interesting idea. Currently my system's performance seems 'adequate',
considering I'm running SA on an 866 MHz P3 box with 512 MB RAM:
                Min                     Avg      Max
Scan times:     0.9 (   3401 bytes)     4.0    128.3 (  72858 bytes)
Msg sizes:     2258 (    1.8 secs )   10474   507533 (    6.2 secs )
Messages:      2032

What sort of speed-up would Regexp::Assemble provide? 
How would that compare with compiling the portmanteau.cf file?
 

Martin