You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2011/10/24 03:44:09 UTC

(Non-) Capturing REs (was: Re: How to write rule for From: line)

[ Plain regex rules sloppily using capturing rather than non-capturing
  grouping snipped. ]

On Mon, 2011-10-24 at 02:27 +0200, wolfgang wrote:
> As far as I know, with alternations you should use "?:" at their 
> beginning to avoid (superfluous) memory usage:
> 
> header FOO  From:name =~ /\b(?:sex|free|trial|enlarge\w{0,5})\b/i
> 
> That prevents Perl from "remembering" which item of the alternation 
> matched in your rule. With many rules scanning many mails using many 
> alternations, that can make a signficant difference in memory usage / 
> performance.

True, though merely in context of memory usage (and its performance) for
captured matches -- likely negligible, and only in the case of a regex
rule match. However...


Ah, the capturing vs non-capturing RE topic. :)  Something I've been
wondering about a few times. You're correct that using the non-
capturing (?:...) extended regular expression is being suggested
occasionally, to avoid the "substantial performance penalty" (see below)
that comes with using the capturing parentheses -- but is that true!?

The relevant part of the perlre documentation [1] is slightly vague and
ambiguous. Or maybe I've hit a language barrier.

However, as I read it, the warning is referring to the usage of the
special $&, $` and $' match capturing variables, resulting in a
substantial performance penalty -- and mentions the non-capturing
extended regex in this *context*, since it uses the same mechanism for
the $n matches. If these special vars are used.

Now, I just grepped the entire SA source code, and NONE of these spacial
vars are used. Yay!  (I did not grep all external SA dependencies, mind
you.)


So, does this "substantial performance penalty" using capturing groups
even apply to SA?

Is it really worth it, religiously using non-capturing grouping?

Wolfgang specifically mentioned the memory usage, not referring to the
substantial penalty above -- and I don't want to put words into his
mouth -- but I believe that clear warning about the special vars is what
everyone has in mind when preferring non-capturing over capturing groups
in the context of tiny rule matches, no?


[1] At the end of section Modifiers, sub-section Capture Groups, right
    before section Quoting Metacharacters.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: (Non-) Capturing REs

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2011-10-25 at 18:46 -0700, Adam Katz wrote:
> On 10/24/2011 02:45 PM, Karsten Bräckelmann wrote:

>         [...] though I seem to recall the SA debug output includes the
> matched text (which implies $&), though if this were important, I'm sure
> we'd have already concluded it worthwhile to do stupid things like
> surrounding entire regexps with (?=this).

Good point, the debug output indeed includes the match -- coincidentally
a fact I very recently stressed in another thread. As I said earlier, I
didn't find any use of these special match capturing variables grepping
the code. This made me curious, and digging for it quickly shows the
following in Check.pm:

  # note: keep this in 'single quotes' to avoid the $ & performance hit,
  # unless specifically requested by the caller.   Also split the
  # two chars, just to be paranoid and ensure that a buggy perl interp
  # doesn't impose that hit anyway (just in case)

And the code evaluated later to the special $& match variable, only run
in debug mode:

    $match = '($' . '&' . '|| "negative match")';


Given the rather explicit comment and this code, I am getting convinced
my interpretation of the perlre docs was correct. The global performance
penalty for capturing grouping ONLY applies, if these special vars are
used. It's a non-issue for normal operation [1], regardless of capturing
or non-capturing grouping.

I'd still be happy for someone with more in-depth knowledge about the
subject to confirm -- or disprove.


[1] For local rules, anyway. Regarding the bulk of the stock rules, I
    still support using (?:non) capturing grouping as best-practice.


> > Not trying to be confrontational, just honestly asking and wondering
> > about the real impact. After all, the perlre docs specifically 
> > mention to strongly prefer non-capturing grouping basically once
> > only -- in the warning paragraph about the special vars.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: (Non-) Capturing REs

Posted by Adam Katz <an...@khopis.com>.
On Mon, 2011-10-24 at 13:58 -0700, Adam Katz wrote:
>> Using special variables like those you mentioned are particularly 
>> bad, [...] That's not to say that the extra memory consumption
>> from an unnecessary grouping doesn't impact performance.

On 10/24/2011 02:45 PM, Karsten Bräckelmann wrote:
> Well, does it? Measurably? Even if the RE does *not* match?

If the RE doesn't match, I doubt it.  Not sure though.

> If so, does it still have any measurable effect, if we're talking a 
> handful custom rules, with stock rules using non-capturing grouping?
> (The objective here is a trade-off between optimized REs and not 
> confusing users who aren't intimately familiar with REs. They tend to
> get heavy to grasp rather quickly, and the extra ?: weird chars don't
> help that.)

Interesting point.  Maybe we shouldn't get into such detail with an
admin that just wants to add a few rules.

Also, there are better ways to optimize rules; e.g. assuming matchers
don't consume memory if the RE doesn't match, starting the RE
unambiguously -- non-parenthetical, non-globbed, non-character-class,
etc, ideally starting with a rare character; /\bfoo (bar|vaz)/ is better
than /(foo bar|foo vaz)/ while perl's left-to-right nature makes the
gain on /\b(hello|goodbye) world\b/ over /(hello world|goodbye world)/
far less notable (it's only notable if lots of other things commonly
follow "hello").

>>> Is it really worth it, religiously using non-capturing grouping?
>> 
>> From the profiling I've seen, yes it is.  (I don't have data to 
>> share though, sorry).
> 
> The profiled code, does it use the special match capturing variables
> *anywhere* in the entire program? The profiled and compared 
> versions, would that be like the equivalent of using capturing vs 
> non-capturing in all SA stock rules?

I'm not sure, though I seem to recall the SA debug output includes the
matched text (which implies $&), though if this were important, I'm sure
we'd have already concluded it worthwhile to do stupid things like
surrounding entire regexps with (?=this).

> Not trying to be confrontational, just honestly asking and wondering
> about the real impact. After all, the perlre docs specifically 
> mention to strongly prefer non-capturing grouping basically once
> only -- in the warning paragraph about the special vars.

The perl docs may have cut that for simplicity, just as you're
suggesting above ;-)

In reality, optimizing (including gaming of Perl's built-in
optimizations) is quite non-trivial.  Here's an excerpt from O'Reilly's
Mastering Regular Expressions (2nd Ed, page 253):

> Let me give a somewhat crazy example: you find  (000|999)$  in a Perl
> script, and decide to turn those capturing parentheses into 
> non-capturing parentheses. This should make things a bit faster, you
> think, since the overhead of capturing can now be eliminated. But 
> surprise, this small and seemingly beneficial change can slow this 
> regex down by /several orders of magnitude/ (thousands and thousands 
> of times slower). /What!?/ It turns out that a number of factors come
> together just right in this example to cause the end of 'string/line
> anchor optimization' (pg 246) to be turned off when non-capturing 
> parentheses are used. I don't want to dissuade you from using 
> non-capturing parentheses with Perl--their use is beneficial in the 
> vast majority of cases--but in this particular case, it's a
> disaster.


Re: (Non-) Capturing REs

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2011-10-24 at 13:58 -0700, Adam Katz wrote:
> On 10/23/2011 06:44 PM, Karsten Bräckelmann wrote:
> > [...] as I read it, the warning is referring to the usage of the 
> > special $&, $` and $' match capturing variables, resulting in a 
> > substantial performance penalty -- and mentions the non-capturing 
> > extended regex in this *context*, since it uses the same mechanism
> > for the $n matches. If these special vars are used.
> 
> Using special variables like those you mentioned are particularly bad,
> especially with some of the older versions of perl (I seem to recall
> some of them getting big performance boosts in more recent perl
> revisions).  That's not to say that the extra memory consumption from an
> unnecessary grouping doesn't impact performance.

Well, does it? Measurably? Even if the RE does *not* match?

If so, does it still have any measurable effect, if we're talking a
handful custom rules, with stock rules using non-capturing grouping?
(The objective here is a trade-off between optimized REs and not
confusing users who aren't intimately familiar with REs. They tend to
get heavy to grasp rather quickly, and the extra ?: weird chars don't
help that.)


> > Now, I just grepped the entire SA source code, and NONE of these
> > spacial vars are used. Yay!  (I did not grep all external SA
> > dependencies, mind you.)
> 
> I'm guessing I'm not the only person that looks through the rules
> periodically for such things, including frivolous portions like the glob
> in /foo.*/ or the range in /bar\W{2,30}/ and wipe them out to become
> e.g. /foo/ and /bar\W{2}/

Sure, those are bad, but an entirely different beast.

> > So, does this "substantial performance penalty" using capturing
> > groups even apply to SA?
> > 
> > Is it really worth it, religiously using non-capturing grouping?
> 
> From the profiling I've seen, yes it is.  (I don't have data to share
> though, sorry).

The profiled code, does it use the special match capturing variables
*anywhere* in the entire program? The profiled and compared versions,
would that be like the equivalent of using capturing vs non-capturing in
all SA stock rules?

Not trying to be confrontational, just honestly asking and wondering
about the real impact. After all, the perlre docs specifically mention
to strongly prefer non-capturing grouping basically once only -- in the
warning paragraph about the special vars.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: (Non-) Capturing REs

Posted by Adam Katz <an...@khopis.com>.
On 10/23/2011 06:44 PM, Karsten Bräckelmann wrote:
> [...] as I read it, the warning is referring to the usage of the 
> special $&, $` and $' match capturing variables, resulting in a 
> substantial performance penalty -- and mentions the non-capturing 
> extended regex in this *context*, since it uses the same mechanism
> for the $n matches. If these special vars are used.

Using special variables like those you mentioned are particularly bad,
especially with some of the older versions of perl (I seem to recall
some of them getting big performance boosts in more recent perl
revisions).  That's not to say that the extra memory consumption from an
unnecessary grouping doesn't impact performance.

> Now, I just grepped the entire SA source code, and NONE of these
> spacial vars are used. Yay!  (I did not grep all external SA
> dependencies, mind you.)

I'm guessing I'm not the only person that looks through the rules
periodically for such things, including frivolous portions like the glob
in /foo.*/ or the range in /bar\W{2,30}/ and wipe them out to become
e.g. /foo/ and /bar\W{2}/

> So, does this "substantial performance penalty" using capturing
> groups even apply to SA?
> 
> Is it really worth it, religiously using non-capturing grouping?

From the profiling I've seen, yes it is.  (I don't have data to share
though, sorry).