You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@spamassassin.apache.org by Bowie Bailey <Bo...@BUC.com> on 2006/04/21 16:59:33 UTC

RE: Advanced regex question - backtracking vs. negative lookahead s

Jeremy Fairbrass wrote:
> 
> Let's say I want to use regex to search for the phrase "color:blue"
> within a <span> tag as in the example below (just a made-up example
> for the sake of this question):
> 
> <span style="border:0px; color:blue; font-size:small">
> 
> In this case, the "color:blue" part is preceeded by some other text
> ("border:0px") after the first quote mark, but that preceeding text
> could in fact be anything, and I want to allow for the fact that it
> could be anything.
> 
> I've read at http://www.regular-expressions.info that it's best to
> avoid backtracking if possible because that is resource-intensive.
> 
> So one possible solution would be the following:
> 
> /style="(.(?!color))+.color:blue/

This seems to me to be very inefficient.  At each point in the string
it has to read forward to check for "color".

> In other words, after the first " (quote mark) it looks for any
> character NOT followed by the word "color", and repeats that with the
> + character, until it gets to the actual word "color". I believe this
> results in no (or almost no?) backtracking. But I'm not sure if it's
> resource-intensive anyway, because of the negative lookahead - are
> negative lookaheads particularly resource intensive, when compared to
> backtracking? Is one preferable over the other?
> 
> An alternative solution would be this:
> 
> /style="[^>]+color:blue/

This looks better.  It is probably less resource-intensive than your
previous attempt and is definitely easier to read.  But why are you
looking for > when you anchor the beginning with a quote?

How about this:

    /style="[^"]+?color:blue/

This is also non-greedy, so it will start looking for the "color:blue"
match at the beginning of the string instead of having the + slurp up
everything up to the quote and then backtracking to find the match.

For SA purposes, you may want to limit the search as well.

    /style="[^"]{1,20}?color:blue/

This way, it will stop looking after 20 characters.  This prevents it
from using lots of memory if the quotes aren't closed.

> But this will certainly involve some backtracking, especially if
> there is even more text after the "color:blue" but before the
> closing > character, for example the "font-size:small" text.
> 
> So what do you think?! Which way is best, ie. most efficient or least
> resource-intensive?

-- 
Bowie

Re: Advanced regex question - backtracking vs. negative lookahead s

Posted by David Landgren <da...@landgren.net>.

Bowie Bailey wrote:

[...]

>> An alternative solution would be this:
>>
>> /style="[^>]+color:blue/
> 
> This looks better.  It is probably less resource-intensive than your
> previous attempt and is definitely easier to read.  But why are you
> looking for > when you anchor the beginning with a quote?
> 
> How about this:
> 
>     /style="[^"]+?color:blue/
> 
> This is also non-greedy, so it will start looking for the "color:blue"
> match at the beginning of the string instead of having the + slurp up
> everything up to the quote and then backtracking to find the match.

The regexp engine doesn't slurp. It just scans from left to right, 
noting "I might have to come back here" along the way.

> For SA purposes, you may want to limit the search as well.
> 
>     /style="[^"]{1,20}?color:blue/
> 
> This way, it will stop looking after 20 characters.  This prevents it
> from using lots of memory if the quotes aren't closed.

Good point.

>> But this will certainly involve some backtracking, especially if
>> there is even more text after the "color:blue" but before the
>> closing > character, for example the "font-size:small" text.

No it won't. It will scan once and quit. It never encountered any other 
alternatives that would require backtracking.

David
-- 
"It's overkill of course, but you can never have too much overkill."