You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Bowie Bailey <Bo...@BUC.com> on 2006/04/21 17:55:56 UTC

RE: Advanced regex question - backtracking vs. negative lookahead s

David Landgren wrote:
> Bowie Bailey wrote:
> 
> [...]
> 
> > > An alternative solution would be this:
> > > 
> > > /style="[^>]+color:blue/
> > 
> > This looks better.  It is probably less resource-intensive than
> > your previous attempt and is definitely easier to read.  But why
> > are you looking for > when you anchor the beginning with a quote?
> > 
> > How about this:
> > 
> >     /style="[^"]+?color:blue/
> > 
> > This is also non-greedy, so it will start looking for the
> > "color:blue" match at the beginning of the string instead of
> > having the + slurp up everything up to the quote and then
> > backtracking to find the match. 
> 
> The regexp engine doesn't slurp. It just scans from left to right,
> noting "I might have to come back here" along the way.

Ok, so "slurp" was a bit of a simplification.  :)

My understanding is that with [^"]+ the engine will scan from left to
right until it finds a quote.  Then, in the context of the previous
regex, it will start backtracking to find a match for "color:blue".

In any case, with the non-greedy quantifier, it will stop looking when
it finds the first "color:blue" string instead of continuing to the
end of the string.

-- 
Bowie

Re: Advanced regex question - backtracking vs. negative lookaheads

Posted by Jeremy Fairbrass <jf...@hotmail.com>.
Good point, you're completely right! Thanks for pointing that out....... :)

Cheers,
Jeremy


"John Rudd" <jr...@ucsc.edu> wrote in message 
news:a6db3852de517a026f7d99c70e64e79e@ucsc.edu...
>
> On Apr 25, 2006, at 6:33 AM, Jeremy Fairbrass wrote:
>
>>
>>
>>     /style="[^>]+color:blue/
>>
>>
>>
>>     <span style="color:blue; font-size:small; border:0px">
>>
>
> Just a small note, which may be mostly a digression but:
>
> I don't think the above regex will match that string at all.
>
> The regex, because it has a + instead of a *, requires at least one 
> character between the " and color:blue ... your string doesn't have that.
>
>
> 




Re: Advanced regex question - backtracking vs. negative lookaheads

Posted by John Rudd <jr...@ucsc.edu>.
On Apr 25, 2006, at 6:33 AM, Jeremy Fairbrass wrote:

>
>
>     /style="[^>]+color:blue/
>
>
>
>     <span style="color:blue; font-size:small; border:0px">
>

Just a small note, which may be mostly a digression but:

I don't think the above regex will match that string at all.

The regex, because it has a + instead of a *, requires at least one 
character between the " and color:blue ... your string doesn't have 
that.



Re: Advanced regex question - backtracking vs. negative lookaheads

Posted by Jeremy Fairbrass <jf...@hotmail.com>.
Thanks guys for the clarifications! My understanding of how regex worked was
the same as Bowie's, ie:
-----
> My understanding is that with [^"]+ the engine will scan from left to
> right until it finds a quote.  Then, in the context of the previous
> regex, it will start backtracking to find a match for "color:blue".
-----

I  use the free Regex Coach tool from http://www.weitz.de/regex-coach/ to
test my regex, and it works the way Bowie described above, ie. using
backtracking. In other words, using:

    /style="[^>]+color:blue/

...the [^>]+ causes the regex to go all the way to the closing > character,
then backtracks until it finds the "color:blue" part. This also agrees with
what is explained at www.regular-expressions.info which I believe is a
reliable guide to Perl regex.

Also, Bowie suggested using laziness instead:

    /style="[^"]+?color:blue/

But I believe laziness also uses backtracking, so I'm not sure there is
*much* of an advantage of this over the greedy regex shown above. Probably
the main advantage of the lazy version would be if there was little or no
text between the first quote-mark and the "color:blue" part, and/or lots of
text between "color:blue" and the last quote-mark, eg:

    <span style="color:blue; font-size:small; border:0px">

...The regex would hit this much quicker using the lazy version than the
greedy version. But I'm not sure if there really is a difference, especially
if I want to be able to hit on SPAN tags that might have more text before
the "color:blue" OR might have more text afterwards. Probably it's six of
one and half a dozen of the other, right?! Why did David describe the lazy
version as "slightly less good" than the greedy version?

Incidentally the reason I used [^>]+ rather than [^"]+ was to prevent it
from using lots of memory if there was no closing quote - as an alternative
to using {1,20}.

In any case, both Bowie and David agree that my first solution using
(.(?!color))+ is a really bad idea, and that was the main thing I wanted to
know! :)

Thanks,
Jeremy




"Bowie Bailey" <Bo...@BUC.com> wrote in message
news:4766EEE585A6D311ADF500E018C154E30213393E@bnifex.cis.buc.com...
> David Landgren wrote:
>> Bowie Bailey wrote:
>>
>> [...]
>>
>> > > An alternative solution would be this:
>> > >
>> > > /style="[^>]+color:blue/
>> >
>> > This looks better.  It is probably less resource-intensive than
>> > your previous attempt and is definitely easier to read.  But why
>> > are you looking for > when you anchor the beginning with a quote?
>> >
>> > How about this:
>> >
>> >     /style="[^"]+?color:blue/
>> >
>> > This is also non-greedy, so it will start looking for the
>> > "color:blue" match at the beginning of the string instead of
>> > having the + slurp up everything up to the quote and then
>> > backtracking to find the match.
>>
>> The regexp engine doesn't slurp. It just scans from left to right,
>> noting "I might have to come back here" along the way.
>
> Ok, so "slurp" was a bit of a simplification.  :)
>
> My understanding is that with [^"]+ the engine will scan from left to
> right until it finds a quote.  Then, in the context of the previous
> regex, it will start backtracking to find a match for "color:blue".
>
> In any case, with the non-greedy quantifier, it will stop looking when
> it finds the first "color:blue" string instead of continuing to the
> end of the string.
>
> -- 
> Bowie
>