You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Bowie Bailey <Bo...@BUC.com> on 2006/04/21 17:55:56 UTC
RE: Advanced regex question - backtracking vs. negative lookahead
s
David Landgren wrote:
> Bowie Bailey wrote:
>
> [...]
>
> > > An alternative solution would be this:
> > >
> > > /style="[^>]+color:blue/
> >
> > This looks better. It is probably less resource-intensive than
> > your previous attempt and is definitely easier to read. But why
> > are you looking for > when you anchor the beginning with a quote?
> >
> > How about this:
> >
> > /style="[^"]+?color:blue/
> >
> > This is also non-greedy, so it will start looking for the
> > "color:blue" match at the beginning of the string instead of
> > having the + slurp up everything up to the quote and then
> > backtracking to find the match.
>
> The regexp engine doesn't slurp. It just scans from left to right,
> noting "I might have to come back here" along the way.
Ok, so "slurp" was a bit of a simplification. :)
My understanding is that with [^"]+ the engine will scan from left to
right until it finds a quote. Then, in the context of the previous
regex, it will start backtracking to find a match for "color:blue".
In any case, with the non-greedy quantifier, it will stop looking when
it finds the first "color:blue" string instead of continuing to the
end of the string.
--
Bowie
Re: Advanced regex question - backtracking vs. negative lookaheads
Posted by Jeremy Fairbrass <jf...@hotmail.com>.
Good point, you're completely right! Thanks for pointing that out....... :)
Cheers,
Jeremy
"John Rudd" <jr...@ucsc.edu> wrote in message
news:a6db3852de517a026f7d99c70e64e79e@ucsc.edu...
>
> On Apr 25, 2006, at 6:33 AM, Jeremy Fairbrass wrote:
>
>>
>>
>> /style="[^>]+color:blue/
>>
>>
>>
>> <span style="color:blue; font-size:small; border:0px">
>>
>
> Just a small note, which may be mostly a digression but:
>
> I don't think the above regex will match that string at all.
>
> The regex, because it has a + instead of a *, requires at least one
> character between the " and color:blue ... your string doesn't have that.
>
>
>
Re: Advanced regex question - backtracking vs. negative lookaheads
Posted by John Rudd <jr...@ucsc.edu>.
On Apr 25, 2006, at 6:33 AM, Jeremy Fairbrass wrote:
>
>
> /style="[^>]+color:blue/
>
>
>
> <span style="color:blue; font-size:small; border:0px">
>
Just a small note, which may be mostly a digression but:
I don't think the above regex will match that string at all.
The regex, because it has a + instead of a *, requires at least one
character between the " and color:blue ... your string doesn't have
that.
Re: Advanced regex question - backtracking vs. negative lookaheads
Posted by Jeremy Fairbrass <jf...@hotmail.com>.
Thanks guys for the clarifications! My understanding of how regex worked was
the same as Bowie's, ie:
-----
> My understanding is that with [^"]+ the engine will scan from left to
> right until it finds a quote. Then, in the context of the previous
> regex, it will start backtracking to find a match for "color:blue".
-----
I use the free Regex Coach tool from http://www.weitz.de/regex-coach/ to
test my regex, and it works the way Bowie described above, ie. using
backtracking. In other words, using:
/style="[^>]+color:blue/
...the [^>]+ causes the regex to go all the way to the closing > character,
then backtracks until it finds the "color:blue" part. This also agrees with
what is explained at www.regular-expressions.info which I believe is a
reliable guide to Perl regex.
Also, Bowie suggested using laziness instead:
/style="[^"]+?color:blue/
But I believe laziness also uses backtracking, so I'm not sure there is
*much* of an advantage of this over the greedy regex shown above. Probably
the main advantage of the lazy version would be if there was little or no
text between the first quote-mark and the "color:blue" part, and/or lots of
text between "color:blue" and the last quote-mark, eg:
<span style="color:blue; font-size:small; border:0px">
...The regex would hit this much quicker using the lazy version than the
greedy version. But I'm not sure if there really is a difference, especially
if I want to be able to hit on SPAN tags that might have more text before
the "color:blue" OR might have more text afterwards. Probably it's six of
one and half a dozen of the other, right?! Why did David describe the lazy
version as "slightly less good" than the greedy version?
Incidentally the reason I used [^>]+ rather than [^"]+ was to prevent it
from using lots of memory if there was no closing quote - as an alternative
to using {1,20}.
In any case, both Bowie and David agree that my first solution using
(.(?!color))+ is a really bad idea, and that was the main thing I wanted to
know! :)
Thanks,
Jeremy
"Bowie Bailey" <Bo...@BUC.com> wrote in message
news:4766EEE585A6D311ADF500E018C154E30213393E@bnifex.cis.buc.com...
> David Landgren wrote:
>> Bowie Bailey wrote:
>>
>> [...]
>>
>> > > An alternative solution would be this:
>> > >
>> > > /style="[^>]+color:blue/
>> >
>> > This looks better. It is probably less resource-intensive than
>> > your previous attempt and is definitely easier to read. But why
>> > are you looking for > when you anchor the beginning with a quote?
>> >
>> > How about this:
>> >
>> > /style="[^"]+?color:blue/
>> >
>> > This is also non-greedy, so it will start looking for the
>> > "color:blue" match at the beginning of the string instead of
>> > having the + slurp up everything up to the quote and then
>> > backtracking to find the match.
>>
>> The regexp engine doesn't slurp. It just scans from left to right,
>> noting "I might have to come back here" along the way.
>
> Ok, so "slurp" was a bit of a simplification. :)
>
> My understanding is that with [^"]+ the engine will scan from left to
> right until it finds a quote. Then, in the context of the previous
> regex, it will start backtracking to find a match for "color:blue".
>
> In any case, with the non-greedy quantifier, it will stop looking when
> it finds the first "color:blue" string instead of continuing to the
> end of the string.
>
> --
> Bowie
>