Posted to oro-user@jakarta.apache.org by no spam <mr...@gmail.com> on 2009/01/19 05:09:47 UTC

pattern and infinite loop

I'm using this pattern:
p5c.compile(".*?<td\\s+.+<span\\s+class=\"nametext\">"+
".+?<strong>(.+?)</strong></font>.+?Profile\\s+Views",
Perl5Compiler.SINGLELINE_MASK);

to try to pull genres out of MySpace pages.  However, some pages, like this
one, result in infinite loops:

http://www.myspace.com/pain

How can I prevent these loops?

Re: pattern and infinite loop

Posted by no spam <mr...@gmail.com>.
I'm using contains().  Strange... not sure what's going on.

> Are you using contains() or match()?  If you're using match(), then
> switch to contains() and it should work.  Here's my sanity check for
> the pattern (to avoid having to write a Java test program):
>
> ~> wget -O - http://www.myspace.com/pain  2> /dev/null | perl -e '@txt =
> <STDIN>; $txt = join("", @txt); $txt =~
> m#<span\s+class="nametext">[^<]*</span><br>[^<]*<font\s[^>]*><strong>([^<]+)</strong></font>#si;
> print "$1\n";'
>
>  Metal / Industrial
>

Ah yes, I figured that was the issue after I saw your pattern.  The bit I
don't understand, though, is how [^<]* works.  What exactly does that
part of the pattern mean?
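
As a small illustration of the question above: [^<] is a negated character class that matches any single character except "<", and the * repeats it zero or more times, so [^<]* consumes text up to, but not including, the next "<". The sketch below uses java.util.regex rather than ORO (an assumption, made so it runs on a stock JDK; this character-class syntax is shared with Perl 5 style patterns). The class name, method name, and sample strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // Returns the text that [^<]* consumes at the start of the input.
    static String textBeforeTag(String input) {
        Matcher m = Pattern.compile("[^<]*").matcher(input);
        // lookingAt() anchors at the start of the input. Because [^<]* can
        // match the empty string, this succeeds even if input starts with '<'.
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Stops right before the '<' of the closing tag.
        System.out.println(textBeforeTag("Metal / Industrial</strong>"));
        // Input starts with '<', so the class matches zero characters.
        System.out.println("[" + textBeforeTag("<br>") + "]");
    }
}
```

Because the class can never cross a "<", the engine has far fewer ways to split the input than it does with ".+" or ".*", which is what keeps backtracking in check.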

> In any case, the key to preventing excessive backtracking is to make the
> pattern as specific as possible.  The original pattern posed problems
> because of the leading .* as well as the following .+ pattern elements,
> which caused a lot of backtracking.
>
>
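
The difference between searching for a pattern inside the input and requiring the pattern to cover the whole input can be sketched as follows. This is only an analogy: it uses java.util.regex, where find() searches anywhere in the input and matches() requires a full match, as stand-ins for the contains() and match() calls discussed in the thread. The class name, the shortened pattern, and the sample HTML are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchVsFullMatch {
    // A specific pattern with no leading ".*": it only describes the part
    // of the page we actually care about.
    static final Pattern GENRE = Pattern.compile("<strong>([^<]+)</strong>");

    // True only if the pattern accounts for the entire input.
    static boolean fullMatch(String html) {
        return GENRE.matcher(html).matches();
    }

    // Searches for the pattern anywhere in the input; returns the captured
    // group, or null if the pattern does not occur.
    static String search(String html) {
        Matcher m = GENRE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td><font><strong>Metal / Industrial</strong></font></td>";
        System.out.println(fullMatch(html)); // surrounding markup prevents a full match
        System.out.println(search(html));    // the search finds the embedded match
    }
}
```

With a search-style call there is no need to pad the pattern with ".*" on either side, which removes the main source of backtracking described above.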

Re: pattern and infinite loop

Posted by no spam <mr...@gmail.com>.
>
> Presumably, you're concerned only with the capture group (containing
> the genre), so rewrite the expression along the following lines to
> avoid the ambiguous/excessive backtracking:
>
> p5c.compile("<span\\s+class=\"nametext\">[^<]*</span><br>[^<]*<font[^>]*>"+
>            "<strong>([^<]+)</strong></font>",
>            Perl5Compiler.SINGLELINE_MASK);
>

Yes, that's correct.  This pattern prevented the looping; it just didn't
match for that particular page.  I'll have to digest this pattern a bit more
in my head :o)
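
For anyone wanting to experiment with the rewritten pattern quoted above, here is a minimal sketch. It is an assumption-laden stand-in: it uses java.util.regex instead of ORO, with Pattern.DOTALL playing the role of Perl5Compiler.SINGLELINE_MASK, and both HTML snippets are invented for illustration rather than taken from the actual page. The second snippet shows one way a very specific pattern can miss (a "<br />" where the pattern expects "<br>"); it is not a diagnosis of why the real page failed to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenreExtractor {
    // Same pattern text as the rewritten expression in the thread.
    static final Pattern GENRE = Pattern.compile(
        "<span\\s+class=\"nametext\">[^<]*</span><br>[^<]*<font[^>]*>"
            + "<strong>([^<]+)</strong></font>",
        Pattern.DOTALL);

    // Returns the captured genre, or null when the markup does not fit.
    static String extractGenre(String html) {
        Matcher m = GENRE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical markup that fits the pattern exactly.
        String fits = "<span class=\"nametext\">Pain</span><br>\n"
            + "<font size=\"1\"><strong>Metal / Industrial</strong></font>";
        // Hypothetical variation: "<br />" instead of "<br>".
        String differs = "<span class=\"nametext\">Pain</span><br />\n"
            + "<font size=\"1\"><strong>Metal / Industrial</strong></font>";

        System.out.println(extractGenre(fits));
        System.out.println(extractGenre(differs));
    }
}
```

The trade-off is the usual one: the more specific the pattern, the less it backtracks, but the more sensitive it becomes to small differences in the markup.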