Posted to oro-user@jakarta.apache.org by "Daniel F. Savarese" <df...@savarese.org> on 2009/01/19 06:06:12 UTC
Re: pattern and infinite loop
In message <bd...@mail.gmail.com>, no spam writes:
>I'm using this pattern:
>p5c.compile(".*?<td\\s+.+<span\\s+class=\"nametext\">"+
>".+?<strong>(.+?)</strong></font>.+?Profile\\s+Views",
>Perl5Compiler.SINGLELINE_MASK);
>
>to try and pull genres out of MySpace pages. However, some pages like this
...
>How can I prevent these loops?
Presumably, you're concerned only with the capture group (containing
the genre), so rewrite the expression along the following lines to
avoid the ambiguous/excessive backtracking:
p5c.compile("<span\\s+class=\"nametext\">[^<]*</span><br>[^<]*<font[^>]*>"+
"<strong>([^<]+)</strong></font>",
Perl5Compiler.SINGLELINE_MASK);
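
For readers who want to try the rewritten expression without ORO on the classpath, here is a minimal sketch using java.util.regex instead of Perl5Compiler. It is not the ORO API: Pattern.DOTALL is used as a rough stand-in for SINGLELINE_MASK, and the class name GenreExtractor, the method extractGenre and the sample markup are made up for illustration. The real page markup may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ORO or of the original post.
class GenreExtractor {
    // Same expression as above, compiled with java.util.regex.
    // Each [^<]* / [^>]* can only consume up to the next tag boundary,
    // so the matcher has very little room to backtrack.
    static final Pattern GENRE = Pattern.compile(
        "<span\\s+class=\"nametext\">[^<]*</span><br>[^<]*<font[^>]*>"
        + "<strong>([^<]+)</strong></font>",
        Pattern.DOTALL);

    // Returns the captured genre, or null if the pattern is not found.
    static String extractGenre(String html) {
        Matcher m = GENRE.matcher(html);
        // find() searches anywhere in the input, like ORO's contains().
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed markup, modelled on the pattern rather than on a real page.
        String html = "<td><span class=\"nametext\">Pain</span><br>\n"
            + "<font size=\"1\"><strong>Metal / Industrial</strong></font></td>";
        System.out.println(extractGenre(html));
    }
}
```

If the page does not contain the span/br/font sequence exactly as the pattern spells it out, extractGenre returns null rather than looping, which is the trade-off of making the pattern this specific.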
Re: pattern and infinite loop
Posted by no spam <mr...@gmail.com>.
I'm using contains(). Strange... not sure what's going on.
> Are you using contains() or match()? If you're using match(), then
> switch to contains() and it should work. Here's my sanity check for
> the pattern (to avoid having to write a Java test program):
>
> ~> wget -O - http://www.myspace.com/pain 2> /dev/null | perl -e '@txt =
> <STDIN>; $txt = join("", @txt); $txt =~
> m#<span\s+class="nametext">[^<]*</span><br>[^<]*<font\s[^>]*><strong>([^<]+)</strong></font>#si;
> print "$1\n";'
>
> Metal / Industrial
>
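
The difference between searching for a pattern anywhere in the input and requiring the pattern to cover the whole input can be sketched with java.util.regex, where find() plays roughly the role of contains() and matches() the role of an exact match. This is an analogy under that assumption, not ORO code; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ORO.
class ContainsVsMatch {
    static final Pattern STRONG = Pattern.compile("<strong>([^<]+)</strong>");

    // Analogue of contains(): true if the pattern occurs anywhere in the input.
    static boolean foundAnywhere(String input) {
        return STRONG.matcher(input).find();
    }

    // Analogue of an exact match: true only if the pattern spans the entire input.
    static boolean matchesWhole(String input) {
        return STRONG.matcher(input).matches();
    }

    public static void main(String[] args) {
        String page = "<td><strong>Metal</strong></td>";
        System.out.println(foundAnywhere(page));  // pattern is embedded in the page
        System.out.println(matchesWhole(page));   // page has text around the pattern
    }
}
```

A pattern without a leading .* can only succeed against a full page when it is applied as a search, which is why the choice of method matters here.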
Ah yes, I figured that was the issue after I saw your pattern. The bit I
don't understand, though, is how [^<]* works. What exactly does that
part of the pattern mean?
> In any case, the key to preventing excessive backtracking is to make the
> pattern as specific as possible. The original pattern posed problems
> because of the leading .* as well as the following .+ pattern elements, which
> caused a lot of backtracking.
>
>
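
On the [^<]* question: in Perl-style regular expressions, [^<] is a negated character class that matches any single character except '<', and the * repeats it zero or more times. So [^<]* consumes text only up to the next tag and can never run across one, whereas .+? under single-line mode may keep growing across tags until the rest of the pattern fits. A small sketch with java.util.regex (not ORO; the class name, helper and sample input are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ORO.
class NegatedClassDemo {
    // Returns capture group 1 of the first match, or null if there is none.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<b>one</b><i>two</i>";

        // [^<]* stops at the first '<', so the capture stays inside the <b> element.
        System.out.println(capture("<b>([^<]*)</b>", input));

        // .+? is lazy, but it still expands across tags until </i> is reached.
        System.out.println(capture("<b>(.+?)</i>", input));

        // [^<]* cannot cross </b>, so this fails quickly instead of scanning on.
        System.out.println(capture("<b>([^<]*)</i>", input));
    }
}
```

The third case is the one that matters for backtracking: when the surrounding markup is not what the pattern expects, the negated class gives up after a short scan, while a chain of .* and .+ elements has many ways to redistribute characters and tries them all.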
Re: pattern and infinite loop
Posted by no spam <mr...@gmail.com>.
>
> Presumably, you're concerned only with the capture group (containing
> the genre), so rewrite the expression along the following lines to
> avoid the ambiguous/excessive backtracking:
>
> p5c.compile("<span\\s+class=\"nametext\">[^<]*</span><br>[^<]*<font[^>]*>"+
> "<strong>([^<]+)</strong></font>",
> Perl5Compiler.SINGLELINE_MASK);
>
Yes, that's correct. This pattern prevented the looping; it just didn't match
for that particular page. I'll have to digest this pattern a bit more in my
head :o)