Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/03/16 04:09:58 UTC

[jira] Created: (NUTCH-233) wrong regular expression hang reduce process for ever

wrong regular expression hang reduce process for ever 
------------------------------------------------------

         Key: NUTCH-233
         URL: http://issues.apache.org/jira/browse/NUTCH-233
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
    Reporter: Stefan Groschupf
    Priority: Blocker
     Fix For: 0.8-dev


It looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt is not compatible with java.util.regex, which is what the regex URL filter actually uses.
Maybe it was not updated when the regular expression package was changed.
The problem is that while reducing the fetch map output, the reducer hangs forever, because the output format applies the URL filter to a URL that causes the hang.
060315 230823 task_r_3n4zga     at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga     at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga     at java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)
However, maybe people can review it and suggest improvements, since the two expressions are not equivalent. Both the old and the new regex match:
"abcd/foo/bar/foo/bar/foo/"
But the old regex also matches the following, which the new regex does not:
"abcd/foo/bar/xyz/foo/bar/foo/"
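
The difference between the two expressions can be checked with a small standalone program. This is only a sketch: the class and method names are made up for illustration, and it assumes the filter tests a URL with find() (a match anywhere in the string) rather than matches(). It does not reproduce the hang, since the offending URL is not known.

```java
import java.util.regex.Pattern;

public class UrlFilterRegexCheck {
    // Patterns copied from the report; in a Java string literal the
    // backreference \1 has to be written as "\\1".
    public static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    public static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: a URL is rejected when the pattern is found anywhere in it.
    public static boolean hits(Pattern p, String url) {
        return p.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "abcd/foo/bar/foo/bar/foo/",
            "abcd/foo/bar/xyz/foo/bar/foo/"
        };
        for (String s : samples) {
            System.out.println(s + " old=" + hits(OLD, s) + " new=" + hits(NEW, s));
        }
    }
}
```

For the first sample both patterns report a hit; for the second only the old one does, because "[^/]+" cannot span the two path segments "bar/xyz" the way ".*?" can.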


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370685 ] 

Jerome Charron commented on NUTCH-233:
--------------------------------------

Stefan,

I have created a small unit test for urlfilter-regexp and I don't see any incompatibility between java.util.regex and this regexp. Could you please provide the URLs that cause the problem so that I can add them to my unit tests?
Thanks

Jérôme




[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370686 ] 

Stefan Groschupf commented on NUTCH-233:
----------------------------------------

Sorry, I don't have such a URL, since the hang happens while reducing a fetch. The reduce phase provides no logging, and the map data is deleted if the job fails because of a timeout. :(
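
One way to capture an offending URL in a situation like this is to put a time budget on each regex evaluation and report the input when the budget is exceeded. The following is a minimal sketch of that idea, not Nutch code: the class names, the exception type and the budget parameter are all hypothetical. It relies on java.util.regex reading its input through CharSequence.charAt, so a wrapper can check a deadline on every character access.

```java
import java.util.regex.Pattern;

public class GuardedRegexMatch {
    /** Hypothetical exception raised when matching exceeds its time budget. */
    public static class RegexTimeoutException extends RuntimeException {
        public RegexTimeoutException(String msg) { super(msg); }
    }

    /** CharSequence view that checks a deadline on every character access. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // Overflow-safe comparison of System.nanoTime() values.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex time budget exceeded for: " + inner);
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return inner.toString(); }
    }

    /** Runs find() on the URL, aborting once budgetMillis has elapsed. */
    public static boolean findWithin(Pattern p, String url, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        return p.matcher(new DeadlineCharSequence(url, deadline)).find();
    }
}
```

A caller could catch RegexTimeoutException, log its message (which contains the URL), and treat the URL as rejected instead of letting the task hang until it times out.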





[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12427677 ] 
            
Otis Gospodnetic commented on NUTCH-233:
----------------------------------------

I haven't noticed this regexp being a problem so far either, but maybe I've just been lucky not to have run into a bot-trap site yet.  Is this still a problem for you, Stefan?

> wrong regular expression hang reduce process for ever
> -----------------------------------------------------
>
>                 Key: NUTCH-233
>                 URL: http://issues.apache.org/jira/browse/NUTCH-233
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.9.0


        

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] 
            
Stefan Groschupf commented on NUTCH-233:
----------------------------------------

I think this should be fixed in 0.8 too, since everybody who does a real whole-web crawl of over 100 million pages will run into this problem. It is triggered, for example, by URLs generated by spam bots.



> wrong regular expression hang reduce process for ever
> -----------------------------------------------------
>
>                 Key: NUTCH-233
>                 URL: http://issues.apache.org/jira/browse/NUTCH-233
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8-dev
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.9-dev


        

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] 
            
Stefan Groschupf commented on NUTCH-233:
----------------------------------------

Hi Otis, 
Yes, for a serious whole-web crawl I need to change this regex first.
It only hangs on certain URLs, for example ones that come from link farms the crawler runs into.

> wrong regular expression hang reduce process for ever
> -----------------------------------------------------
>
>                 Key: NUTCH-233
>                 URL: http://issues.apache.org/jira/browse/NUTCH-233
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.9.0


        

[jira] Updated: (NUTCH-233) wrong regular expression hang reduce process for ever

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-233?page=all ]

Sami Siren updated NUTCH-233:
-----------------------------

    Fix Version/s: 0.9-dev
                       (was: 0.8-dev)

> wrong regular expression hang reduce process for ever
> -----------------------------------------------------
>
>                 Key: NUTCH-233
>                 URL: http://issues.apache.org/jira/browse/NUTCH-233
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8-dev
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.9-dev
