Posted to dev@nutch.apache.org by "Rod Taylor (JIRA)" <ji...@apache.org> on 2005/12/31 19:14:02 UTC

[jira] Created: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

Use standard Java Regex library rather than org.apache.oro.text.regex
---------------------------------------------------------------------

         Key: NUTCH-160
         URL: http://issues.apache.org/jira/browse/NUTCH-160
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev    
    Reporter: Rod Taylor
 Attachments: regex.patch

org.apache.oro.text.regex is based on Perl 5.003, which performs poorly in some corner cases. The standard Java regular expression library (java.util.regex, Java 1.4 and later) does not seem to have these problems.
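
For illustration only, here is a minimal sketch of URL filtering built on java.util.regex. It is not the contents of regex.patch: the class name RegexUrlFilterSketch, the "+regex" / "-regex" rule syntax, and the first-match-wins behaviour are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and rule syntax are assumptions, not the committed patch.
class RegexUrlFilterSketch {

    // One rule: accept or reject any URL in which the pattern is found.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            // java.util.regex.Pattern is part of the standard library since Java 1.4.
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    // Adds a rule written as "+regex" (accept) or "-regex" (reject).
    RegexUrlFilterSketch add(String line) {
        char sign = line.charAt(0);
        if (sign != '+' && sign != '-') {
            throw new IllegalArgumentException("rule must start with + or -: " + line);
        }
        rules.add(new Rule(sign == '+', line.substring(1)));
        return this;
    }

    // Returns the URL if the first matching rule accepts it, otherwise null.
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched: reject
    }
}
```

With the rules "-\.(gif|jpg|png)$" followed by "+^https?://", an http URL ending in .gif is rejected by the first rule, while other http and https URLs are accepted by the second.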

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

Posted by "Rod Taylor (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361472 ] 

Rod Taylor commented on NUTCH-160:
----------------------------------

This patch also appears to fix the issue reported to the mailing list on November 18th under the subject "Urlfilter bug (doesn't return on long URLs)", in which abnormally long URLs cause a timeout in the URLFilter.




[jira] Resolved: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-160?page=all ]
     
Doug Cutting resolved NUTCH-160:
--------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed

I just committed this patch.  Thanks!




[jira] Updated: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

Posted by "Rod Taylor (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-160?page=all ]

Rod Taylor updated NUTCH-160:
-----------------------------

    Attachment: regex.patch

Patch for RegexURLFilter.java




[jira] Commented: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ] 

Doug Cutting commented on NUTCH-160:
------------------------------------

+1

I like this patch.  I don't see a need for us to use oro anywhere, since Java now has good built-in regex support.  And Java's regexes are faster in many cases, not just this one:

http://tbray.org/ongoing/When/200x/2004/08/22/PJre

There are a few places in which Java's regexes are incompatible with Perl 5 regexes, documented in the "Comparison to Perl 5" section of:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

So this change is not completely backward compatible.
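
As a rough way to spot existing filter expressions that java.util.regex will not accept, something like the following could be used. This is a sketch, and the class and method names (PatternCompatCheck, compileError) are made up for this example. It only detects constructs that fail to compile, such as the Perl embedded comment syntax (?#comment) or conditional constructs, both of which the Pattern documentation lists as unsupported. It cannot detect expressions that compile but match differently.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper; not part of Nutch or of regex.patch.
class PatternCompatCheck {

    // Returns null if java.util.regex compiles the expression,
    // otherwise the description of the syntax error.
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Each argument is treated as one regular expression to check.
        for (String regex : args) {
            String error = compileError(regex);
            System.out.println(regex + " : " + (error == null ? "ok" : error));
        }
    }
}
```

Running each rule from an existing filter configuration through compileError before switching libraries would at least catch the syntax-level differences.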

Any objections?

