Posted to dev@tika.apache.org by "Niall Pemberton (JIRA)" <ji...@apache.org> on 2007/11/27 04:35:44 UTC
[jira] Created: (TIKA-106) Remove dependency on Jakarta ORO - use JDK 1.4 Regex
Remove dependency on Jakarta ORO - use JDK 1.4 Regex
----------------------------------------------------
Key: TIKA-106
URL: https://issues.apache.org/jira/browse/TIKA-106
Project: Tika
Issue Type: Task
Components: general
Reporter: Niall Pemberton
Priority: Minor
Jakarta ORO is used in only one place in Tika: the extract() method of RegexUtils, which is itself called from only one place, ParserPostProcessor. JDK 1.4 introduced built-in regular expression support, and changing RegexUtils to use it would remove the need for Jakarta ORO as a dependency.
From the comments in RegexUtils it appears that this code was copied from Nutch's OutlinkExtractor[1]. There seems to have been a similar move in Nutch back in March in r516754[2], but it was reverted the next day in r517015[3]. I couldn't find anything on the Nutch dev list to explain this, except possibly this post: http://tinyurl.com/2s2y9r
[1] http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java
[2] http://svn.apache.org/viewvc?view=rev&revision=516754
[3] http://svn.apache.org/viewvc?view=rev&revision=517015
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-106) Remove dependency on Jakarta ORO - use JDK 1.4 Regex
Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niall Pemberton updated TIKA-106:
---------------------------------
Attachment: TIKA-106-remove-ORO-dependency-v2.patch
Attaching v2 of the patch (the first version didn't remove the ORO dependency from pom.xml).
[jira] Resolved: (TIKA-106) Remove dependency on Jakarta ORO - use JDK 1.4 Regex
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-106.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.1-incubator
Patch committed in revision 606140. Thanks!
[jira] Updated: (TIKA-106) Remove dependency on Jakarta ORO - use JDK 1.4 Regex
Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niall Pemberton updated TIKA-106:
---------------------------------
Attachment: TIKA-106-remove-ORO-dependency-v1.patch
Attaching a patch to remove the Jakarta ORO dependency. It also renames the RegexUtils method from extract() to extractLinks(), which now uses a pre-compiled regular expression Pattern (thread-safe according to the Javadocs). I've added a test case, which is basically copied from Nutch's OutlinkExtractor test case[1]. Note that a "utils" directory needs to be added to the test source tree before the patch can be applied.
[1] http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/test/org/apache/nutch/parse/TestOutlinkExtractor.java
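For readers without the attachment to hand, here is a rough sketch of what a java.util.regex based extractLinks() with a pre-compiled Pattern might look like. This is not the code from the patch: the class name LinkExtractorSketch and the URL expression are assumptions made up for illustration, and the sketch uses generics and the diamond operator, so it needs a newer JDK than 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class LinkExtractorSketch {

    // Hypothetical, deliberately simple URL expression (an assumption, not the
    // expression used in the patch). Compiled once: Pattern instances are
    // immutable and safe to share between threads.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?:https?|ftp)://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    private LinkExtractorSketch() {
        // utility class, no instances
    }

    /** Returns every substring of content that matches URL_PATTERN, in order. */
    public static List<String> extractLinks(String content) {
        List<String> links = new ArrayList<>();
        if (content == null) {
            return links;
        }
        // A Matcher is NOT thread safe, so a new one is created per call.
        Matcher matcher = URL_PATTERN.matcher(content);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "See http://example.org/a and ftp://example.org/b for details";
        for (String link : extractLinks(text)) {
            System.out.println(link);
        }
    }
}
```

The point of the static final field is that the relatively expensive compile step happens once, while each call only pays for a short-lived Matcher.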
[jira] Updated: (TIKA-106) Remove dependency on Jakarta ORO - use JDK 1.4 Regex
Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niall Pemberton updated TIKA-106:
---------------------------------
Attachment: (was: TIKA-106-remove-ORO-dependency-v1.patch)
[jira] Updated: (TIKA-106) Remove dependency on Jakarta ORO - use JDK 1.4 Regex
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated TIKA-106:
-------------------------------
Assignee: Jukka Zitting
Issue Type: Improvement (was: Task)