Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/07/02 12:27:50 UTC
[jira] Created: (NUTCH-840) Port tests from parse-html to parse-tika
Port tests from parse-html to parse-tika
----------------------------------------
Key: NUTCH-840
URL: https://issues.apache.org/jira/browse/NUTCH-840
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.0
We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-840) Port tests from parse-html to parse-tika
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-840:
--------------------------------
Attachment: NUTCH-840.patch
This patch adds the HTML tests to the Tika parser.
The tests currently rely on some DOM-related code from Neko-HTML, which introduces a dependency on the lib-nekohtml plugin.
Apart from parse-tika, lib-nekohtml is used only in clustering-carrot, which will be removed shortly. Once that is done, we can delete lib-nekohtml as well and then either:
a) add the neko jar to the parse-tika lib via Ivy
b) replace it with another implementation already available from the Tika dependencies or the main Nutch dependencies (e.g. dom4j)
> Port tests from parse-html to parse-tika
> ----------------------------------------
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
> Issue Type: Task
> Components: parser
> Affects Versions: 1.1
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-840.patch
>
>
> We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.