Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/02/17 17:04:24 UTC

[jira] Created: (NUTCH-967) Upgrade to Tika 0.9

Upgrade to Tika 0.9
-------------------

                 Key: NUTCH-967
                 URL: https://issues.apache.org/jira/browse/NUTCH-967
             Project: Nutch
          Issue Type: Task
          Components: parser
    Affects Versions: 1.3, 2.0
            Reporter: Markus Jelsma
             Fix For: 1.3, 2.0




-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016839#comment-13016839 ] 

Julien Nioche commented on NUTCH-967:
-------------------------------------

OK guys, I've attached a new version of the patch which updates the content of plugin.xml (thanks Gabriele for reminding me). The tests now run successfully.

The reason why I got it to work on my other machine was probably that I had done it properly then (grin).

There might be a few things we could do on the tika-parser, like getting rid of the duplicate Tika config object, but we'll track that in a separate JIRA if necessary.

Leaving the issue open as we still need to update the trunk. In the meantime I'll commit this patch on 1.3 tomorrow unless of course someone finds a good reason not to.

Thanks!

Julien 


 

> Upgrade to Tika 0.9
> -------------------
>
>                 Key: NUTCH-967
>                 URL: https://issues.apache.org/jira/browse/NUTCH-967
>             Project: Nutch
>          Issue Type: Task
>          Components: parser
>    Affects Versions: 1.3, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>             Fix For: 1.3, 2.0
>
>         Attachments: NUTCH-967-1.3-2.patch, NUTCH-967-1.3-3.patch, NUTCH-967-1.3.patch
>
>



[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012993#comment-13012993 ] 

Markus Jelsma commented on NUTCH-967:
-------------------------------------

I applied your patch (it seems I didn't properly reconfigure the plugin's own ivy.xml). Running the usual ant test passes all tests. Running the test manually on parse-zip's build.xml fails like this:

markus@midas:~/projects/apache/nutch/branches/branch-1.3$ ant -f src/plugin/parse-zip/build.xml test
Buildfile: src/plugin/parse-zip/build.xml

BUILD FAILED
/home/markus/projects/apache/nutch/branches/branch-1.3/src/plugin/parse-zip/build.xml:20: The following error occurred while executing this line:
/home/markus/projects/apache/nutch/branches/branch-1.3/src/plugin/build-plugin.xml:46: Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
Cause: The name is undefined.
Action: Check the spelling.
Action: Check that any custom tasks/types have been declared.
Action: Check that any <presetdef>/<macrodef> declarations have taken place.
No types or tasks have been defined in this namespace yet

This appears to be an antlib declaration. 
Action: Check that the implementing library exists in one of:
        -/usr/share/ant/lib
        -/home/markus/.ant/lib
        -a directory added on the command line with the -lib argument


Total time: 0 seconds

Is this what you're getting as well? I have the same Java version, my ant is a bit more recent though:
Apache Ant version 1.7.1 compiled on September 8 2010
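
Ant's message above suggests checking whether the library that implements the antlib is present in one of the listed directories. A minimal sketch of that check in Python follows. It assumes the Ivy jar is named something like ivy*.jar; the helper name find_ivy_jars and the ~/.ant/lib shorthand are illustrative only, and the thread does not establish that a missing jar was the actual cause here.

```python
import glob
import os


def find_ivy_jars(directories, pattern="ivy*.jar"):
    """Return jars matching `pattern` in the given directories.

    The pattern is an assumption about how the Ivy jar is named.
    """
    found = []
    for directory in directories:
        expanded = os.path.expanduser(directory)
        found.extend(sorted(glob.glob(os.path.join(expanded, pattern))))
    return found


if __name__ == "__main__":
    # Directories taken from the Ant error message (home directory generalised).
    candidates = ["/usr/share/ant/lib", "~/.ant/lib"]
    jars = find_ivy_jars(candidates)
    print(jars if jars else "no Ivy jar found in " + ", ".join(candidates))
```

An empty result would be consistent with Ant being unable to define the antlib:org.apache.ivy.ant:settings type when the plugin build file is run on its own.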



[jira] [Issue Comment Edited] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016817#comment-13016817 ] 

Markus Jelsma edited comment on NUTCH-967 at 4/7/11 11:52 AM:
--------------------------------------------------------------

Julien, your patch didn't include the Apache James Mime4j jar. It seems Tika 0.9 (or earlier) has this dependency. I've modified parse-tika's plugin.xml and ivy.xml to use the 0.9 parsers and to load and copy the Mime4j jar. This patch is built on top of your patch! I can now parse without having parse-html enabled and everything seems fine.

Running ant test still looks fine, but our issue with parse-zip and your failing tests remain.

EDIT: for some reason Apache Mime4j 0.6 and tika-parsers 0.7 are copied over as well, and there is a tika-core-0.7 coming along too!? I double-checked for occurrences but can't find any. Bahh

      was (Author: markus17):
    Julien, your patch didn't include the Apache James Mime4j jar. It seems Tika 0.9 (or earlier) has this dependency. I've modified parse-tika's plugin.xml and ivy.xml to use the 0.9 parsers and to load and copy the Mime4j jar. This patch is built on top of your patch! I can now parse without having parse-html enabled and everything seems fine.

Doing an ant test still looks fine but our issue with parse-zip and your failing tests remain. 
  

[jira] Commented: (NUTCH-967) Upgrade to Tika 0.9

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008453#comment-13008453 ] 

Markus Jelsma commented on NUTCH-967:
-------------------------------------

That didn't show up in the tests nor in a crawl, but I'm not using parse-zip anyway. How do we proceed with a fix?


[jira] [Updated] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-967:
--------------------------------

    Attachment: NUTCH-967-1.3-3.patch

Updated patch which changes plugin.xml so that it reflects the dependencies used by Tika 0.9


[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016832#comment-13016832 ] 

Julien Nioche commented on NUTCH-967:
-------------------------------------

bq. Julien, why doesn't your patch modify tika-parse plugin.xml to use tika-parsers-0.9 instead of tika-parsers-0.7?

Why? Because I forgot to modify it, that's why. Note that the dependencies inherited from Tika should be updated in plugin.xml, not just the main Tika one.

bq. Julien, your patch didn't include the Apache James Mime4j jar. It seems Tika 0.9 (or earlier) has this dependency. 

It seems to be pulled along with the other dependencies (try 'ant -f src/plugin/parse-tika/build-ivy.xml' and look at the content of the lib dir in parse-tika) so I don't think we need to do anything special about it - apart from adding it to plugin.xml like everything else.

Will look at the tika-parser.0.7 issue
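
The point about keeping plugin.xml in step with the jars that Ivy pulls in can be checked mechanically. The following is a minimal sketch in Python, assuming plugin.xml declares each jar as a library element with a name attribute and that the resolved jars are available as a list of file names. The function names and the sample file names used with them are hypothetical and are not part of Nutch.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed jar naming scheme: <artifact>-<version>.jar, version starting with a digit.
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.]*)\.jar$")


def declared_libraries(plugin_xml_text):
    """Collect the name attribute of every library element in plugin.xml."""
    root = ET.fromstring(plugin_xml_text)
    names = set()
    for element in root.iter():
        # Ignore any XML namespace prefix on the tag.
        if element.tag.rsplit("}", 1)[-1] == "library" and element.get("name"):
            names.add(element.get("name"))
    return names


def compare(plugin_xml_text, jars_on_disk):
    """Return (declared but not on disk, on disk but not declared)."""
    declared = declared_libraries(plugin_xml_text)
    on_disk = set(jars_on_disk)
    return sorted(declared - on_disk), sorted(on_disk - declared)


def duplicate_versions(jars):
    """Report artifacts that are present in more than one version."""
    versions = defaultdict(set)
    for jar in jars:
        match = JAR_PATTERN.match(jar)
        if match:
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

Called with the text of parse-tika's plugin.xml and a listing of the plugin's lib dir, compare would flag jars such as Mime4j that are resolved but not declared, and duplicate_versions would flag a stray tika-core-0.7 sitting next to 0.9. The plugin's own jar is built rather than resolved, so it is expected to show up as declared but not on disk.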







[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016855#comment-13016855 ] 

Markus Jelsma commented on NUTCH-967:
-------------------------------------

Dependencies are in place now and all is well in 1.3, and various Boilerplate extractors work as well. If you commit, perhaps it's wise to add parse-tika for HTML MIME types to parse-plugins (now there's only parse-html). Then we only have to set parse.includes in our configuration to switch between parse-tika and parse-html (if there is any need to).


[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012973#comment-13012973 ] 

Julien Nioche commented on NUTCH-967:
-------------------------------------

Surprisingly, I managed to run the tests on a different machine. The versions of Java and Ant on the laptop where the parse-zip test fails are:

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Server VM (build 19.1-b02, mixed mode)

Apache Ant version 1.7.1 compiled on July 2 2010



[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056992#comment-13056992 ] 

Hudson commented on NUTCH-967:
------------------------------

Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/])
    


[jira] [Updated] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-967:
--------------------------------

    Attachment: NUTCH-967-1.3-2.patch

Patch to include Apache James Mime4j as a Tika dependency.


[jira] [Updated] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-967:
--------------------------------

    Attachment: NUTCH-967-1.3.patch

patch for Tika 0.9 on Nutch 1.3

currently fails when running the tests e.g. ant -f src/plugin/parse-zip/build.xml test


[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017023#comment-13017023 ] 

Julien Nioche commented on NUTCH-967:
-------------------------------------

bq. perhaps it's wise to add parse-tika for HTML MIME types to parse-plugins (now there's only parse-html). Then we only have to set parse.includes in our configuration to switch between parse-tika and parse-html (if there is any need to).

If you look at parse-tika's plugin.xml you'll see that it is associated with any mimetype (value=*), so it does not need to be explicitly linked to the html mimetype in parse-plugins. I don't remember the exact mechanism; I think it tries to use any parser set explicitly for a given mime-type, and if there aren't any or if they fail, then it tries the default parsers.

We need to test parse-tika on html to make sure that it behaves exactly like the legacy html parser, which wasn't the case with 0.7; in the meantime it is advised to use parse-html and so we'll leave parse-html in parse-plugins.xml

Hope it makes sense
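
The fallback order described above (parsers set explicitly for a mime-type first, then the default wildcard parsers) can be sketched as follows. This is only an illustration in Python of the behaviour as recalled in the comment, not Nutch's actual code; the function names and the dictionary-based registry are hypothetical.

```python
def candidate_parsers(mime_type, explicit, defaults):
    """Order parsers: those mapped explicitly to the mime-type, then the defaults."""
    ordered = list(explicit.get(mime_type, []))
    ordered += [name for name in defaults if name not in ordered]
    return ordered


def parse_with_fallback(mime_type, content, explicit, defaults, parsers):
    """Try each candidate in order and return (parser name, result) for the first success."""
    for name in candidate_parsers(mime_type, explicit, defaults):
        try:
            return name, parsers[name](content)
        except Exception:
            # A failing parser is skipped and the next candidate is tried.
            continue
    raise ValueError("no parser succeeded for " + mime_type)
```

With an explicit mapping of text/html to parse-html and parse-tika registered as the wildcard default, HTML would go to parse-html first and fall back to parse-tika, while a type with no explicit mapping would go straight to parse-tika.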





[jira] Assigned: (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-967:
-----------------------------------

    Assignee: Julien Nioche


[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012999#comment-13012999 ] 

Julien Nioche commented on NUTCH-967:
-------------------------------------

Strange. The call to ant test should have failed. Could you try calling ant test-plugins?
I am not getting the same error as you: the test for parse-zip runs but fails. In your case there seems to be some configuration problem, as it does not start at all.
 


[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016465#comment-13016465 ] 

Gabriele Kahlout commented on NUTCH-967:
----------------------------------------

Julien, why doesn't your patch modify tika-parse plugin.xml to use tika-parsers-0.9 instead of tika-parsers-0.7?
When I try to do so, I get an exception (for both HTML and PDFs):

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)

It's enough to set it back to 0.7 to have it work. This is not an issue with html only but also pdfs.


[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013006#comment-13013006 ] 

Markus Jelsma commented on NUTCH-967:
-------------------------------------

ant test-plugins

BUILD SUCCESSFUL
Total time: 20 seconds



[jira] [Resolved] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-967.
---------------------------------

    Resolution: Fixed

trunk : Committed revision 1090181
1.3 : Committed revision 1090182




[jira] [Issue Comment Edited] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016817#comment-13016817 ] 

Markus Jelsma edited comment on NUTCH-967 at 4/7/11 11:39 AM:
--------------------------------------------------------------

Julien, your patch didn't include the Apache James Mime4j jar. It seems Tika 0.9 (or earlier) has this dependency. I've modified parse-tika's plugin.xml and ivy.xml to use the 0.9 parsers and to load and copy the Mime4j jar. This patch is built on top of your patch! I can now parse without having parse-html enabled and everything seems fine.

Doing an ant test still looks fine but our issue with parse-zip and your failing tests remain. 

      was (Author: markus17):
    Patch to include Apache James Mime4j as a Tika dependency.
  

[jira] Commented: (NUTCH-967) Upgrade to Tika 0.9

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995910#comment-12995910 ] 

Julien Nioche commented on NUTCH-967:
-------------------------------------

Note: Tika 0.9 causes the parse-zip plugin to crash in 1.3 (it hasn't been ported to 2.0 yet)


[jira] [Closed] (NUTCH-967) Upgrade to Tika 0.9

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-967.
-------------------------------

