Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/07/13 12:11:00 UTC

[jira] [Created] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

MimeUtil to rely on default config provided by Tika
---------------------------------------------------

                 Key: NUTCH-1045
                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.4
            Reporter: Julien Nioche
             Fix For: 2.0


We currently ship conf/tika-mimetypes.xml even though it is identical to the one found in tika-core.jar.
A mechanism for specifying a custom tika-mimetypes.xml is useful, but if the user hasn't specified one, or if it can't be loaded, we should rely on Tika's default. This way we no longer need to provide conf/tika-mimetypes.xml or keep it in sync with the default Tika one whenever we upgrade Tika.
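
The fallback described above can be sketched roughly as follows. This is only an illustration of the pattern, not the code from the attached patches: FallbackLoader, Parser, loadOrDefault and parseNonEmpty are hypothetical names, and plain strings stand in for the MIME type registry so that the example needs nothing beyond the JDK (Java 9 or later).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Hypothetical helper: use a custom definition if it loads, else the built-in default. */
final class FallbackLoader {

    /** Hypothetical parser callback; it may fail on a malformed or empty file. */
    interface Parser<T> {
        T parse(InputStream in) throws IOException;
    }

    /**
     * Returns the parsed custom definition, or the built-in one when the
     * custom stream is absent (null) or cannot be parsed.
     */
    static <T> T loadOrDefault(InputStream custom, Parser<T> parser, Supplier<T> builtIn) {
        if (custom == null) {
            return builtIn.get();          // nothing configured or file not found
        }
        try (InputStream in = custom) {
            return parser.parse(in);
        } catch (IOException e) {
            return builtIn.get();          // configured but unusable
        }
    }

    /** Stand-in parser for the example: rejects an empty file. */
    static String parseNonEmpty(InputStream in) throws IOException {
        byte[] bytes = in.readAllBytes();
        if (bytes.length == 0) {
            throw new IOException("empty definition file");
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Supplier<String> builtIn = () -> "built-in definitions";
        InputStream missing = null;
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        InputStream custom =
            new ByteArrayInputStream("custom definitions".getBytes(StandardCharsets.UTF_8));

        System.out.println(loadOrDefault(missing, FallbackLoader::parseNonEmpty, builtIn));
        System.out.println(loadOrDefault(empty, FallbackLoader::parseNonEmpty, builtIn));
        System.out.println(loadOrDefault(custom, FallbackLoader::parseNonEmpty, builtIn));
    }
}
```

In a real MimeUtil the supplier would presumably hand back Tika's own default registry and the parser would read the user's tika-mimetypes.xml; those details are assumptions here and are deliberately left out.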

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067794#comment-13067794 ] 

Julien Nioche commented on NUTCH-1045:
--------------------------------------

{quote}
Maybe because the empty file is still included in the job file?
{quote}

Do you mean that the job file contains an empty tika-mimetypes.xml? Would you mind running the parsing again after it has been removed, and adding a debug line on line 175 to check that the Tika detection is done?

{quote}
I'm a big proponent of detection and of never trusting returned meta tags or headers.
{quote}

Having the option to choose which strategy to adopt would be better, a bit like what we need to do for the language ID. I have recently seen cases with RSS feeds where the server simply says the content is text/xml (which in a way is true), whereas Tika would have detected application/rss+xml. The new Detection API in Tika would allow us to do that rather neatly.
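
A configurable strategy of this kind could look something like the sketch below. It assumes only two sources, the server-declared type and a detected type, and falls back to the other source when the preferred one is missing. MimeTypeChooser, Strategy and choose are hypothetical names, not part of Nutch or Tika, and no actual detection is performed here.

```java
/** Hypothetical chooser between the server-declared and the detected MIME type. */
final class MimeTypeChooser {

    /** Which source wins when both are available (illustrative names). */
    enum Strategy { TRUST_SERVER, TRUST_DETECTION }

    /** Generic "unknown" type, used when neither source is usable. */
    static final String UNKNOWN = "application/octet-stream";

    static String choose(Strategy strategy, String serverType, String detectedType) {
        boolean serverFirst = strategy == Strategy.TRUST_SERVER;
        String preferred = serverFirst ? serverType : detectedType;
        String fallback = serverFirst ? detectedType : serverType;
        if (isUsable(preferred)) {
            return preferred;
        }
        if (isUsable(fallback)) {
            return fallback;
        }
        return UNKNOWN;
    }

    /** A type is usable if it is present and more specific than UNKNOWN. */
    private static boolean isUsable(String type) {
        return type != null && !type.trim().isEmpty() && !UNKNOWN.equals(type);
    }

    public static void main(String[] args) {
        // The RSS case from the comment: the server says text/xml,
        // detection says application/rss+xml.
        System.out.println(choose(Strategy.TRUST_SERVER, "text/xml", "application/rss+xml"));
        System.out.println(choose(Strategy.TRUST_DETECTION, "text/xml", "application/rss+xml"));
    }
}
```

With TRUST_SERVER the RSS feed stays text/xml; with TRUST_DETECTION it becomes application/rss+xml, which is the behaviour the comment argues should be selectable.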


> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is absolutely similar to the one found in tika-core.jar
> Having a mechanism for specifying a custom tika-mimetypes.xml is good though but if the user hasn't specified one or if it can't be loaded then we should rely on Tika's default. This way we won't need to provide conf/tika-mimetypes.xml anymore and keep it in sync with the default Tika one whenever we upgrade Tika.


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067779#comment-13067779 ] 

Markus Jelsma commented on NUTCH-1045:
--------------------------------------

Strange, the only errors I can find in the various map task logs are the usual PDF files. Maybe because the empty file is still included in the job file?

On your last point: I'm a big proponent of detection and of never trusting returned meta tags or headers.


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067773#comment-13067773 ] 

Julien Nioche commented on NUTCH-1045:
--------------------------------------

You should see a message in the logs at the beginning of a task saying something like

{quote}
 LOG.error("Can't load mime.types.file : tika-mimetypes.xml using Tika's default"); 
{quote}

and get the right MIME-type counts (although I am not sure that we currently report these in 1.4).

The problem is that in most cases the MIME type is guessed from the information returned by the server, not from Tika's detection. The best way to make sure that Tika successfully falls back on the default settings for guessing would be to add a LOG entry on line 175 of MimeUtil with the MIME type found.

BTW this class is in serious need of refactoring, as the underlying Tika API has changed a lot. The logic around which strategy to use (trust the metadata returned by the server? trust Tika's detection? etc.) should be reimplemented using the Detector implementations. I will open a new JIRA for this.




        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064520#comment-13064520 ] 

Markus Jelsma commented on NUTCH-1045:
--------------------------------------

+1, works like a charm!

> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is absolutely similar to the one found in tika-core.jar
> Having a mechanism for specifying a custom tika-mimetypes.xml is good though but if the user hasn't specified one or if it can't be loaded then we should rely on Tika's default. This way we won't need to provide conf/tika-mimetypes.xml anymore and keep it in sync with the default Tika one whenever we upgrade Tika.


        

[jira] [Closed] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1045.
--------------------------------


Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
                
> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, nutchgora
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, nutchgora
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is absolutely similar to the one found in tika-core.jar
> Having a mechanism for specifying a custom tika-mimetypes.xml is good though but if the user hasn't specified one or if it can't be loaded then we should rely on Tika's default. This way we won't need to provide conf/tika-mimetypes.xml anymore and keep it in sync with the default Tika one whenever we upgrade Tika.


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067912#comment-13067912 ] 

Julien Nioche commented on NUTCH-1045:
--------------------------------------

Great, seems to be working fine then. Thanks for taking the time to test.
Re the numerous log entries: I'd have expected the MimeRegistry initialisation to be done once and then cached; however, since we're in the Fetcher, it is probably done once per thread (10 above; was that the number of threads you had?). Anyway, I'll have another look tomorrow before committing. Thanks again!
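
For illustration, one common way to get "initialise once, then cache" across all threads in a JVM is the initialization-on-demand holder idiom. The sketch below is not how MimeUtil is written; SharedRegistry, load and loadCount are hypothetical names, and a string stands in for the real registry.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical registry that is loaded once per JVM, however many threads ask for it. */
final class SharedRegistry {

    /** Counts how often the (stand-in) expensive load actually ran. */
    private static final AtomicInteger LOADS = new AtomicInteger();

    /** The JVM initialises this nested class at most once, on first access. */
    private static final class Holder {
        static final String REGISTRY = load();
    }

    /** Stand-in for reading and parsing the MIME type definitions. */
    private static String load() {
        LOADS.incrementAndGet();
        return "mime-registry";
    }

    static String get() {
        return Holder.REGISTRY;
    }

    static int loadCount() {
        return LOADS.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[10];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(SharedRegistry::get);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("loads: " + loadCount());
    }
}
```

With per-instance initialisation, by contrast, each thread that builds its own instance repeats the load and its log message, which would be consistent with the repeated entries in the log snippet.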


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067918#comment-13067918 ] 

Markus Jelsma commented on NUTCH-1045:
--------------------------------------

No, that's a few hundred threads; it was just a snippet from the log. I'll run it again should you need me to. If not, I'll run it anyway whenever you commit.

Cheers!


        

[jira] [Issue Comment Edited] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067861#comment-13067861 ] 

Markus Jelsma edited comment on NUTCH-1045 at 7/19/11 5:50 PM:
---------------------------------------------------------------

Here it is, in the fetcher's mapper job:
INFO org.apache.nutch.util.MimeUtil: Detected MIME-type: application/xhtml+xml

There's also a lot of this in the log:
{code}
2011-07-19 17:43:20,837 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,845 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,846 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,856 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,857 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,858 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,859 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,860 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,861 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,863 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found

{code}


        

[jira] [Updated] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1045:
---------------------------------

    Attachment: NUTCH-1045-1.4.patch

> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is absolutely similar to the one found in tika-core.jar
> Having a mechanism for specifying a custom tika-mimetypes.xml is good though but if the user hasn't specified one or if it can't be loaded then we should rely on Tika's default. This way we won't need to provide conf/tika-mimetypes.xml anymore and keep it in sync with the default Tika one whenever we upgrade Tika.


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067745#comment-13067745 ] 

Markus Jelsma commented on NUTCH-1045:
--------------------------------------

I'm uploading the patched job file now. There's a large parse job about to start. 


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070930#comment-13070930 ] 

Hudson commented on NUTCH-1045:
-------------------------------

Integrated in Nutch-trunk #1557 (See [https://builds.apache.org/job/Nutch-trunk/1557/])
    NUTCH-1045 Mimeutil uses default Tika config unless overridden

jnioche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1150670
Files : 
* /nutch/trunk/conf/tika-mimetypes.xml
* /nutch/trunk/src/java/org/apache/nutch/util/MimeUtil.java
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/CHANGES.txt


> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is absolutely similar to the one found in tika-core.jar
> Having a mechanism for specifying a custom tika-mimetypes.xml is good though but if the user hasn't specified one or if it can't be loaded then we should rely on Tika's default. This way we won't need to provide conf/tika-mimetypes.xml anymore and keep it in sync with the default Tika one whenever we upgrade Tika.


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065806#comment-13065806 ] 

Julien Nioche commented on NUTCH-1045:
--------------------------------------

Does not pass the tests - will investigate later

{quote}
Testsuite: org.apache.nutch.protocol.TestContent
Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 0.127 sec
------------- Standard Output ---------------
2011-07-15 10:16:34,767 INFO  conf.Configuration (Configuration.java:getConfResourceAsInputStream(941)) - tika-mimetypes.xml not found
2011-07-15 10:16:34,784 ERROR util.MimeUtil (MimeUtil.java:<init>(71)) - Can't load mime.types.file : tika-mimetypes.xml using Tika's default
------------- ---------------- ---------------

Testcase: testContent took 0.109 sec
Testcase: testGetContentType took 0.005 sec
	FAILED
null expected:<[text/html]> but was:<[application/octet-stream]>
junit.framework.ComparisonFailure: null expected:<[text/html]> but was:<[application/octet-stream]>
	at org.apache.nutch.protocol.TestContent.testGetContentType(TestContent.java:72)
{quote}

> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is absolutely similar to the one found in tika-core.jar
> Having a mechanism for specifying a custom tika-mimetypes.xml is good though but if the user hasn't specified one or if it can't be loaded then we should rely on Tika's default. This way we won't need to provide conf/tika-mimetypes.xml anymore and keep it in sync with the default Tika one whenever we upgrade Tika.


        

[jira] [Updated] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1045:
---------------------------------

    Attachment: NUTCH-1045-1.4-v2.patch

New version of the patch which passes the tests.
Any volunteers to test that in distributed mode? 


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067755#comment-13067755 ] 

Markus Jelsma commented on NUTCH-1045:
--------------------------------------

Are you looking for something specific? It's still running smoothly, parsing ~100 documents per second.


        

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067861#comment-13067861 ] 

Markus Jelsma commented on NUTCH-1045:
--------------------------------------

Here it is, in the fetcher's mapper job:
INFO org.apache.nutch.util.MimeUtil: Detected MIME-type: application/xhtml+xml



        

[jira] [Issue Comment Edited] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067861#comment-13067861 ] 

Markus Jelsma edited comment on NUTCH-1045 at 7/19/11 5:50 PM:
---------------------------------------------------------------

Here it is, in the fetcher's mapper job:
INFO org.apache.nutch.util.MimeUtil: Detected MIME-type: application/xhtml+xml

There's also a lot of this in the log:
{code}
2011-07-19 17:43:20,834 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,835 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,837 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,845 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,846 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,856 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,857 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,858 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,859 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,860 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,861 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,863 INFO org.apache.hadoop.conf.Configuration: tika-mimetypes.xml not found
2011-07-19 17:43:20,895 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,896 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2011-07-19 17:43:20,899 ERROR org.apache.nutch.util.MimeUtil: Can't load mime.types.file : tika-mimetypes.xml using Tika's default


{code}

> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is absolutely similar to the one found in tika-core.jar
> Having a mechanism for specifying a custom tika-mimetypes.xml is good though but if the user hasn't specified one or if it can't be loaded then we should rely on Tika's default. This way we won't need to provide conf/tika-mimetypes.xml anymore and keep it in sync with the default Tika one whenever we upgrade Tika.
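
The fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the committed patch: the class MimeConfigFallback, the resolve method, the BUILT_IN placeholder and the opener function are hypothetical names, and no Nutch or Tika API is used. Treating an empty file as unusable is likewise an assumption made for the sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Function;

/** Hypothetical stand-in for the loading step; not the Nutch or Tika API. */
final class MimeConfigFallback {

  /** Placeholder for the definitions that ship with the library. */
  static final String BUILT_IN = "built-in";

  /**
   * Returns the name of the definition set that ends up in use: the custom
   * resource when one is configured and readable, otherwise BUILT_IN.
   * The opener is an assumed hook that returns null when nothing is found.
   */
  static String resolve(String customName, Function<String, InputStream> opener) {
    if (customName != null && !customName.isEmpty()) {
      try (InputStream in = opener.apply(customName)) {
        // A null stream means "not found"; a stream with no bytes is
        // treated as unusable too (assumption made for this sketch).
        if (in != null && in.read() != -1) {
          return customName;
        }
      } catch (IOException e) {
        // Unreadable custom file: fall through to the default below.
      }
      System.err.println("custom MIME config " + customName
          + " not usable, using built-in default");
    }
    // Nothing configured, or the custom file could not be used.
    return BUILT_IN;
  }
}
```

With a lookup that finds nothing, resolve("tika-mimetypes.xml", name -> null) returns BUILT_IN, so nothing has to be shipped in conf/ for the default case.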


[jira] [Resolved] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1045.
----------------------------------

    Resolution: Fixed
      Assignee: Julien Nioche

1.4 : Committed revision 1150669
trunk : Committed revision 1150670

Thanks to Markus for testing.




[jira] [Updated] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1045:
---------------------------------

             Priority: Minor  (was: Major)
    Affects Version/s: 2.0
        Fix Version/s: 1.4

