Posted to dev@nutch.apache.org by "Iain Lopata (JIRA)" <ji...@apache.org> on 2015/04/17 02:51:58 UTC

[jira] [Commented] (NUTCH-1991) Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based detection

    [ https://issues.apache.org/jira/browse/NUTCH-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499066#comment-14499066 ] 

Iain Lopata commented on NUTCH-1991:
------------------------------------

I am not running 2.0 or later, so I cannot debug or test those versions at this time, but a quick look at the code suggests that the same problem may also have propagated back to earlier steps in the process.  Perhaps all calls to tika.detect need to be reviewed to see whether they should in fact be using this.mimeTypes.
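
To illustrate the kind of mismatch being described, here is a minimal, self-contained sketch. It does not use the Tika or Nutch APIs; MagicRegistry, BUNDLED, SITE, and the "#CUSTOM" rule are all hypothetical names made up for the example. BUNDLED stands in for the configuration shipped inside the library, and SITE stands in for the externally supplied configuration that carries an extra content-based rule. Detection that goes through the bundled registry never sees the site-specific rule.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the problem; this is NOT the Tika or Nutch API. */
class MimeDetectionSketch {

    /** Maps a leading byte sequence ("magic") to a mime type. */
    static final class MagicRegistry {
        private final Map<String, String> rules = new LinkedHashMap<>();

        MagicRegistry add(String magic, String mimeType) {
            rules.put(magic, mimeType);
            return this;
        }

        /** Returns the type of the first rule whose magic prefixes the content. */
        String detect(byte[] content) {
            String head = new String(content, StandardCharsets.ISO_8859_1);
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                if (head.startsWith(rule.getKey())) {
                    return rule.getValue();
                }
            }
            return "application/octet-stream"; // fallback when nothing matches
        }
    }

    // Assumption: stands in for the rules bundled with the library.
    static final MagicRegistry BUNDLED = new MagicRegistry()
            .add("%PDF-", "application/pdf");

    // Assumption: stands in for the site-supplied file with one extra rule.
    static final MagicRegistry SITE = new MagicRegistry()
            .add("%PDF-", "application/pdf")
            .add("#CUSTOM", "application/x-custom");

    /** Models the reported behaviour: content detection ignores the site rules. */
    static String detectWithBundled(byte[] content) {
        return BUNDLED.detect(content);
    }

    /** Models the suggested behaviour: content detection uses the site rules. */
    static String detectWithSite(byte[] content) {
        return SITE.detect(content);
    }

    public static void main(String[] args) {
        byte[] data = "#CUSTOM payload".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detectWithBundled(data)); // application/octet-stream
        System.out.println(detectWithSite(data));    // application/x-custom
    }
}
```

In this model, every call site that performs content-based detection has to be pointed at the site registry for the custom rule to take effect, which is the reason for suggesting a review of each tika.detect call.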

> Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based detection
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1991
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1991
>             Project: Nutch
>          Issue Type: Bug
>          Components: util
>            Reporter: Iain Lopata
>            Priority: Minor
>         Attachments: NUTCH-1991-1.6.patch
>
>
> From Nutch version 1.5 onwards, the MimeUtil.java class, which acts as a facade to Tika for mime type detection, attempts a match using the mime type returned by the server, the filename, and the content. NUTCH-1045 provided for the use of an external tika-mimetypes.xml file that configures this process.  However, the content-based detection did not use this file and instead reverted to the configuration included in the Tika library.  Consequently, any content-based match rules added to the Nutch version of the configuration file were not used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)