Posted to dev@nutch.apache.org by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org> on 2012/02/07 16:26:59 UTC

[jira] [Updated] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

     [ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1259:
---------------------------------

    Attachment: NUTCH-1259-1.5-1.patch

Here's a patch for 1.5. Comments? We have this running in production and it works very well. It completely solves the big problem of ending up with many thousands of crap content-types.

I'll commit this one tomorrow unless there are objections.
                
> TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-1259
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1259
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1259-1.5-1.patch
>
>
> The MIME-type detected by Tika's detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end up in the documents.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira