Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2017/11/05 21:03:00 UTC

[jira] [Commented] (NUTCH-2033) parse-tika skips valid documents.

    [ https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239722#comment-16239722 ] 

Sebastian Nagel commented on NUTCH-2033:
----------------------------------------

Should this be fixed inside Nutch? Only Tika knows which composite types it supports, and it would be painful to update such a list every time the Tika dependency is upgraded. We could, however, implement this as a fallback: if no parser is found, retry as "application/xml".
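
A minimal sketch of the fallback idea, not actual Nutch or Tika code: the class FallbackParserLookup, its registry map and the parser names are hypothetical stand-ins for whatever parse-tika uses to look up a parser. Restricting the retry to types ending in "+xml" is also an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a parser registry; this is NOT the Nutch or Tika API.
class FallbackParserLookup {
    // Maps a mime type to the name of the parser registered for it.
    private final Map<String, String> parsersByType = new LinkedHashMap<>();

    void register(String mimeType, String parserName) {
        parsersByType.put(mimeType, parserName);
    }

    // Try the exact mime type first. Only if nothing is registered for it,
    // and the type carries a "+xml" suffix, retry as "application/xml".
    Optional<String> find(String mimeType) {
        String parser = parsersByType.get(mimeType);
        if (parser == null && mimeType.endsWith("+xml")) {
            parser = parsersByType.get("application/xml");
        }
        return Optional.ofNullable(parser);
    }
}
```

Because the exact lookup runs first, a composite type with a dedicated parser keeps that parser, and only unknown "+xml" types fall through to the generic XML parser.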

> parse-tika skips valid documents.
> ---------------------------------
>
>                 Key: NUTCH-2033
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2033
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Assignee: Lewis John McGibbney
>              Labels: mime-type, parse-tika, parser, tika
>             Fix For: 1.14
>
>
> If we run:
> {code}
> bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription
> {code}
> we’ll get:
> {code}
> Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml
> {code}
> The same occurs for:
> {code}
> bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
> {code}
> Both would be perfectly valid documents if they were returned as "application/xml" and "text/plain" respectively.
> This happens because parse-tika uses the mime type to retrieve a suitable parser, and some composite mime types are not included in its list of supported types even though the documents are perfectly valid and parsable. This does not even take into account that servers often return incorrect mime types for the requested documents.
> We created a helper class as a workaround for this issue. The class uses regular expressions to define synonyms. In the first case, any mime type that matches "application/(.*)\+xml" is replaced by "application/xml". This way parse-tika parses the document just fine.
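
The helper class itself is not attached to the issue, so the following is only a sketch of how such a regex-based synonym table might look. The class name MimeTypeSynonyms and its methods are hypothetical; only the pattern "application/(.*)\+xml" and its replacement "application/xml" come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: maps mime types to synonyms via regular expressions.
class MimeTypeSynonyms {
    // Insertion order is kept so that earlier rules take precedence.
    private final Map<Pattern, String> synonyms = new LinkedHashMap<>();

    void addSynonym(String regex, String replacement) {
        synonyms.put(Pattern.compile(regex), replacement);
    }

    // Returns the replacement of the first rule whose pattern matches the
    // whole mime type, or the mime type unchanged if no rule matches.
    String normalize(String mimeType) {
        for (Map.Entry<Pattern, String> rule : synonyms.entrySet()) {
            if (rule.getKey().matcher(mimeType).matches()) {
                return rule.getValue();
            }
        }
        return mimeType;
    }
}
```

With the rule addSynonym("application/(.*)\\+xml", "application/xml"), the type "application/opensearchdescription+xml" is normalized to "application/xml" before the parser lookup. Note that a rule this broad rewrites every "+xml" type, including ones that may already have a dedicated parser.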



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)