Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/04/28 14:53:32 UTC

[jira] Resolved: (TIKA-414) bug in CompositeParser.getParser function

     [ https://issues.apache.org/jira/browse/TIKA-414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-414.
--------------------------------

      Assignee: Jukka Zitting
    Resolution: Duplicate

I've fixed this problem in revision 938966 as part of the more general issue TIKA-298.

> bug in CompositeParser.getParser function
> -----------------------------------------
>
>                 Key: TIKA-414
>                 URL: https://issues.apache.org/jira/browse/TIKA-414
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Piotr B.
>            Assignee: Jukka Zitting
>
> I've upgraded Tika in my project to 0.7.
> Since then, for many HTML documents AutoDetectParser wrongly chooses the fallback parser instead of HtmlParser.
> Example of a problematic HTML input:
> <html>
> <head>
> <title>test</title>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> </head>
> <body>test</body>
> </html>
> In this example AutoDetectParser sets Metadata.CONTENT_TYPE to "text/html; charset=utf-8",
> but there is no parser registered for that exact string, so the lookup falls through to the fallback parser.
> The solution is to fix the getParser function in CompositeParser so that it ignores content type parameters (cuts off the string from the ';' to the end).
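
For illustration, the fix proposed in the report could be sketched roughly as follows. This is not the code committed in revision 938966 and not Tika's actual CompositeParser; the class ParserLookupSketch, its string-valued registry, and the baseType helper are all hypothetical names made up for this example. It only shows the idea of dropping everything from the ';' onward before looking up a parser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a parser registry; NOT Tika's CompositeParser.
public class ParserLookupSketch {

    // Maps a bare media type (no parameters) to a parser name.
    // Parser names are plain strings here to keep the sketch self-contained.
    private final Map<String, String> parsers = new HashMap<>();
    private final String fallback;

    public ParserLookupSketch(String fallback) {
        this.fallback = fallback;
    }

    public void register(String mediaType, String parserName) {
        parsers.put(baseType(mediaType), parserName);
    }

    // Cuts off the string from ';' to the end, then trims and lower-cases it.
    public static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Looks up a parser by bare media type; unknown types get the fallback.
    public String getParser(String contentType) {
        return parsers.getOrDefault(baseType(contentType), fallback);
    }

    public static void main(String[] args) {
        ParserLookupSketch registry = new ParserLookupSketch("FallbackParser");
        registry.register("text/html", "HtmlParser");

        // The value from the report now resolves to the HTML parser.
        System.out.println(registry.getParser("text/html; charset=utf-8"));
        // An unregistered type still falls back.
        System.out.println(registry.getParser("application/x-unknown"));
    }
}
```

With the parameters stripped, "text/html; charset=utf-8" and "text/html" resolve to the same registry entry, while the charset stays available in the metadata for the parser that is eventually chosen.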

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.