Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/04/28 14:53:32 UTC
[jira] Resolved: (TIKA-414) bug in CompositeParser.getParser function
[ https://issues.apache.org/jira/browse/TIKA-414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-414.
--------------------------------
Assignee: Jukka Zitting
Resolution: Duplicate
I've fixed this problem in revision 938966 as a part of the more generic issue TIKA-298.
> bug in CompositeParser.getParser function
> -----------------------------------------
>
> Key: TIKA-414
> URL: https://issues.apache.org/jira/browse/TIKA-414
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Reporter: Piotr B.
> Assignee: Jukka Zitting
>
> I've upgraded Tika in my project to 0.7.
> After that, for many HTML documents, AutoDetectParser wrongly chooses the fallback parser instead of HtmlParser.
> Example of problematic HTML input:
> <html>
> <head>
> <title>test</title>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> </head>
> <body>test</body>
> </html>
> In this example AutoDetectParser sets Metadata.CONTENT_TYPE to "text/html; charset=utf-8",
> but there is no parser registered for that string.
> The solution is to fix the getParser function in CompositeParser so that it ignores content type parameters (cuts off the string from ';' to the end).
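For illustration only, the fix proposed in the report could look roughly like the sketch below. It is not the change committed in revision 938966, and it does not use Tika's classes: MediaTypeLookup, baseType and lookup are hypothetical names, and a plain Map of strings stands in for the parser registry. Lower-casing the type is an extra assumption, based on media types being case-insensitive.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MediaTypeLookup {

    // Hypothetical helper: drop everything from the first ';' onward,
    // so "text/html; charset=utf-8" becomes "text/html".
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        // Assumption: normalise whitespace and case before the lookup.
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical stand-in for a getParser-style lookup: the registry is
    // keyed by bare media type, and the fallback is used when nothing matches.
    static String lookup(Map<String, String> parsers, String contentType, String fallback) {
        String parser = parsers.get(baseType(contentType));
        return parser != null ? parser : fallback;
    }

    public static void main(String[] args) {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("text/html", "HtmlParser");

        // The value reported in the issue now resolves to the HTML parser.
        System.out.println(lookup(parsers, "text/html; charset=utf-8", "fallback"));
        // prints "HtmlParser"
    }
}
```

Without the baseType step, the same lookup with "text/html; charset=utf-8" as the key would miss the "text/html" entry and return the fallback, which matches the behaviour described in the report.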
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.