Posted to dev@tika.apache.org by "Piotr B. (JIRA)" <ji...@apache.org> on 2010/04/28 10:01:32 UTC

[jira] Created: (TIKA-414) bug in CompositeParser.getParser function

bug in CompositeParser.getParser function
-----------------------------------------

                 Key: TIKA-414
                 URL: https://issues.apache.org/jira/browse/TIKA-414
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
            Reporter: Piotr B.


I've upgraded Tika in my project to 0.7.
Since then, for many HTML documents AutoDetectParser wrongly chooses the fallback parser instead of HtmlParser.

Example of a problematic HTML input:

<html>
<head>
<title>test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>test</body>
</html>


In this example AutoDetectParser sets Metadata.CONTENT_TYPE to "text/html; charset=utf-8",
but no parser is registered for that exact string.

The solution is to fix the getParser function in CompositeParser so that it ignores content type parameters (cut off the string from ';' to the end).
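
A minimal sketch of the proposed lookup, to make the idea concrete. This is not Tika's actual code: the class MediaTypeLookup, its methods, and the String-to-String registry are hypothetical stand-ins for the parser map inside CompositeParser. It only shows how dropping everything from ';' onward lets the lookup match the registered base type.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Tika.
class MediaTypeLookup {

    // Assumed registry: base media type -> parser name.
    // The real CompositeParser maps types to Parser instances.
    private static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/html", "HtmlParser");
    }

    static final String FALLBACK = "fallback";

    /** Returns the media type without parameters, e.g. "text/html". */
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Looks up the full string as-is (the behaviour reported above). */
    static String lookupExact(String contentType) {
        return PARSERS.getOrDefault(contentType, FALLBACK);
    }

    /** Looks up the type after cutting off the parameters. */
    static String lookupByBaseType(String contentType) {
        String base = baseType(contentType);
        return base == null ? FALLBACK : PARSERS.getOrDefault(base, FALLBACK);
    }

    public static void main(String[] args) {
        String type = "text/html; charset=utf-8";
        System.out.println(lookupExact(type));      // prints "fallback"
        System.out.println(lookupByBaseType(type)); // prints "HtmlParser"
    }
}
```

The sketch also trims whitespace and lower-cases the base type, which goes slightly beyond the cut-off proposed above; that normalization is an extra assumption on my part.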


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-414) bug in CompositeParser.getParser function

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-414.
--------------------------------

      Assignee: Jukka Zitting
    Resolution: Duplicate

I've fixed this problem in revision 938966 as a part of the more generic issue TIKA-298.

