Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/01/13 17:10:39 UTC

[jira] [Created] (TIKA-1514) http-equiv content-type extraction should pick first parseable content value

Tim Allison created TIKA-1514:
---------------------------------

             Summary: http-equiv content-type extraction should pick first parseable content value 
                 Key: TIKA-1514
                 URL: https://issues.apache.org/jira/browse/TIKA-1514
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.6
            Reporter: Tim Allison
            Priority: Trivial
             Fix For: 1.8


In a handful of files from govdocs1, there are some creative http-equiv content-type headers, including: 
{noformat}
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" name="keywords" content="DNRC, division of nutrition">
{noformat}

The tag carries two content attributes, and the value that ends up as the content type in this file's metadata is "DNRC, division of nutrition" rather than "text/html; charset=iso-8859-1".

Let's modify our HTML meta header charset detector to pick the first parseable charset value.
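
A minimal sketch of the proposed behavior, using only the JDK and not Tika's actual detector code. The class name, method names, and regular expressions are hypothetical, and "parseable" is assumed here to mean a "type/subtype; charset=name" value whose charset the JVM recognizes:

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Tika's implementation.
class FirstParseableContentType {

    // Finds every double-quoted content="..." attribute in a meta tag, in document order.
    private static final Pattern CONTENT_ATTR =
            Pattern.compile("\\bcontent\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Accepts "type/subtype; charset=name" and captures the charset name.
    private static final Pattern TYPE_WITH_CHARSET = Pattern.compile(
            "\\s*[\\w.+-]+/[\\w.+-]+\\s*;\\s*charset\\s*=\\s*[\"']?([\\w.:+-]+)[\"']?\\s*",
            Pattern.CASE_INSENSITIVE);

    // Collects all content attribute values from a single meta tag.
    static List<String> contentValues(String metaTag) {
        List<String> values = new ArrayList<>();
        Matcher m = CONTENT_ATTR.matcher(metaTag);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    // Returns the first value that parses and names a charset the JVM supports.
    static Optional<String> firstParseable(List<String> values) {
        for (String value : values) {
            Matcher m = TYPE_WITH_CHARSET.matcher(value);
            if (m.matches() && isKnownCharset(m.group(1))) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    // Charset.isSupported throws on syntactically illegal names, so treat that as "unknown".
    static boolean isKnownCharset(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"content-type\" content=\"text/html; charset=iso-8859-1\" "
                + "name=\"keywords\" content=\"DNRC, division of nutrition\">";
        // Prints: text/html; charset=iso-8859-1
        System.out.println(firstParseable(contentValues(tag)).orElse("(none)"));
    }
}
```

With the tag quoted above, the keywords value is skipped because it does not parse as a content type, so the earlier "text/html; charset=iso-8859-1" value is the one returned.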



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)