Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/01/13 17:10:39 UTC
[jira] [Created] (TIKA-1514) http-equiv content-type extraction
should pick first parseable content value
Tim Allison created TIKA-1514:
---------------------------------
Summary: http-equiv content-type extraction should pick first parseable content value
Key: TIKA-1514
URL: https://issues.apache.org/jira/browse/TIKA-1514
Project: Tika
Issue Type: Bug
Affects Versions: 1.6
Reporter: Tim Allison
Priority: Trivial
Fix For: 1.8
In a handful of files from govdocs1, there are some creative http-equiv content-type headers, including:
{noformat}
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" name="keywords" content="DNRC, division of nutrition">
{noformat}
The content type that ends up in the metadata for this file is "DNRC, division of nutrition", i.e. the value of the second content attribute rather than the first.
Let's modify our HTML meta-header charset detector to pick the first parseable charset value.
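
As a rough illustration of "pick the first parseable charset value", here is a minimal sketch. It is not Tika's actual detector code: the class name FirstParseableCharset, the method firstCharset, and the regex-based attribute scanning are all assumptions made for this example, and it only handles double-quoted content attributes. It walks the content values in document order and returns the first one whose charset parameter names a charset the JVM supports.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
final class FirstParseableCharset {
    // Each double-quoted content="..." attribute value in a tag.
    // "content-type" inside http-equiv does not match: no '=' follows it.
    private static final Pattern CONTENT = Pattern.compile(
            "\\bcontent\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // A charset=NAME parameter inside one content value.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*['\"]?([^\\s;'\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset, in attribute order, that the JVM can resolve.
    static Optional<Charset> firstCharset(String metaTag) {
        Matcher content = CONTENT.matcher(metaTag);
        while (content.find()) {
            Matcher charset = CHARSET.matcher(content.group(1));
            if (!charset.find()) {
                continue; // e.g. a keywords value with no charset parameter
            }
            String name = charset.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Optional.of(Charset.forName(name));
                }
            } catch (IllegalCharsetNameException e) {
                // Syntactically invalid name: skip it and try the next value.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"content-type\" "
                + "content=\"text/html; charset=iso-8859-1\" "
                + "name=\"keywords\" content=\"DNRC, division of nutrition\">";
        System.out.println(firstCharset(tag));
    }
}
```

With the tag quoted above, the keywords value is never consulted because the first content value already yields a usable charset. If the attributes appeared in the opposite order, the keywords value would be skipped as unparseable and the later value used instead.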