Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2016/10/19 16:23:02 UTC
[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-539:
-----------------------------------
Fix Version/s: 1.15 (was: 1.14)
> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
> Issue Type: Improvement
> Components: metadata, parser
> Affects Versions: 0.8, 0.9, 0.10
> Reporter: Reinhard Schwab
> Assignee: Ken Krugler
> Priority: Minor
> Fix For: 1.15
>
> Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding declared in the meta tag is wrong, that encoding is still the one detected,
> even when the correct encoding was already set in the metadata beforehand (for example, from the HTTP response header).
> Test code to reproduce:
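> import java.io.ByteArrayInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.SAXException;
>
> // Class name is a placeholder; the original report gave only the members.
> public class Tika539Test {
>     // The meta tag declares iso-8859-1, but the bytes passed to the parser are UTF-8.
>     static String content = "<html><head>\n"
>         + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>         + "</head><body>Über den Wolken\n</body></html>";
>
>     public static void main(String[] args) throws IOException, SAXException,
>             TikaException {
>         Metadata metadata = new Metadata();
>         metadata.set(Metadata.CONTENT_TYPE, "text/html");
>         // Correct encoding supplied up front, e.g. from an HTTP response header.
>         metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>         System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>         InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>         AutoDetectParser parser = new AutoDetectParser();
>         BodyContentHandler h = new BodyContentHandler(10000);
>         parser.parse(in, h, metadata, new ParseContext());
>         System.out.print(h.toString());
>         // Per this report, the value printed here follows the meta tag, not the hint set above.
>         System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>     }
> }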
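
For anyone wondering why a wrong charset in the meta tag matters for the extracted text, here is a minimal sketch using only the JDK (no Tika involved). It decodes the UTF-8 bytes of the same sample string once as UTF-8 and once as ISO-8859-1. The class name MetaCharsetDemo and the decode helper are made up for illustration and are not part of Tika or of the attached patches.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MetaCharsetDemo {
    // Hypothetical helper: decode raw bytes with whichever charset was "detected".
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u00DC" is the capital U with umlaut from the test content above.
        String text = "\u00DCber den Wolken";
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        String asUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);
        // Decoding with the charset named in the meta tag turns the two-byte
        // umlaut sequence into two separate characters.
        String asLatin1 = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip intact: " + asUtf8.equals(text));
        System.out.println("chars when decoded as UTF-8: " + asUtf8.length());
        System.out.println("chars when decoded as ISO-8859-1: " + asLatin1.length());
    }
}
```

This only illustrates the decoding mismatch. How Tika should weigh the meta tag against a charset already present in the metadata is the open question in this issue.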
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)