Posted to dev@any23.apache.org by "Hans Brende (JIRA)" <ji...@apache.org> on 2018/08/05 18:21:00 UTC
[jira] [Updated] (ANY23-385) Improve charset detection for (x)html documents
[ https://issues.apache.org/jira/browse/ANY23-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hans Brende updated ANY23-385:
------------------------------
Description:
When attempting to detect a document's encoding, our {{TikaEncodingDetector}} does not take into account the following elements which may occur in html/xhtml documents:
HTML:
{{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}
HTML5:
{{<meta charset="xyz">}}
XHTML:
{{<?xml version='1.0' encoding='xyz'?>}}
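As a rough illustration of what honoring these declarations could look like, here is a minimal sketch. It is not the fix applied in Any23: the class name {{DeclaredCharsetSniffer}}, the method {{declaredCharset}}, and the regular expressions are all hypothetical, and the sketch assumes an ASCII-compatible encoding (it would not find declarations in UTF-16 documents).

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23 or Tika.
public class DeclaredCharsetSniffer {

    // XML declaration, e.g. <?xml version="1.0" encoding="xyz"?>
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']");

    // Any <meta ...> tag.
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);

    // "charset=xyz" inside a meta tag; covers both <meta charset="xyz">
    // and content="text/html; charset=xyz".
    private static final Pattern CHARSET_ATTR = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in the given bytes, if any. */
    public static Optional<String> declaredCharset(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding
        // never fails and leaves ASCII markup intact for pattern matching.
        String text = new String(head, StandardCharsets.ISO_8859_1);

        Matcher xml = XML_DECL.matcher(text);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META_TAG.matcher(text);
        while (meta.find()) {
            Matcher charset = CHARSET_ATTR.matcher(meta.group());
            if (charset.find()) {
                return Optional.of(charset.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredCharset(html));
    }
}
```

A declared charset found this way would presumably be treated as a strong hint rather than as ground truth, since pages can declare one encoding and be served in another.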
In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes of the document. If, for example, the first non-ASCII (multi-byte UTF-8) character occurs later than that, the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 instead of UTF-8, even if UTF-8 is specified in the meta charset element of the page.
I have seen this problem occur with, e.g., the webpage http://losangeles.eventful.com/events/september. There, the UTF-8 charset was properly specified at the top of the page, but the first non-ASCII characters occurred far past the 12000-byte mark, in JSON-LD content towards the bottom of the page. As a result, the {{TikaEncodingDetector}} misidentified the encoding as ISO-8859-1, and certain JSON-LD strings came out looking like gibberish.
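The failure mode described above can be reproduced in isolation. The following is a small sketch using only the JDK: the class {{SniffWindowDemo}} and its methods are hypothetical, and {{SNIFF_LIMIT}} is simply the 12000-byte figure quoted in this description, not a constant read from Tika. It builds a document that declares UTF-8, keeps its first 12000 bytes pure ASCII, and places one non-ASCII character after that point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Any23 or Tika.
public class SniffWindowDemo {

    // Assumed sniff window, taken from the 12000-byte figure in this issue.
    static final int SNIFF_LIMIT = 12000;

    /** Builds a UTF-8 document whose only non-ASCII byte sequence lies past SNIFF_LIMIT. */
    static byte[] buildSample() {
        StringBuilder sb = new StringBuilder("<meta charset=\"utf-8\">");
        while (sb.length() < SNIFF_LIMIT + 100) {
            sb.append(' '); // ASCII padding standing in for the rest of the page
        }
        sb.append("{\"name\": \"Caf\u00e9\"}"); // JSON-LD-like tail containing U+00E9
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** True if none of the first {@code limit} bytes has its high bit set. */
    static boolean prefixIsPureAscii(byte[] doc, int limit) {
        int end = Math.min(limit, doc.length);
        for (int i = 0; i < end; i++) {
            if (doc[i] < 0) { // bytes >= 0x80 are negative in Java
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] doc = buildSample();

        // The sniffed window is byte-for-byte identical under UTF-8,
        // ISO-8859-1, and Windows-1252, so it carries no statistical evidence.
        System.out.println("prefix is pure ASCII: " + prefixIsPureAscii(doc, SNIFF_LIMIT));

        // Decoding with the wrong charset turns the two UTF-8 bytes of U+00E9
        // (0xC3 0xA9) into two separate characters, U+00C3 and U+00A9.
        String wrong = new String(doc, StandardCharsets.ISO_8859_1);
        String right = new String(doc, StandardCharsets.UTF_8);
        System.out.println("mojibake present: " + wrong.contains("Caf\u00c3\u00a9"));
        System.out.println("correct text present: " + right.contains("Caf\u00e9"));
    }
}
```

Because the sniffed prefix is valid in all three encodings, a purely statistical detector has nothing to go on in this case, which is why the declared charset matters.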
was:
When attempting to detect a document's encoding, our {{TikaEncodingDetector}} does not take into account the following elements which may occur in html/xhtml documents:
HTML:
{{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}
HTML5:
{{<meta charset="xyz">}}
XHTML:
{{<?xml encoding='xyz'?>}}
In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes of the document, meaning that if, for example, the first UTF-8 encoded character occurs later than that, the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 instead of UTF-8 (even if UTF-8 were specified in the meta charset element of the page.)
I have seen this problem occur with, e.g., the webpage http://losangeles.eventful.com/events/september (where the first UTF-8 encoded characters occurred far past the 12000 byte mark in JSON-LD content towards the bottom of the page, causing certain JSON-LD strings to come out looking like gibberish).
> Improve charset detection for (x)html documents
> -----------------------------------------------
>
> Key: ANY23-385
> URL: https://issues.apache.org/jira/browse/ANY23-385
> Project: Apache Any23
> Issue Type: Improvement
> Components: encoding
> Affects Versions: 2.3
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Major
> Fix For: 2.3