Posted to dev@any23.apache.org by "Hans Brende (JIRA)" <ji...@apache.org> on 2018/08/05 18:21:00 UTC

[jira] [Updated] (ANY23-385) Improve charset detection for (x)html documents

     [ https://issues.apache.org/jira/browse/ANY23-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Brende updated ANY23-385:
------------------------------
    Description: 
When attempting to detect a document's encoding, our {{TikaEncodingDetector}} does not take into account the following declarations, which may occur in HTML/XHTML documents:

HTML:
{{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}

HTML5: 
{{<meta charset="xyz">}}

XHTML:
{{<?xml encoding='xyz'?>}}
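
For illustration, the three declarations above could be picked out of the start of a byte stream roughly as follows. This is only a sketch under stated assumptions: the class name {{DeclaredCharsetSketch}}, its {{find}} method, and both regular expressions are hypothetical and are not part of Any23 or Tika. It also assumes an ASCII-compatible encoding, so it would not find declarations in, e.g., UTF-16 documents.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an Any23 or Tika class.
class DeclaredCharsetSketch {

    // XML declaration, e.g. <?xml encoding='xyz'?>
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Covers both <meta charset="xyz"> and
    // <meta http-equiv="content-type" content="text/html; charset=xyz"/>
    private static final Pattern META = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?\\s*([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the first charset name declared in the given bytes, if any.
    static Optional<String> find(byte[] bytes) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding
        // never fails and leaves ASCII markup readable for the regexes.
        String head = new String(bytes, StandardCharsets.ISO_8859_1);
        Matcher xml = XML_DECL.matcher(head);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META.matcher(head);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<meta http-equiv=\"content-type\" content=\"text/html; charset=xyz\"/>",
            "<meta charset=\"xyz\">",
            "<?xml encoding='xyz'?>"
        };
        for (String s : samples) {
            System.out.println(find(s.getBytes(StandardCharsets.US_ASCII)));
        }
    }
}
```

A declared charset found this way would still need to be validated (e.g. checked against the charsets the JVM supports) before being trusted over statistical detection.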

In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes of the document. If the first non-ASCII (multi-byte) UTF-8 character occurs later than that, the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 instead of UTF-8, even if UTF-8 is specified in the meta charset element of the page.
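
To make the sniffing limit concrete, here is a small sketch that does not call Tika at all. It builds a sample document whose only non-ASCII character sits past the 12000-byte mark and checks what a byte-level detector could see in the sniffed window. The class and method names are hypothetical; the 12000 figure is the one quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration class; it does not use Any23 or Tika.
class SniffWindowSketch {

    // Number of leading bytes inspected, as quoted in this issue.
    static final int SNIFF_LIMIT = 12000;

    // True if any byte has its high bit set, i.e. lies outside 7-bit ASCII.
    static boolean hasNonAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return true;
            }
        }
        return false;
    }

    // UTF-8 document whose first multi-byte character occurs after SNIFF_LIMIT.
    static byte[] sampleDocument() {
        StringBuilder sb = new StringBuilder(
                "<html><head><meta charset=\"UTF-8\"></head><body>");
        while (sb.length() < SNIFF_LIMIT + 100) {
            sb.append("plain ascii filler ");
        }
        sb.append("caf\u00e9</body></html>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = sampleDocument();
        byte[] window = Arrays.copyOf(doc, SNIFF_LIMIT);
        // The window is pure ASCII, which is equally valid as UTF-8,
        // ISO-8859-1, or Windows-1252, so byte statistics cannot decide.
        System.out.println("non-ASCII in sniff window: " + hasNonAscii(window));   // false
        System.out.println("non-ASCII in whole document: " + hasNonAscii(doc));    // true
    }
}
```

In such a case the declared charset is the only evidence available within the window, which is why ignoring it leads to a wrong guess.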

I have seen this problem occur with, e.g., the webpage http://losangeles.eventful.com/events/september. There, the UTF-8 charset was properly specified at the top of the page, but the first non-ASCII UTF-8 characters occurred far past the 12000-byte mark, in JSON-LD content towards the bottom of the page. As a result, the {{TikaEncodingDetector}} misidentified the encoding as ISO-8859-1, and certain JSON-LD strings came out looking like gibberish.
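
The kind of gibberish described here can be reproduced in isolation by decoding UTF-8 bytes as ISO-8859-1. The following is a minimal sketch with a made-up sample string and hypothetical class and method names; it is not taken from the affected page or from Any23.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class; the sample string is invented.
class MojibakeSketch {

    // Encodes text as UTF-8, then decodes those bytes as ISO-8859-1,
    // which is what happens when the encoding is misidentified.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";          // 4 characters
        String garbled = misdecode(original);   // 5 characters: each UTF-8 byte becomes one char
        System.out.println(original.length() + " -> " + garbled.length());
        // The two-byte UTF-8 sequence C3 A9 turns into U+00C3 followed by U+00A9.
        System.out.println(garbled.equals("caf\u00c3\u00a9"));
    }
}
```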


  was:
When attempting to detect a document's encoding, our {{TikaEncodingDetector}} does not take into account the following elements which may occur in html/xhtml documents:

HTML:
{{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}

HTML5: 
{{<meta charset="xyz">}}

XHTML:
{{<?xml encoding='xyz'?>}}

In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes of the document. If the first non-ASCII (multi-byte) UTF-8 character occurs later than that, the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 instead of UTF-8, even if UTF-8 is specified in the meta charset element of the page.

I have seen this problem occur with, e.g., the webpage http://losangeles.eventful.com/events/september (where the first UTF-8 encoded characters occurred far past the 12000 byte mark in JSON-LD content towards the bottom of the page, causing certain JSON-LD strings to come out looking like gibberish).



> Improve charset detection for (x)html documents
> -----------------------------------------------
>
>                 Key: ANY23-385
>                 URL: https://issues.apache.org/jira/browse/ANY23-385
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>


