You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@any23.apache.org by "Hans Brende (JIRA)" <ji...@apache.org> on 2018/08/05 23:53:00 UTC

[jira] [Resolved] (ANY23-385) Improve charset detection for (x)html documents

     [ https://issues.apache.org/jira/browse/ANY23-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Brende resolved ANY23-385.
-------------------------------
    Resolution: Fixed

> Improve charset detection for (x)html documents
> -----------------------------------------------
>
>                 Key: ANY23-385
>                 URL: https://issues.apache.org/jira/browse/ANY23-385
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> When attempting to detect a document's encoding, our {{TikaEncodingDetector}} does not take into account the following elements which may occur in html/xhtml documents:
> HTML:
> {{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}
> HTML5: 
> {{<meta charset="xyz">}}
> XHTML:
> {{<?xml encoding='xyz'?>}}
> In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes of the document, meaning that if, for example, the first UTF-8 encoded character occurs later than that, the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 instead of UTF-8 (even if UTF-8 were specified in the meta charset element of the page.) 
> I have seen this problem occur with, e.g., the webpage http://losangeles.eventful.com/events/september (where the UTF-8 charset was properly specified at the top of the page, but the first UTF-8 encoded characters occurred far past the 12000 byte mark in JSON-LD content towards the bottom of the page, causing the TikaEncodingDetector to misidentify the encoding as ISO-8859-1, causing certain JSON-LD strings to come out looking like gibberish).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)