Posted to dev@tika.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2018/10/20 15:07:00 UTC

[jira] [Commented] (TIKA-2758) Possible error charset detection

    [ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657887#comment-16657887 ] 

Sebastian Nagel commented on TIKA-2758:
---------------------------------------

Both documents are encoded as UTF-8 and both specify
{noformat}
<meta http-equiv="content-type" content="text/html; charset=utf8" />
{noformat}
and "utf8" is contained in the [list of non-IANA charsets|https://github.com/apache/tika/blob/1.19.1/tika-parsers/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt] (introduced by TIKA-2592). As a result, "utf8" is skipped as a charset hint and the default ISO-8859-1 is used.
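
To illustrate the symptom, here is a minimal sketch in Python (not Tika code) that decodes UTF-8 bytes with the wrong charset. It assumes the reported sample text contained U+2014, and that the single-byte fallback behaves like windows-1252 (the usual browser-style treatment of an ISO-8859-1 label), which would explain the euro sign in the reported output. The variable names are made up for the example.

```python
import codecs

# Sample text from the report; \u2014 is assumed to be the original dash character.
original = "Spokane, Wash. \u2014 The solar"
raw = original.encode("utf-8")  # the dash becomes the bytes E2 80 94

# Correct decoding: the charset hint is honoured.
as_utf8 = raw.decode("utf-8")

# Wrong decoding: each UTF-8 byte is read as one single-byte character.
as_cp1252 = raw.decode("cp1252")      # assumption: windows-1252 style fallback
as_latin1 = raw.decode("iso-8859-1")  # strict ISO-8859-1: 0x80 and 0x94 become C1 controls

# Python's own codec registry (not Tika's charset list) accepts "utf8" as an alias.
python_name_for_utf8 = codecs.lookup("utf8").name

print(as_utf8)
print(as_cp1252)
print(repr(as_latin1))
print(python_name_for_utf8)
```

Under the windows-1252 assumption the three dash bytes turn into "â", "€" and a closing double quote, which matches the garbled string quoted in the issue description.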

> Possible error charset detection
> --------------------------------
>
>                 Key: TIKA-2758
>                 URL: https://issues.apache.org/jira/browse/TIKA-2758
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: detroidnews.html, independent.html
>
>
> I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all 995 unit tests and observed three failures, two encoding issues and one other weird thing. The tests use real HTML.
> Where we previously extracted text such as 'Spokane, Wash. [— The solar' we now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could take ["weeks, or' but we now get 'could take [“weeks, or' extracted. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.
> Attached are the two HTML files.
> Reading our tests again, I see an old note beside the independent test complaining about the character encoding being incorrect. It seems that somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.


