Posted to dev@tika.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2018/10/18 11:08:00 UTC
[jira] [Created] (TIKA-2758) Possible error charset detection
Markus Jelsma created TIKA-2758:
-----------------------------------
Summary: Possible error charset detection
Key: TIKA-2758
URL: https://issues.apache.org/jira/browse/TIKA-2758
Project: Tika
Issue Type: Bug
Components: core
Affects Versions: 1.18
Reporter: Markus Jelsma
Fix For: 1.20
Attachments: detroidnews.html, independent.html
I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all 995 unit tests and observed three failures: two encoding issues and one other weird thing. The tests use real HTML.
Where we previously extracted text such as 'Spokane, Wash. [— The solar' we now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could take ["weeks, or' but we now get 'could take [“weeks, or' extracted. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.
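The garbled output in the first test looks like what you get when UTF-8 bytes are decoded as windows-1252. That is only an assumption about the cause and has not been confirmed against the attached files. A minimal sketch of that effect, independent of Tika, where the helper name misdecode and its parameters are hypothetical:

```python
# Illustration only: this does not call Tika. It assumes the page is really
# UTF-8 and that the detector picked windows-1252 (cp1252) instead.
EXPECTED = "Spokane, Wash. [\u2014 The solar"  # \u2014 is the em dash from the test


def misdecode(text: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Encode text with its real charset, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed)


def repair(garbled: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Reverse misdecode(); only works if no bytes were lost along the way."""
    return garbled.encode(assumed).decode(actual)


garbled = misdecode(EXPECTED)
print(garbled)          # the em dash becomes three characters starting with 'â€'
print(repair(garbled))  # round-trips back to the expected text
```

If the extracted text from detroidnews.html round-trips through repair() back to the expected string, that would point at the detected charset rather than at the parser.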
Attached are the two HTML files.