Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/08/03 12:04:20 UTC
[jira] [Comment Edited] (TIKA-721) UTF16-LE not detected
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405785#comment-15405785 ]
Tim Allison edited comment on TIKA-721 at 8/3/16 12:03 PM:
-----------------------------------------------------------
While working on TIKA-2038, I found that ICU4J now correctly identifies this file. If we add a stripper that ignores the contents of <script>/<style> elements, we might consider promoting ICU4J to run before UniversalChardet... This would effectively turn off UniversalChardet, IIRC, because I think ICU4J is guaranteed to return a non-null value.
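A minimal sketch of what that stripper could look like (not the actual Tika implementation; this assumes an ASCII-compatible byte stream, since detection runs before we know the encoding):

```python
# Strip <script>/<style> contents before charset detection so that
# ASCII-heavy JavaScript and CSS do not skew a statistical detector
# toward windows-125x guesses. Operates on raw bytes because the
# encoding is not yet known; assumes element names are plain ASCII.
import re

STRIP = re.compile(rb'<(script|style)\b[^>]*>.*?</\1\s*>',
                   re.IGNORECASE | re.DOTALL)

def strip_script_and_style(raw):
    """Remove <script>...</script> and <style>...</style> spans."""
    return STRIP.sub(b'', raw)
```

The stripped bytes would then be handed to the detector chain instead of the full document.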
A general test corpus would be great. If we follow the approach and test corpus of [~faghani], we should be able to evaluate potential changes at least against the encodings in his corpus. We also have a decent number of files in our regression corpus (TIKA-1302); most are depressingly English and/or UTF-8.
We could augment Shabanali's corpus by transcoding to UTF-8/UTF-16, etc.
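The transcoding step is trivial to sketch, assuming each source file's encoding is already known from the corpus labels (paths here are illustrative):

```python
# Augment a labeled corpus by re-encoding each file, so the same text
# exercises additional charsets (e.g. UTF-8 -> UTF-16LE).
import codecs

def transcode(src_path, src_encoding, dst_path, dst_encoding):
    """Re-encode one corpus file into another charset."""
    with codecs.open(src_path, 'r', encoding=src_encoding) as f:
        text = f.read()
    with codecs.open(dst_path, 'w', encoding=dst_encoding) as f:
        f.write(text)
```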
Proposed eval approach 1 ([~faghani]'s approach): assume the actual HTTP header or the http-meta header is accurate [1], run ICU4J and UniversalChardet against the files, and compare each result with the declared charset.
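Roughly, the scoring loop for approach 1 could look like this; detect() is a hypothetical callable (e.g. a wrapper around ICU4J's CharsetDetector) returning a charset name:

```python
# Score a charset detector against the charset each file declares in its
# own meta tag, treating the declaration as ground truth.
import re

META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def declared_charset(raw):
    """Return the charset declared in an http-equiv/meta tag, or None."""
    m = META_CHARSET.search(raw)
    return m.group(1).decode('ascii').lower() if m else None

def score(corpus, detect):
    """Fraction of files whose detected charset matches the declared one."""
    hits = total = 0
    for raw in corpus:
        truth = declared_charset(raw)
        if truth is None:
            continue  # no ground truth for this file
        total += 1
        hits += int(detect(raw).lower() == truth)
    return hits / float(total) if total else 0.0
```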
Proposed eval approach 2: compare potential changes against the current method. Run our tika-eval (TIKA-1332) module against the output and evaluate a random sample of files that have differing contents.
[1] This generally gives me great pause, but via random sampling, this appears to be reasonable in Shabanali's corpus.
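The sampling step in approach 2 could be sketched like this, assuming the extracted text from both detector chains is held as {file name: content} mappings (the names are hypothetical):

```python
# Pick a reproducible random sample of files whose extracted contents
# differ between the current and the candidate detector chain, for
# manual review.
import random

def sample_diffs(current, candidate, k=50, seed=0):
    """Return up to k file names with differing extracted contents."""
    diffs = sorted(name for name in current
                   if current[name] != candidate.get(name))
    return random.Random(seed).sample(diffs, min(k, len(diffs)))
```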
was (Author: tallison@mitre.org):
While working on TIKA-2038, I found that ICU4J now correctly identifies this file. If we add a stripper that ignores the contents of <script>/<style> elements, we might consider promoting ICU4J to run before UniversalChardet... This would effectively turn off UniversalChardet, IIRC, because I think ICU4J is guaranteed to return a non-null value.
A general test corpus would be great. If we follow the approach and test corpus of [~faghani], we should be able to evaluate potential changes at least against the encodings in his corpus. We also have a decent number of files in our regression corpus (TIKA-1302); most are depressingly English and/or UTF-8.
We could augment Shabanali's corpus by transcoding to UTF-8/UTF-16, etc.
Proposed eval approach 1 ([~faghani]'s approach): assume the http-meta header is accurate [1], run ICU4J and UniversalChardet against the files and compare with the meta-header.
Proposed eval approach 2: compare potential changes against the current method. Run our tika-eval (TIKA-1332) module against the output and evaluate a random sample of files that have differing contents.
[1] This generally gives me great pause, but via random sampling, this appears to be reasonable in Shabanali's corpus.
> UTF16-LE not detected
> ---------------------
>
> Key: TIKA-721
> URL: https://issues.apache.org/jira/browse/TIKA-721
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: Chinese_Simplified_utf16.txt, TIKA-721.patch
>
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250: confidence=9
> windows-1250: confidence=7
> windows-1252: confidence=7
> windows-1252: confidence=6
> windows-1252: confidence=5
> IBM420_ltr: confidence=4
> windows-1252: confidence=3
> windows-1254: confidence=2
> windows-1250: confidence=2
> windows-1252: confidence=2
> IBM420_rtl: confidence=1
> windows-1253: confidence=1
> windows-1250: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> {noformat}
> The test file decodes fine as UTF16-LE; eg in Python just run this:
> {noformat}
> import codecs
> codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt', 'rb').read())
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)