Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/08/03 12:04:20 UTC
[jira] [Comment Edited] (TIKA-721) UTF16-LE not detected
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405785#comment-15405785 ]
Tim Allison edited comment on TIKA-721 at 8/3/16 12:03 PM:
-----------------------------------------------------------
While working on TIKA-2038, I found that ICU4J now correctly identifies this file. If we add a stripper that ignores the contents of <script>/<style> elements, we might consider promoting ICU4J to run before UniversalChardet... This would effectively turn off UniversalChardet, IIRC, because I think ICU4J is guaranteed to return a non-null value.
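A minimal sketch of what that stripper could look like (not the actual Tika implementation; this assumes an ASCII-compatible byte stream, since detection runs before we know the encoding):

```python
# Strip <script>/<style> contents before charset detection so that
# ASCII-heavy JavaScript and CSS do not skew a statistical detector
# toward windows-125x guesses. Operates on raw bytes because the
# encoding is not yet known; assumes element names are plain ASCII.
import re

STRIP = re.compile(rb'<(script|style)\b[^>]*>.*?</\1\s*>',
                   re.IGNORECASE | re.DOTALL)

def strip_script_and_style(raw):
    """Remove <script>...</script> and <style>...</style> spans."""
    return STRIP.sub(b'', raw)
```

The stripped bytes would then be handed to the detector chain instead of the full document.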
A general test corpus would be great. If we follow the approach and test corpus of [~faghani], we should be able to evaluate potential changes at least against the encodings in his corpus. We also have a decent number of files in our regression corpus (TIKA-1302); most are depressingly English and/or UTF-8.
We could augment Shabanali's corpus by transcoding to UTF-8/UTF-16, etc.
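The transcoding step is trivial to sketch, assuming each source file's encoding is already known from the corpus labels (paths here are illustrative):

```python
# Augment a labeled corpus by re-encoding each file, so the same text
# exercises additional charsets (e.g. UTF-8 -> UTF-16LE).
import codecs

def transcode(src_path, src_encoding, dst_path, dst_encoding):
    """Re-encode one corpus file into another charset."""
    with codecs.open(src_path, 'r', encoding=src_encoding) as f:
        text = f.read()
    with codecs.open(dst_path, 'w', encoding=dst_encoding) as f:
        f.write(text)
```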
Proposed eval approach 1 ([~faghani]'s approach): assume the actual HTTP header or the http-meta header is accurate [1], run ICU4J and UniversalChardet against the files, and compare each result with the declared charset.
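Roughly, the scoring loop for approach 1 could look like this; detect() is a hypothetical callable (e.g. a wrapper around ICU4J's CharsetDetector) returning a charset name:

```python
# Score a charset detector against the charset each file declares in its
# own meta tag, treating the declaration as ground truth.
import re

META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def declared_charset(raw):
    """Return the charset declared in an http-equiv/meta tag, or None."""
    m = META_CHARSET.search(raw)
    return m.group(1).decode('ascii').lower() if m else None

def score(corpus, detect):
    """Fraction of files whose detected charset matches the declared one."""
    hits = total = 0
    for raw in corpus:
        truth = declared_charset(raw)
        if truth is None:
            continue  # no ground truth for this file
        total += 1
        hits += int(detect(raw).lower() == truth)
    return hits / float(total) if total else 0.0
```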
Proposed eval approach 2: compare potential changes against the current method. Run our tika-eval (TIKA-1332) module against the output and evaluate a random sample of files that have differing contents.
[1] This generally gives me great pause, but via random sampling, this appears to be reasonable in Shabanali's corpus.
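The sampling step in approach 2 could be sketched like this, assuming the extracted text from both detector chains is held as {file name: content} mappings (the names are hypothetical):

```python
# Pick a reproducible random sample of files whose extracted contents
# differ between the current and the candidate detector chain, for
# manual review.
import random

def sample_diffs(current, candidate, k=50, seed=0):
    """Return up to k file names with differing extracted contents."""
    diffs = sorted(name for name in current
                   if current[name] != candidate.get(name))
    return random.Random(seed).sample(diffs, min(k, len(diffs)))
```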
was (Author: tallison@mitre.org):
While working on TIKA-2038, I found that ICU4J now correctly identifies this file. If we add a stripper that ignores the contents of <script>/<style> elements, we might consider promoting ICU4J to run before UniversalChardet... This would effectively turn off UniversalChardet, IIRC, because I think ICU4J is guaranteed to return a non-null value.
A general test corpus would be great. If we follow the approach and test corpus of [~faghani], we should be able to evaluate potential changes at least against the encodings in his corpus. We also have a decent number of files in our regression corpus (TIKA-1302); most are depressingly English and/or UTF-8.
We could augment Shabanali's corpus by transcoding to UTF-8/UTF-16, etc.
Proposed eval approach 1 ([~faghani]'s approach): assume the http-meta header is accurate [1], run ICU4J and UniversalChardet against the files and compare with the meta-header.
Proposed eval approach 2: compare potential changes against the current method. Run our tika-eval (TIKA-1332) module against the output and evaluate a random sample of files that have differing contents.
[1] This generally gives me great pause, but via random sampling, this appears to be reasonable in Shabanali's corpus.
> UTF16-LE not detected
> ---------------------
>
> Key: TIKA-721
> URL: https://issues.apache.org/jira/browse/TIKA-721
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: Chinese_Simplified_utf16.txt, TIKA-721.patch
>
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250: confidence=9
> windows-1250: confidence=7
> windows-1252: confidence=7
> windows-1252: confidence=6
> windows-1252: confidence=5
> IBM420_ltr: confidence=4
> windows-1252: confidence=3
> windows-1254: confidence=2
> windows-1250: confidence=2
> windows-1252: confidence=2
> IBM420_rtl: confidence=1
> windows-1253: confidence=1
> windows-1250: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> {noformat}
> The test file decodes fine as UTF16-LE; eg in Python just run this:
> {noformat}
> import codecs
> codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt', 'rb').read())
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)