Posted to dev@tika.apache.org by "Nick Burch (Commented) (JIRA)" <ji...@apache.org> on 2011/10/02 17:04:35 UTC

[jira] [Commented] (TIKA-721) UTF16-LE not detected

    [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119018#comment-13119018 ] 

Nick Burch commented on TIKA-721:
---------------------------------

I'd suggest we check for invalid UTF-16 sequences (see http://unicode.org/faq/utf_bom.html#utf16-7 for which ranges aren't allowed). If we don't find any invalid ones, we'll still need to do some detection, since plenty of non-UTF-16 data contains no invalid sequences either. Looking for byte pairs of the form 0-255+0,0-255+0 (little endian, zero high byte second) or 0+0-255,0+0-255 (big endian, zero high byte first) will work well for western european stuff, but others will probably need fancy munging of existing ngrams....
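
To make the idea a bit more concrete, here's a rough sketch in Python (same language as the snippet in the issue, not Tika code). The function names and the 0.5 threshold are made up for illustration, and it only covers the two checks described above: rejecting unpaired surrogates, then counting which byte of each pair is zero. It deliberately gives no answer for text like the attached Chinese file, where almost no bytes are zero.

```python
import struct


def has_invalid_surrogates(data: bytes, little_endian: bool) -> bool:
    """Return True if data contains an unpaired surrogate when read as UTF-16."""
    fmt = "<H" if little_endian else ">H"
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    expecting_low = False
    for (unit,) in struct.iter_unpack(fmt, data[:usable]):
        is_high = 0xD800 <= unit <= 0xDBFF
        is_low = 0xDC00 <= unit <= 0xDFFF
        if expecting_low:
            if not is_low:
                return True  # high surrogate not followed by a low one
            expecting_low = False
        elif is_low:
            return True  # low surrogate with no preceding high one
        elif is_high:
            expecting_low = True
    return expecting_low  # data ended in the middle of a pair


def guess_utf16_endianness(data: bytes, threshold: float = 0.5):
    """Guess 'utf-16-le' or 'utf-16-be' from zero-byte positions, else None.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None
    even_zeros = sum(1 for i in range(0, pairs * 2, 2) if data[i] == 0)
    odd_zeros = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)
    # Little endian stores the low byte first, so a zero high byte
    # lands on the odd offsets; big endian is the mirror image.
    if (odd_zeros / pairs >= threshold and odd_zeros > even_zeros
            and not has_invalid_surrogates(data, little_endian=True)):
        return "utf-16-le"
    if (even_zeros / pairs >= threshold and even_zeros > odd_zeros
            and not has_invalid_surrogates(data, little_endian=False)):
        return "utf-16-be"
    return None  # e.g. CJK text: needs some other kind of detection
```

For example, guess_utf16_endianness("hello world".encode("utf-16-le")) gives "utf-16-le", while the same call on UTF-16LE encoded Chinese text gives None, which is the case that would still need the ngram work.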
                
> UTF16-LE not detected
> ---------------------
>
>                 Key: TIKA-721
>                 URL: https://issues.apache.org/jira/browse/TIKA-721
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: Chinese_Simplified_utf16.txt
>
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250:   confidence=9
> windows-1250:   confidence=7
> windows-1252:   confidence=7
> windows-1252:   confidence=6
> windows-1252:   confidence=5
> IBM420_ltr:     confidence=4
> windows-1252:   confidence=3
> windows-1254:   confidence=2
> windows-1250:   confidence=2
> windows-1252:   confidence=2
> IBM420_rtl:     confidence=1
> windows-1253:   confidence=1
> windows-1250:   confidence=1
> windows-1252:   confidence=1
> windows-1252:   confidence=1
> windows-1252:   confidence=1
> windows-1252:   confidence=1
> windows-1252:   confidence=1
> {noformat}
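> The test file decodes fine as UTF16-LE; e.g. in Python just run this:
> {noformat}
> import codecs
> # Read the raw bytes (binary mode) and decode them as UTF-16LE.
> with open('Chinese_Simplified_utf16.txt', 'rb') as f:
>     text, consumed = codecs.getdecoder('utf_16_le')(f.read())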
> {noformat}
