Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/19 18:41:08 UTC

[jira] [Created] (TIKA-721) UTF16-LE not detected

UTF16-LE not detected
---------------------

                 Key: TIKA-721
                 URL: https://issues.apache.org/jira/browse/TIKA-721
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor
         Attachments: Chinese_Simplified_utf16.txt

I have a test file encoded in UTF16-LE, but Tika fails to detect it.

Note that it is missing the BOM, which is not allowed (for UTF16-BE
the BOM is optional).

Not sure we can realistically fix this; I have no idea how...

Here's what Tika detects:

{noformat}
windows-1250:   confidence=9
windows-1250:   confidence=7
windows-1252:   confidence=7
windows-1252:   confidence=6
windows-1252:   confidence=5
IBM420_ltr:     confidence=4
windows-1252:   confidence=3
windows-1254:   confidence=2
windows-1250:   confidence=2
windows-1252:   confidence=2
IBM420_rtl:     confidence=1
windows-1253:   confidence=1
windows-1250:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
{noformat}
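
For reference, output in this shape can be produced by iterating over the matches from Tika's bundled CharsetDetector (the repackaged ICU4J detector); a minimal sketch, not necessarily the exact code used here:

{noformat}
// Sketch: feed the raw bytes to the detector and print every candidate with its confidence.
import java.io.BufferedInputStream;
import java.io.FileInputStream;

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class DetectAll {
    public static void main(String[] args) throws Exception {
        CharsetDetector detector = new CharsetDetector();
        // setText(InputStream) needs mark/reset support, hence the BufferedInputStream
        detector.setText(new BufferedInputStream(
                new FileInputStream("Chinese_Simplified_utf16.txt")));
        for (CharsetMatch match : detector.detectAll()) {
            System.out.println(match.getName() + ":   confidence=" + match.getConfidence());
        }
    }
}
{noformat}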

The test file decodes fine as UTF16-LE; e.g. in Python just run this:

{noformat}
import codecs
codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt', 'rb').read())
{noformat}


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-721) UTF16-LE not detected

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119038#comment-13119038 ] 

Robert Muir commented on TIKA-721:
----------------------------------

{quote}
Finally, for the valid code points, I count how many times each
Unicode block had a character; usually a doc will be in a single
language and have a high percentage of its chars from a single block (I
think!?).
{quote}

I don't think this is a good idea: languages like Japanese use multiple blocks, and many writing
systems (e.g. Cyrillic/Arabic/etc.) tend to use ASCII digits and punctuation...

{quote}
If I decode to a Unicode code point, I then call Java's
Character.isDefined to see if it's really valid
{quote}

I don't think this is that great either: e.g. Java 6 supports a very old version of the Unicode standard (4.x), and that method will return false for completely valid characters added in newer Unicode versions.

                

[jira] [Updated] (TIKA-721) UTF16-LE not detected

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-721:
------------------------------------

    Attachment: Chinese_Simplified_utf16.txt


[jira] [Updated] (TIKA-721) UTF16-LE not detected

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-721:
------------------------------------

    Attachment: TIKA-721.patch

Attached patch, using three simple heuristics:

First, I compute the count distribution of each of the 256 possible
byte values for the even vs odd bytes, and compute the dot product between
those two (unit-length-normalized) vectors.  For double-byte charsets
the dot product will usually be low, because even vs odd bytes act
very differently; but it is usually very high (near 1.0) for single-byte
charsets.
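
In code the idea is roughly this (a sketch, not the actual TIKA-721.patch); a result near 1.0 suggests a single-byte charset, a low result suggests a double-byte one:

{noformat}
// Sketch only: dot product of the unit-normalized byte-value histograms
// for even vs odd byte positions.
static double evenOddDotProduct(byte[] data) {
    double[][] counts = new double[2][256];
    for (int i = 0; i < data.length; i++) {
        counts[i & 1][data[i] & 0xFF]++;
    }
    double dot = 0, evenNorm = 0, oddNorm = 0;
    for (int b = 0; b < 256; b++) {
        dot += counts[0][b] * counts[1][b];
        evenNorm += counts[0][b] * counts[0][b];
        oddNorm += counts[1][b] * counts[1][b];
    }
    if (evenNorm == 0 || oddNorm == 0) {
        return 0;
    }
    return dot / (Math.sqrt(evenNorm) * Math.sqrt(oddNorm));
}
{noformat}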

Second, I decode all the bytes according to LE or BE, into UTF16 code
units, and then count up basic stats: the number of valid and invalid
surrogates, the number of valid and invalid code points.
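
A sketch of that second step (again, not the patch itself), assuming the bytes have already been combined into UTF-16 code units for one endianness:

{noformat}
// Tally surrogate and code-point validity over decoded UTF-16 code units.
static int[] codeUnitStats(char[] units) {
    int validPairs = 0, invalidSurrogates = 0, defined = 0, undefined = 0;
    for (int i = 0; i < units.length; i++) {
        int codePoint;
        if (Character.isHighSurrogate(units[i])) {
            if (i + 1 < units.length && Character.isLowSurrogate(units[i + 1])) {
                codePoint = Character.toCodePoint(units[i], units[i + 1]);
                validPairs++;
                i++;                     // consume the low surrogate
            } else {
                invalidSurrogates++;     // unpaired high surrogate
                continue;
            }
        } else if (Character.isLowSurrogate(units[i])) {
            invalidSurrogates++;         // low surrogate with no leading high surrogate
            continue;
        } else {
            codePoint = units[i];
        }
        if (Character.isDefined(codePoint)) {
            defined++;                   // limited by the JRE's Unicode version
        } else {
            undefined++;
        }
    }
    return new int[] { validPairs, invalidSurrogates, defined, undefined };
}
{noformat}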

Finally, for the valid code points, I count how many times each
Unicode block had a character; usually a doc will be in a single
language and have a high percentage of its chars from a single block (I
think!?).
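
Roughly (sketch, not the patch): the block histogram is just a map from Character.UnicodeBlock to a count, the intuition being that a correct decode concentrates its chars in a few blocks while a wrong one shows a long tail of blocks:

{noformat}
import java.util.HashMap;
import java.util.Map;

// Sketch: count decoded code points per Unicode block.
static Map<Character.UnicodeBlock, Integer> blockHistogram(int[] codePoints) {
    Map<Character.UnicodeBlock, Integer> counts =
            new HashMap<Character.UnicodeBlock, Integer>();
    for (int cp : codePoints) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        if (block != null) {                       // null for unassigned code points
            Integer prev = counts.get(block);
            counts.put(block, prev == null ? 1 : prev + 1);
        }
    }
    return counts;
}
{noformat}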

Then I use simple heuristics from these stats to get a rough
confidence.  I made [educated] guesses for thresholds to set the
confidence choices, having run on random files I have locally... but
I'd really prefer to find a nice corpus somewhere to do a more
thorough test.

                

[jira] [Commented] (TIKA-721) UTF16-LE not detected

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107969#comment-13107969 ] 

Nick Burch commented on TIKA-721:
---------------------------------

In CharsetRecog_Unicode on line 69 (inside CharsetRecog_UTF_16_LE) there is a TODO for this very case...
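
For context, the current recognizer essentially only matches when the byte-order mark is present; roughly the following (a paraphrase, not the actual CharsetRecog_Unicode source), with the TODO being about producing some confidence for BOM-less input:

{noformat}
// Rough paraphrase (not the actual Tika/ICU code) of the existing UTF-16LE check:
static int matchUtf16LE(byte[] input) {
    if (input.length >= 2
            && (input[0] & 0xFF) == 0xFF && (input[1] & 0xFF) == 0xFE) {
        return 100;   // FF FE BOM present: near-certain match
    }
    return 0;         // no BOM: zero confidence, which is exactly this bug
}
{noformat}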


[jira] [Commented] (TIKA-721) UTF16-LE not detected

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119018#comment-13119018 ] 

Nick Burch commented on TIKA-721:
---------------------------------

I'd suggest we check for invalid UTF-16 sequences (see http://unicode.org/faq/utf_bom.html#utf16-7 for which ranges aren't allowed). If we don't find any invalid ones, then we'll still need to do some detection. Looking for 0-255+0,0-255+0 or 0+0-255,0+0-255 will work well for Western European stuff, but others will probably need fancy munging of the existing n-grams...
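
A minimal sketch of that first step (hypothetical code, not something already in Tika), checking a candidate little-endian read of the bytes for the surrogate sequences the FAQ rules out:

{noformat}
// Returns true if the bytes, read as little-endian UTF-16 code units,
// contain an unpaired or wrongly ordered surrogate.
static boolean hasInvalidUtf16LE(byte[] data) {
    for (int i = 0; i + 1 < data.length; i += 2) {
        char unit = (char) ((data[i] & 0xFF) | ((data[i + 1] & 0xFF) << 8));
        if (Character.isLowSurrogate(unit)) {
            return true;                   // low surrogate with no leading high surrogate
        }
        if (Character.isHighSurrogate(unit)) {
            if (i + 3 >= data.length) {
                return true;               // high surrogate at end of input
            }
            char next = (char) ((data[i + 2] & 0xFF) | ((data[i + 3] & 0xFF) << 8));
            if (!Character.isLowSurrogate(next)) {
                return true;               // high surrogate not followed by a low surrogate
            }
            i += 2;                        // skip the consumed low surrogate
        }
    }
    return false;
}
{noformat}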
                

[jira] [Commented] (TIKA-721) UTF16-LE not detected

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119044#comment-13119044 ] 

Michael McCandless commented on TIKA-721:
-----------------------------------------


{quote}
bq. Finally, for the valid code points, I count how many times each Unicode block had a character; usually a doc will be in a single language and have a high percentage of its chars from a single block (I think!?).

I don't think this is a good idea: languages like Japanese use multiple blocks, and many writing
systems (e.g. Cyrillic/Arabic/etc.) tend to use ASCII digits and punctuation...
{quote}

Hmm, but what this means is that for such docs the new detector gives a
worse confidence than it "should".  I.e. it will result in false
negatives, not false positives.

Maybe we can use "total number of unique blocks" somehow.  For false
matches I see lots of random blocks being used (a "long tail") but for
a good match, just a few.

{quote}
bq. If I decode to a Unicode code point, I then call Java's Character.isDefined to see if it's really valid

I don't think this is that great either: e.g. Java 6 supports a very old version of the Unicode standard (4.x), and that method will return false for completely valid characters added in newer Unicode versions.
{quote}

Is there a more accurate way to check validity?  We can use the coarse
checks from the FAQ, but that doesn't rule out much.

So this means newer Unicode docs (using chars added after Unicode 4.x) will
be seen as invalid and we won't detect them.  But this will also cause
false negatives, not false positives... what percentage of the world's docs
use the newer chars?

Maybe we'll have to couple language detection w/ UTF16 LE/BE
detection to get better accuracy.

Remember we do no detection at all for UTF16 LE/BE right now, and this
patch would at least allow some (if not all) cases to be detected.  So
that'd be progress, even if it doesn't catch all cases it should.

It's the risk of false positives I'm more concerned about, i.e. where
some other double-byte charset is correctly identified today but
breaks when we commit this; that said, I produce fairly low confidence
from the detector, except when I see valid surrogate pairs, so this
*should* be rare.

Still I would really love to test against a corpus...

                

[jira] [Assigned] (TIKA-721) UTF16-LE not detected

Posted by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-721:
---------------------------------------

    Assignee: Michael McCandless
    

[jira] [Commented] (TIKA-721) UTF16-LE not detected

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119035#comment-13119035 ] 

Michael McCandless commented on TIKA-721:
-----------------------------------------

bq. I'd suggest we check for invalid UTF-16 sequences (see http://unicode.org/faq/utf_bom.html#utf16-7 for which ranges aren't allowed).

I tried to do that: if I see an unpaired surrogate, or an invalid pair, then
I count that as invalid.

If I decode to a Unicode code point, I then call Java's
Character.isDefined to see if it's really valid, though I'm not sure
this catches all the invalid cases from that FAQ.

bq. If we don't find any invalid ones, then we'll still need to do some detection. Looking for 0-255+0,0-255+0 or 0+0-255,0+0-255 will work well for western european stuff, but others will probably need fancy munging of existing ngrams....

I didn't try to do any language detection (I leave the language null), except
for the "many chars fall into a single block" simple heuristic.

                