Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/19 18:29:09 UTC

[jira] [Created] (TIKA-720) EBCDIC encoding not detected

EBCDIC encoding not detected
----------------------------

                 Key: TIKA-720
                 URL: https://issues.apache.org/jira/browse/TIKA-720
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor


I have a test file encoded in EBCDIC, but Tika fails to detect it.

Not sure we can realistically fix this; I have no idea how (and
one really ought to convert out of EBCDIC on export from a
mainframe anyway...).

Here's what Tika detects:

{noformat}
Shift_JIS:      confidence=51
Big5:           confidence=40
GB18030:        confidence=10
KOI8-R:         confidence=5
windows-1252:   confidence=5
windows-1253:   confidence=2
IBM866:         confidence=1
windows-1251:   confidence=1
windows-1250:   confidence=1
{noformat}

The test file decodes fine as cp500; e.g., in Python just run this:

{noformat}
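import codecs
# Read the raw bytes (binary mode), then decode them as cp500 (an EBCDIC code page)
with open('English_EBCDIC.txt', 'rb') as f:
    text, consumed = codecs.getdecoder('cp500')(f.read())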
{noformat}


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113525#comment-13113525 ] 

Michael McCandless commented on TIKA-720:
-----------------------------------------

Thanks Nick -- I like this solution (pre-mapping bytes to their latin1 equivalents and then running our existing language detection).
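
The pre-mapping idea can be sketched roughly as follows. This is only an illustration in Python, assuming cp500 as the EBCDIC variant; `remap_to_latin1` and `text_score` are hypothetical names, and `text_score` is a crude stand-in for a real detector, not Tika's ngram-based detection.

```python
# Build a 256-entry table mapping each cp500 byte to its Latin-1 byte.
# errors="replace" guards against any code point outside Latin-1.
CP500_TO_LATIN1 = (
    bytes(range(256)).decode("cp500").encode("latin-1", errors="replace")
)


def remap_to_latin1(data: bytes) -> bytes:
    """Translate EBCDIC (cp500) bytes to their Latin-1 equivalents."""
    return data.translate(CP500_TO_LATIN1)


def text_score(data: bytes) -> float:
    """Hypothetical stand-in for a detector: the fraction of bytes that are
    ASCII letters, digits, whitespace, or common punctuation."""
    if not data:
        return 0.0
    common = b" \t\r\n.,;:'\"!?-"
    ok = sum(
        1 for b in data
        if b in common or (b < 0x80 and chr(b).isalnum())
    )
    return ok / len(data)


raw = "Hello, mainframe world!".encode("cp500")
remapped = remap_to_latin1(raw)
print(text_score(raw), text_score(remapped))
```

The remapped bytes should score much higher than the raw EBCDIC bytes, which is the point of mapping first and detecting afterwards.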


[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108034#comment-13108034 ] 

Michael McCandless commented on TIKA-720:
-----------------------------------------

Thanks Nick!  That actually sounds promising...


[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107977#comment-13107977 ] 

Nick Burch commented on TIKA-720:
---------------------------------

A few IBM-specific encodings are already supported in CharsetRecog_sbcs, but it looks like this one is missing.

We'll need to find some suitable detection ngrams, which shouldn't be too hard, as I seem to recall that EBCDIC puts a-z, A-Z and 0-9 in very different places from ASCII / the ISO-8859 formats.
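
As a quick illustration of that layout difference, here is a small Python sketch comparing byte positions in cp500 (one EBCDIC variant, chosen because the test file decodes with it) against ASCII. The helper name `byte_positions` is made up for this example.

```python
# Compare where letters, digits and space sit in cp500 (EBCDIC) versus ASCII.
samples = ["a", "z", "A", "Z", "0", "9", " "]


def byte_positions(chars, encoding):
    """Return {char: byte value} for a single-byte encoding."""
    return {c: c.encode(encoding)[0] for c in chars}


ebcdic = byte_positions(samples, "cp500")
ascii_ = byte_positions(samples, "ascii")

for c in samples:
    print("%r  cp500=0x%02X  ascii=0x%02X" % (c, ebcdic[c], ascii_[c]))

# Unlike ASCII, the cp500 alphabet is not contiguous: there is a gap
# between 'i' and 'j' (and another between 'r' and 's').
gap_i_to_j = "j".encode("cp500")[0] - "i".encode("cp500")[0]
print("gap between i and j:", gap_i_to_j)
```

In cp500 all letters and digits have the high bit set, which may be part of why detectors tuned for ASCII-compatible encodings guess multi-byte charsets for such input.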


[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112756#comment-13112756 ] 

Nick Burch commented on TIKA-720:
---------------------------------

Here's the thread (no replies yet...) on the ICU4J mailing list about adding the extra recognizers: http://sourceforge.net/mailarchive/message.php?msg_id=28126135


[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113394#comment-13113394 ] 

Nick Burch commented on TIKA-720:
---------------------------------

Turned out not to be too hard to add, even without any advice from the ICU4J project (though it may help that I did some EBCDIC <-> ASCII stuff years ago...!)

I've added the detector and tests in r1174680. 

Patch submitted upstream: http://bugs.icu-project.org/trac/ticket/8842


[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107979#comment-13107979 ] 

Nick Burch commented on TIKA-720:
---------------------------------

And on a related note, do we have any documentation anywhere for how to go about adding new recognisers? (What the byteMaps need to contain, and how best to build the ngrams)
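
For discussion, one plausible way to derive candidate ngrams from a sample file is sketched below in Python. It rests on two assumptions that this thread does not confirm: that a byte map folds every byte to a lowercase letter or a space, and that the ngrams are the most frequent three-byte sequences of the mapped text. `build_byte_map` and `top_trigrams` are hypothetical helpers, not part of Tika or ICU4J.

```python
from collections import Counter


def build_byte_map(encoding: str) -> bytes:
    """Hypothetical byte map: letters fold to their lowercase Latin-1 byte,
    every other byte becomes a space (0x20)."""
    table = bytearray(256)
    for b in range(256):
        ch = bytes([b]).decode(encoding, errors="replace")
        folded = ch.lower()
        if ch.isalpha() and len(folded) == 1 and ord(folded) < 256:
            table[b] = ord(folded)
        else:
            table[b] = 0x20
    return bytes(table)


def top_trigrams(sample: bytes, byte_map: bytes, n: int = 10):
    """Count three-byte sequences in the mapped sample, most frequent first."""
    mapped = sample.translate(byte_map)
    collapsed = b" ".join(mapped.split())  # collapse runs of spaces
    counts = Counter(
        collapsed[i:i + 3] for i in range(len(collapsed) - 2)
    )
    return counts.most_common(n)


byte_map = build_byte_map("cp500")
sample = "The cat and the hat".encode("cp500")
for trigram, count in top_trigrams(sample, byte_map, n=5):
    print(trigram, count)
```

A real recognizer would need a much larger and more representative sample per language than a single sentence.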


[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112592#comment-13112592 ] 

Nick Burch commented on TIKA-720:
---------------------------------

I've spent a bit of time studying the code (which comes from icu4j), and I think I know roughly how it works.

I've sent an email to the icu mailing list asking for some clarifications, though; hopefully, armed with the answers, we can add this support.

In the meantime, do you have some more sample files we could use for testing/ngram identification? Especially interesting would be ones in other variants of EBCDIC.


[jira] [Updated] (TIKA-720) EBCDIC encoding not detected

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-720:
------------------------------------

    Attachment: English_EBCDIC.txt


[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112668#comment-13112668 ] 

Michael McCandless commented on TIKA-720:
-----------------------------------------

Thanks Nick!  I'll see if I can find some more sample files..


[jira] [Resolved] (TIKA-720) EBCDIC encoding not detected

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-720.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.10

I think this was fixed in 0.10?
                