Posted to dev@tika.apache.org by "gross (JIRA)" <ji...@apache.org> on 2010/12/17 00:21:02 UTC

[jira] Created: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser

Support for IBM866 (CP866) encoding in TXTParser
------------------------------------------------

                 Key: TIKA-574
                 URL: https://issues.apache.org/jira/browse/TIKA-574
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.8
         Environment: GNU/Linux 2.6.35-23, openjdk6
            Reporter: gross
            Priority: Minor
             Fix For: 0.9, 1.0, 0.8
         Attachments: tika-0.8-cp866.patch

There is no recognizer for CP866 (the DOS Russian encoding) in Tika yet.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser

Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Valyanskiy updated TIKA-574:
----------------------------------

    Attachment: TIKA-574.patch

Thank you. I added a unit test for this issue.

> Support for IBM866 (CP866) encoding in TXTParser
> ------------------------------------------------
>
>                 Key: TIKA-574
>                 URL: https://issues.apache.org/jira/browse/TIKA-574
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.8
>         Environment: GNU/Linux 2.6.35-23, openjdk6
>            Reporter: gross
>            Priority: Minor
>             Fix For: 0.8, 0.9, 1.0
>
>         Attachments: tika-0.8-cp866.patch, TIKA-574.patch
>
>
> There's no recognizer for CP866 (DOS russian encoding) in tika yet.



[jira] Resolved: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser

Posted by "Kostya Gribov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kostya Gribov resolved TIKA-574.
--------------------------------

    Resolution: Fixed

> Support for IBM866 (CP866) encoding in TXTParser
> ------------------------------------------------
>
>                 Key: TIKA-574
>                 URL: https://issues.apache.org/jira/browse/TIKA-574
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.8
>         Environment: GNU/Linux 2.6.35-23, openjdk6
>            Reporter: Kostya Gribov
>            Priority: Minor
>             Fix For: 0.9, 1.0, 0.8
>
>         Attachments: tika-0.8-cp866.patch, TIKA-574.patch
>
>
> There's no recognizer for CP866 (DOS russian encoding) in tika yet.



[jira] Commented: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser

Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972445#action_12972445 ] 

Maxim Valyanskiy commented on TIKA-574:
---------------------------------------

Thank you. Committed in r1050348.

> Support for IBM866 (CP866) encoding in TXTParser
> ------------------------------------------------
>
>                 Key: TIKA-574
>                 URL: https://issues.apache.org/jira/browse/TIKA-574
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.8
>         Environment: GNU/Linux 2.6.35-23, openjdk6
>            Reporter: Kostya Gribov
>            Priority: Minor
>             Fix For: 0.8, 0.9, 1.0
>
>         Attachments: tika-0.8-cp866.patch, TIKA-574.patch
>
>
> There's no recognizer for CP866 (DOS russian encoding) in tika yet.



[jira] Updated: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser

Posted by "gross (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gross updated TIKA-574:
-----------------------

    Attachment: tika-0.8-cp866.patch

I reused the ngrams from cp1251 and wrote a custom byteMap. All Russian letters used in cp1251 are also present in cp866, so no changes to the ngrams are needed.

I added an inner static class to CharsetRecog_sbcs and modified CharsetDetector#createRecognizers to register it.
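
To illustrate the byteMap idea described above, here is a minimal, self-contained sketch. It is not the code from the attached patch. It assumes the byte map's job is to translate each CP866 input byte into the windows-1251 byte of the corresponding lowercase letter (and every non-letter into a space), so that trigram statistics built for cp1251 can be matched unchanged. The class name Cp866ByteMapSketch and the methods buildByteMap and normalize are hypothetical, and the sketch assumes the running JDK provides both the IBM866 and windows-1251 charsets.

```java
import java.nio.charset.Charset;

// Hypothetical illustration; not part of Tika or of the attached patch.
final class Cp866ByteMapSketch {
    static final Charset CP866 = Charset.forName("IBM866");
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Builds a 256-entry table: CP866 byte -> windows-1251 byte of the
    // lowercased letter. Anything that is not a letter maps to 0x20 (space).
    static byte[] buildByteMap() {
        byte[] map = new byte[256];
        for (int b = 0; b < 256; b++) {
            char c = new String(new byte[] {(byte) b}, CP866).charAt(0);
            byte mapped = 0x20;
            if (Character.isLetter(c)) {
                char lower = Character.toLowerCase(c);
                byte[] enc = String.valueOf(lower).getBytes(CP1251);
                // Accept the byte only if it round-trips, i.e. the letter
                // really exists in windows-1251 (no '?' substitution).
                if (enc.length == 1 && new String(enc, CP1251).charAt(0) == lower) {
                    mapped = enc[0];
                }
            }
            map[b] = mapped;
        }
        return map;
    }

    // Applies the table to raw CP866 input, producing bytes that can be
    // compared against trigrams collected for windows-1251.
    static byte[] normalize(byte[] input, byte[] map) {
        byte[] out = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = map[input[i] & 0xFF];
        }
        return out;
    }

    public static void main(String[] args) {
        // Cyrillic sample written with Unicode escapes to avoid depending
        // on the source file encoding.
        byte[] sample = "\u041f\u0440\u0438\u0432\u0435\u0442".getBytes(CP866);
        byte[] normalized = normalize(sample, buildByteMap());
        for (byte x : normalized) {
            System.out.printf("%02X ", x & 0xFF);
        }
        System.out.println();
    }
}
```

Deriving the table from the JDK charsets keeps the sketch short; a recognizer would more likely hard-code the 256 entries, as the existing single-byte recognizers do.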


> Support for IBM866 (CP866) encoding in TXTParser
> ------------------------------------------------
>
>                 Key: TIKA-574
>                 URL: https://issues.apache.org/jira/browse/TIKA-574
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.8
>         Environment: GNU/Linux 2.6.35-23, openjdk6
>            Reporter: gross
>            Priority: Minor
>             Fix For: 0.8, 0.9, 1.0
>
>         Attachments: tika-0.8-cp866.patch
>
>
> There's no recognizer for CP866 (DOS russian encoding) in tika yet.
