Posted to dev@tika.apache.org by "gross (JIRA)" <ji...@apache.org> on 2010/12/17 00:21:02 UTC
[jira] Created: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser
Support for IBM866 (CP866) encoding in TXTParser
------------------------------------------------
Key: TIKA-574
URL: https://issues.apache.org/jira/browse/TIKA-574
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 0.8
Environment: GNU/Linux 2.6.35-23, openjdk6
Reporter: gross
Priority: Minor
Fix For: 0.9, 1.0, 0.8
Attachments: tika-0.8-cp866.patch
Tika does not yet have a recognizer for CP866 (the DOS Russian encoding).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser
Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maxim Valyanskiy updated TIKA-574:
----------------------------------
Attachment: TIKA-574.patch
Thank you. I added a unit test for this issue.
[jira] Resolved: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser
Posted by "Kostya Gribov (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kostya Gribov resolved TIKA-574.
--------------------------------
Resolution: Fixed
> Support for IBM866 (CP866) encoding in TXTParser
> ------------------------------------------------
>
> Key: TIKA-574
> URL: https://issues.apache.org/jira/browse/TIKA-574
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.8
> Environment: GNU/Linux 2.6.35-23, openjdk6
> Reporter: Kostya Gribov
> Priority: Minor
> Fix For: 0.9, 1.0, 0.8
>
> Attachments: tika-0.8-cp866.patch, TIKA-574.patch
>
>
> There's no recognizer for CP866 (DOS russian encoding) in tika yet.
[jira] Commented: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser
Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972445#action_12972445 ]
Maxim Valyanskiy commented on TIKA-574:
---------------------------------------
Thank you. Committed in r1050348.
[jira] Updated: (TIKA-574) Support for IBM866 (CP866) encoding in TXTParser
Posted by "gross (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
gross updated TIKA-574:
-----------------------
Attachment: tika-0.8-cp866.patch
I reused the n-grams from cp1251 and wrote a custom byteMap. All Russian letters used in cp1251 are also present in cp866, so the n-grams need no changes.
I added an inner static class to CharsetRecog_sbcs and modified CharsetDetector#createRecognizers to register it.
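To illustrate the byteMap idea in the comment above, here is a minimal sketch. It is not the code from the attached patch: the class name Cp866ByteMapSketch and both method names are hypothetical, and the sketch assumes the map folds every cp866 letter to the lower-case cp1251 byte for the same letter, with all non-letters folded to a space. Only the standard cp866 and cp1251 code page layouts are relied on.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical illustration, not the class added by the patch.
public class Cp866ByteMapSketch {

    // Builds a 256-entry table: cp866 byte -> lower-case cp1251 byte.
    // Assumption: non-letters collapse to 0x20 (space).
    static int[] buildByteMap() {
        int[] map = new int[256];
        Arrays.fill(map, 0x20);
        for (int c = 'a'; c <= 'z'; c++) {
            map[c] = c;          // ASCII lower case stays as is
            map[c - 0x20] = c;   // ASCII upper case is lower-cased
        }
        for (int b = 0x80; b <= 0x9F; b++) {
            map[b] = 0xE0 + (b - 0x80); // cp866 upper-case letters -> cp1251 lower case
        }
        for (int b = 0xA0; b <= 0xAF; b++) {
            map[b] = 0xE0 + (b - 0xA0); // cp866 lower case, first half of the alphabet
        }
        for (int b = 0xE0; b <= 0xEF; b++) {
            map[b] = 0xF0 + (b - 0xE0); // cp866 lower case, second half of the alphabet
        }
        map[0xF0] = 0xB8; // cp866 upper-case IO -> cp1251 lower-case io
        map[0xF1] = 0xB8; // cp866 lower-case io -> cp1251 lower-case io
        return map;
    }

    // Folds cp866 input through the map and decodes the result as windows-1251,
    // which makes the effect of the mapping visible as text.
    static String foldToCp1251Lower(byte[] cp866) {
        int[] map = buildByteMap();
        byte[] out = new byte[cp866.length];
        for (int i = 0; i < cp866.length; i++) {
            out[i] = (byte) map[cp866[i] & 0xFF];
        }
        return new String(out, Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        // The Russian word for "hello", capitalized, encoded in cp866.
        byte[] hello = {(byte) 0x8F, (byte) 0xE0, (byte) 0xA8,
                        (byte) 0xA2, (byte) 0xA5, (byte) 0xE2};
        System.out.println(foldToCp1251Lower(hello));
    }
}
```

Because the input is folded into cp1251 byte values before any trigram lookup, a trigram table written for cp1251 can be consulted unchanged, which matches the comment's statement that the n-grams need no changes.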