Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/05/12 15:06:41 UTC
[jira] Commented: (TIKA-422) Wrong charset conversion in some RTF documents.
[ https://issues.apache.org/jira/browse/TIKA-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866559#action_12866559 ]
Jukka Zitting commented on TIKA-422:
------------------------------------
Does anyone know an alternative RTF parser in Java with a friendly license [1]? It looks like there's little we can do about this as long as we're stuck with the Swing RTF parser.
[1] http://www.apache.org/legal/resolved.html
> Wrong charset conversion in some RTF documents.
> -----------------------------------------------
>
> Key: TIKA-422
> URL: https://issues.apache.org/jira/browse/TIKA-422
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Reporter: Piotr B.
> Attachments: test-windows-1250.rtf
>
>
> The RTF parser uses javax.swing.text.rtf, but that parser is deficient.
> It doesn't support the '\ansicpg' keyword (quoting the RTF file format specification:
> "This keyword represents the default ANSI code page used to perform the Unicode to ANSI conversion when writing RTF text").
> Unfortunately, Windows WordPad saves non-ASCII characters using \ansicpg instead of the Unicode characters that javax.swing.text.rtf supports.
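To illustrate what honoring \ansicpg might involve, here is a minimal Java sketch. It is not Tika's or Swing's code, and the class and method names are hypothetical. It assumes that code page N maps to the Java charset named "windows-N" (which holds for 1250 through 1258 but not for code pages in general), falls back to windows-1252 when no usable \ansicpg is found, and handles only \'hh hex escapes rather than full RTF syntax.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a sketch of \ansicpg handling, not a full RTF parser.
final class AnsiCpgSketch {

    // Matches the \ansicpgN control word and captures the code page number N.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    // Assumption: code page N corresponds to the Java charset "windows-N".
    // Falls back to windows-1252 when the keyword is missing or unsupported.
    static Charset detectCharset(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        if (m.find()) {
            String name = "windows-" + m.group(1);
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        }
        return Charset.forName("windows-1252");
    }

    // Decodes \'hh escapes as bytes in the given charset; all other
    // characters are copied through unchanged.
    static String decodeHexEscapes(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\'", i)
                    && i + 4 <= text.length()
                    && isHex(text.charAt(i + 2))
                    && isHex(text.charAt(i + 3))) {
                pending.write(Integer.parseInt(text.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, out, charset);
                out.append(text.charAt(i));
                i++;
            }
        }
        flush(pending, out, charset);
        return out.toString();
    }

    // Decodes any buffered bytes in one go so multi-byte sequences stay intact.
    private static void flush(ByteArrayOutputStream pending, StringBuilder out, Charset charset) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), charset));
            pending.reset();
        }
    }

    private static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi\\ansicpg1250\\deff0 \\'b3\\'f3dka}";
        Charset charset = detectCharset(rtf);
        System.out.println(charset.name());
        System.out.println(decodeHexEscapes("\\'b3\\'f3dka", charset));
    }
}
```

With \ansicpg1250, the bytes 0xB3 and 0xF3 decode to the Polish letters U+0142 and U+00F3. Decoding the same bytes as windows-1252 would turn 0xB3 into U+00B3 (superscript three), which is the kind of wrong conversion this issue describes.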
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (TIKA-422) Wrong charset conversion in some RTF documents.
Posted by Oleg Tikhonov <ol...@gmail.com>.
Hi Jukka.
Here are my thoughts:
1. The RTF parser from Nutch:
http://www.docjar.com/docs/api/org/apache/nutch/parse/rtf/package-index.html
2. The OpenOffice Writer Java API:
http://wiki.services.openoffice.org/wiki/API/Samples/Java/Writer/TextDocumentStructure
Oleg.
--
Best regards, Oleg.