Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/05/12 15:06:41 UTC

[jira] Commented: (TIKA-422) Wrong charset conversion in some RTF documents.

    [ https://issues.apache.org/jira/browse/TIKA-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866559#action_12866559 ] 

Jukka Zitting commented on TIKA-422:
------------------------------------

Does anyone know an alternative RTF parser in Java with a friendly license [1]? It looks like there's little we can do about this as long as we're stuck with the Swing RTF parser.

[1] http://www.apache.org/legal/resolved.html


> Wrong charset conversion in some RTF documents.
> -----------------------------------------------
>
>                 Key: TIKA-422
>                 URL: https://issues.apache.org/jira/browse/TIKA-422
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Piotr B.
>         Attachments: test-windows-1250.rtf
>
>
> The RTF parser uses javax.swing.text.rtf, which handles character sets poorly.
> It doesn't support the '\ansicpg' keyword (quoting the RTF file format specification:
> "This keyword represents the default ANSI code page used to perform the Unicode to ANSI conversion when writing RTF text").
> Unfortunately, Windows WordPad saves non-ASCII characters using the \ansicpg code page instead of the Unicode characters that javax.swing.text.rtf supports.
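
To make the report concrete, here is a minimal sketch of how the code page declared by \ansicpg could be applied to the \'xx hex escapes that WordPad writes. It is not Tika's or Swing's implementation. The class and method names (AnsiCpgSketch, declaredCharset, decodeHexEscapes), the "windows-" + number charset mapping, and the windows-1252 fallback are all assumptions made for illustration. The mapping only covers the 125x family of code pages, and the sketch leaves every other RTF control word untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Tika or Swing.
final class AnsiCpgSketch {
    // Matches the \ansicpgN control word and captures the code page number.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");
    // Matches a run of consecutive \'xx hex escapes.
    private static final Pattern HEX_RUN = Pattern.compile("(?:\\\\'[0-9a-fA-F]{2})+");

    // Assumption: code page N maps to the Java charset "windows-N", and a
    // missing \ansicpg falls back to windows-1252. Charset.forName throws
    // UnsupportedCharsetException if the JVM does not know the name.
    static Charset declaredCharset(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        String name = m.find() ? "windows-" + m.group(1) : "windows-1252";
        return Charset.forName(name);
    }

    // Replaces each run of \'xx escapes with the text those bytes represent
    // in the declared code page; everything else is copied through as-is.
    static String decodeHexEscapes(String rtf) {
        Charset cs = declaredCharset(rtf);
        Matcher m = HEX_RUN.matcher(rtf);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(rtf, last, m.start());
            String run = m.group();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Each escape is four characters: backslash, apostrophe, two hex digits.
            for (int i = 0; i < run.length(); i += 4) {
                bytes.write(Integer.parseInt(run.substring(i + 2, i + 4), 16));
            }
            out.append(new String(bytes.toByteArray(), cs));
            last = m.end();
        }
        out.append(rtf, last, rtf.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample in the style of a WordPad file saved with code page 1250.
        String rtf = "{\\rtf1\\ansi\\ansicpg1250 Za\\'bf\\'f3\\'b3\\'e6}";
        System.out.println(decodeHexEscapes(rtf));
    }
}
```

Decoding the same bytes as windows-1252 instead of windows-1250 would yield different characters, which is the kind of wrong conversion this issue describes.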

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (TIKA-422) Wrong charset conversion in some RTF documents.

Posted by Oleg Tikhonov <ol...@gmail.com>.
Hi Jukka.
Here are my thoughts:
1. The RTF parser from Nutch:
http://www.docjar.com/docs/api/org/apache/nutch/parse/rtf/package-index.html

2. The OpenOffice Writer Java API:
http://wiki.services.openoffice.org/wiki/API/Samples/Java/Writer/TextDocumentStructure

Oleg.




-- 
Best regards, Oleg.