Posted to user@tika.apache.org by Cristian Vat <cr...@gmail.com> on 2009/09/11 23:57:34 UTC

RTF Parser - encoding issue

Hello,

From what I've seen, the RTFEditorKit used by the Tika RTFParser
has a problem with some encodings in RTF files.
I've seen it happen with RTF files generated by Microsoft Word that
contain Czech characters (one example being "ř", Unicode code point U+0159).
The problem seems to be how Word encodes certain characters in RTF files,
but I think the issue may also apply to other encodings and
other editors.
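
A minimal sketch for checking this outside Tika, calling the JDK's
RTFEditorKit directly. The class name RtfProbe and the sample RTF string are
made up for illustration. The sample assumes the character is written as the
hex escape \'f8 under \ansicpg1250 (0xF8 is "ř" in Windows-1250); whether Word
emits exactly this form in the affected files is an assumption. The program
prints the code points it extracted so they can be compared with U+0159; the
result may differ between JDK versions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper, not part of Tika: feeds an RTF string to the JDK's
// RTFEditorKit and returns the plain text it extracted.
final class RtfProbe {

    static String extractText(String rtf) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        // RTF source is 7-bit ASCII; non-ASCII characters appear as escapes.
        InputStream in = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII));
        kit.read(in, doc, 0);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Assumed encoding of "ř": hex escape f8 with the Windows-1250 code page declared.
        String rtf = "{\\rtf1\\ansi\\ansicpg1250 \\'f8}";
        String text = extractText(rtf);
        // Print each extracted code point; a correct decode would show U+0159.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}
```

If the printed code point is not U+0159, the escape was decoded with the wrong
code page, which would match the behavior described above.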

It's clearly not a tika-specific issue, but I'd like to know if there are
plans to improve the rtf support maybe by using a different library with
better support.

I see this behavior in Tika 0.4, built with JDK 1.5.0_16 and run with
either JDK 1.5.0_16 or JDK 1.6.0_13; the result is the same on both.

-
Cristian Vat

Re: RTF Parser - encoding issue

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Sep 11, 2009 at 11:57 PM, Cristian Vat <cr...@gmail.com> wrote:
> It's clearly not a tika-specific issue, but I'd like to know if there are
> plans to improve the rtf support maybe by using a different library with
> better support.

There are currently no plans for that, but you can influence them by
filing an improvement request in the Tika issue tracker at
https://issues.apache.org/jira/browse/TIKA. Even better if you have
some proposals on how we could or should implement such improvements.

BR,

Jukka Zitting