Posted to dev@tika.apache.org by "Thiago Souza (JIRA)" <ji...@apache.org> on 2010/11/10 20:30:14 UTC

[jira] Commented: (TIKA-392) RTF parser smashes words together in subsequent table cells

    [ https://issues.apache.org/jira/browse/TIKA-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930726#action_12930726 ] 

Thiago Souza commented on TIKA-392:
-----------------------------------

This extra space is being added inside words that contain accented letters, because the insertString method is invoked separately for each accented letter.

For example, this phrase inside an RTF document (in Portuguese):

       "GOVERNO DO ESTADO DO ESPÍRITO SANTO"

is extracted as:

       "GOVERNO DO ESTADO DO ESP Í RITO SANTO"

because insertString is invoked three times, with "GOVERNO DO ESTADO DO ESP", "Í" and "RITO SANTO".
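
A minimal sketch of the effect, assuming the text collector appends a space after every insertString call (this is an assumption, not confirmed against the RTFParser source). ChunkCollector and its methods are hypothetical stand-ins, not Tika or Swing classes; the accented letter is written as the escape \u00CD to avoid source-encoding problems.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a document that receives text chunks from an RTF reader.
final class ChunkCollector {
    private final StringBuilder text = new StringBuilder();
    private final String separator;

    ChunkCollector(String separator) {
        this.separator = separator;
    }

    // Mimics an insertString override that appends a separator after each chunk.
    // ASSUMPTION: the real parser behaves similarly; this has not been verified.
    void insertString(String chunk) {
        text.append(chunk).append(separator);
    }

    // Returns the collected text without leading or trailing whitespace.
    String text() {
        return text.toString().trim();
    }

    // Feeds all chunks through a fresh collector and returns the result.
    static String collect(List<String> chunks, String separator) {
        ChunkCollector collector = new ChunkCollector(separator);
        for (String chunk : chunks) {
            collector.insertString(chunk);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        // The three chunks reported above for the Portuguese phrase.
        List<String> chunks = Arrays.asList(
                "GOVERNO DO ESTADO DO ESP", "\u00CD", "RITO SANTO");

        // A space after every chunk splits the word around the accented letter.
        System.out.println(collect(chunks, " "));

        // No separator keeps the word intact, but would also merge
        // neighbouring table cells, which is the original TIKA-392 problem.
        System.out.println(collect(chunks, ""));
    }
}
```

The sketch illustrates the trade-off: a separator added unconditionally per call repairs cell boundaries but breaks any word that the reader delivers in several chunks.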

I don't know whether the problem is in RTFEditorKit or in RTFParser.

Is there a workaround?

> RTF parser smashes words together in subsequent table cells
> -----------------------------------------------------------
>
>                 Key: TIKA-392
>                 URL: https://issues.apache.org/jira/browse/TIKA-392
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.7
>
>
> I have an RTF document with the following snippet of content (it's an export of a private phone book so I can't share the full document):
> {\rtlch\fcs1 \af0\afs24 \ltrch\fcs0 \f0\fs24\lang2055\langfe2055\langfenp2055\insrsid9461491\charrsid9461491 Fax / Phone Station\cell Fax / Phone #\cell }
> The extracted text is:
> Fax / Phone StationFax / Phone
> Note how the cell boundary between "Station" and "Fax" is lost.
