Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/01 21:12:10 UTC

[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-683:
------------------------------------

    Attachment: TIKA-683.patch

Attached patch, with a first cut at using a simple (shallow) tokenizer
to interpret the specific RTF control words that determine what text
is rendered.  I built this using the 1.9.1 RTF specification:

  http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=10725
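
For readers unfamiliar with RTF's token structure, here is a minimal
sketch of what a shallow tokenizer along these lines could look like.
It is not the code in the attached patch: the class and method names
are hypothetical, and it only assumes the basic token shapes from the
RTF specification (group braces, control words with an optional
numeric parameter, control symbols such as the \'hh hex escape, and
plain text).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; class and method names are not from TIKA-683.patch.
final class RtfTokenSketch {

    // Splits RTF source into "{", "}", control words/symbols, and text runs.
    static List<String> tokenize(String rtf) {
        List<String> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int n = rtf.length();
        int i = 0;
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '{' || c == '}') {
                flush(text, tokens);
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '\\' && i + 1 < n) {
                flush(text, tokens);
                int start = i++;
                if (isLetter(rtf.charAt(i))) {
                    while (i < n && isLetter(rtf.charAt(i))) i++;
                    // Optional signed decimal parameter, e.g. the 1 in \rtf1.
                    if (i + 1 < n && rtf.charAt(i) == '-' && isDigit(rtf.charAt(i + 1))) i++;
                    while (i < n && isDigit(rtf.charAt(i))) i++;
                    tokens.add(rtf.substring(start, i));
                    // One space after a control word is a delimiter, not text.
                    if (i < n && rtf.charAt(i) == ' ') i++;
                } else {
                    // Control symbol: backslash plus one non-letter character.
                    // The hex escape \'hh also carries two hex digits.
                    i += (rtf.charAt(i) == '\'' && i + 2 < n) ? 3 : 1;
                    tokens.add(rtf.substring(start, i));
                }
            } else if (c == '\r' || c == '\n') {
                i++; // raw line breaks carry no text in RTF
            } else {
                text.append(c);
                i++;
            }
        }
        flush(text, tokens);
        return tokens;
    }

    // Emits the pending text run, if any, as a single token.
    private static void flush(StringBuilder text, List<String> tokens) {
        if (text.length() > 0) {
            tokens.add(text.toString());
            text.setLength(0);
        }
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

A token that is "{" or "}", or that starts with a backslash, is
markup; anything else is a text run.  A shallow parser can then act
only on the handful of control words that affect the rendered text
and ignore the rest.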

It's still rough (many nocommits) but I think it's close.  All tests
pass, including a few new RTF test cases I've added.

I just created a custom tokenizer (the allowed RTF tokens are very
simple) and a shallow parser.  I think later we can/should cut over
to a "real" tokenizer/parser (e.g. JFlex)...

The new parser does a better job at extracting some doc structure: the
current parser just makes a single paragraph, but the new one makes a
paragraph wherever the doc marks one.  It doesn't yet give structure
for tables or lists, though it does extract their text.

It finds text that the old parser missed, e.g. footnotes, hyperlinks,
headers/footers, and text inside a picture, and [generally] does not
add extra whitespace (the old one sometimes breaks a word by inserting
a space).  Finally, the new parser fixes the unicode character
doubling (this issue)...
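
As background on the doubling: in the RTF specification, each Unicode
escape (the u control word with a decimal parameter) is followed by
fallback characters for readers that do not understand Unicode, and
the uc control word says how many of them to skip (default 1).  A
reader that emits both the Unicode character and its fallback will
double the character; that this is what the old parser does is an
inference from the attached test file's name, not something verified
here.  The sketch below shows only the skipping rule.  Its names are
hypothetical, it ignores group scoping of the uc value, and it drops
every other control word.

```java
// Hypothetical sketch; names are illustrative, not from TIKA-683.patch.
final class RtfUnicodeSketch {

    // Decodes a flat RTF fragment, honoring the uc/u fallback-skipping rule.
    static String decode(String rtf) {
        StringBuilder out = new StringBuilder();
        int ucSkip = 1;      // spec default: one fallback character per Unicode escape
        int pendingSkip = 0; // fallback characters still to discard
        int n = rtf.length();
        int i = 0;
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '\\' && i + 3 < n && rtf.charAt(i + 1) == '\'') {
                // Hex escape \'hh counts as a single fallback character.
                if (pendingSkip > 0) {
                    pendingSkip--;
                } else {
                    // Assumption: treat the byte as ISO-8859-1; a real parser
                    // would decode it with the document's code page.
                    out.append((char) Integer.parseInt(rtf.substring(i + 2, i + 4), 16));
                }
                i += 4;
            } else if (c == '\\') {
                int j = i + 1;
                while (j < n && isLetter(rtf.charAt(j))) j++;
                String word = rtf.substring(i + 1, j);
                int k = j;
                if (k + 1 < n && rtf.charAt(k) == '-' && isDigit(rtf.charAt(k + 1))) k++;
                while (k < n && isDigit(rtf.charAt(k))) k++;
                String param = rtf.substring(j, k);
                if (word.equals("uc") && !param.isEmpty()) {
                    ucSkip = Integer.parseInt(param);
                } else if (word.equals("u") && !param.isEmpty()) {
                    // The parameter is a signed 16-bit value; the cast wraps negatives.
                    out.append((char) Integer.parseInt(param));
                    pendingSkip = ucSkip;
                }
                // Any other control word is dropped in this sketch.
                if (k < n && rtf.charAt(k) == ' ') k++; // delimiter space
                i = k;
            } else if (c == '{' || c == '}') {
                pendingSkip = 0; // a group boundary ends fallback skipping
                i++;
            } else if (c == '\r' || c == '\n') {
                i++; // raw line breaks carry no text in RTF
            } else {
                if (pendingSkip > 0) {
                    pendingSkip--; // discard fallback text
                } else {
                    out.append(c);
                }
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

With this rule, a two-character fragment whose escapes are each
followed by a "?" fallback decodes to two characters; without it, the
output would contain the two characters plus both fallbacks.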

One thing I still have to fix is that it can output mismatched tags
for i/b styles.  Spookily, nothing failed; maybe we should add simple
validation (under asserts), e.g. to XHTMLContentHandler?
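
To make the validation idea concrete, here is a minimal sketch of a
SAX handler that tracks open elements on a stack and complains when
an end tag does not match.  It is a standalone, hypothetical class
built on the JDK's org.xml.sax API, not a change to
XHTMLContentHandler, and it throws SAXException rather than using
assert so that the check also runs when assertions are disabled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not part of Tika or of TIKA-683.patch.
final class TagBalanceChecker extends DefaultHandler {

    // Qualified names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String expected = open.poll(); // null if nothing is open
        if (!qName.equals(expected)) {
            throw new SAXException(
                "Mismatched end tag </" + qName + ">, expected </" + expected + ">");
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed elements: " + open);
        }
    }
}
```

A sequence such as <b><i>...</b></i> fails at the </b> event, which
is the kind of i/b mismatch described above.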


> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters
