Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2011/07/15 16:23:00 UTC

[jira] [Created] (TIKA-683) RTF Parser issues with non european characters

RTF Parser issues with non european characters
----------------------------------------------

                 Key: TIKA-683
                 URL: https://issues.apache.org/jira/browse/TIKA-683
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
            Reporter: Nick Burch


As reported on user@ in "non-West European languages support":
  http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E

The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085021#comment-13085021 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------


NOTE: I know very little about RTF!  So please forgive/correct any
confusions below:

It looks like we need a stack to record the \ucN control chars we've
encountered, at each depth, and we must then skip N ansi chars after
each \uXXXX we see?  (Similarly to how we track the charset with
charsetQueue now).

Ie, on seeing \uXXXX (possibly followed by trailing space, which does
not count in the skip count), we parse and keep that XXXX unicode
character, re-emitting the \uXXXX in our output data, but then we
remove the following N ansi chars.
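
A minimal sketch of the skip logic described above, for illustration only. The class name UcSkipSketch and its extract() method are hypothetical and are not part of RTFParser. It assumes that each plain character or \'XX escape counts as one skipped ansi char, that \'XX can be decoded as Latin-1 (a real parser would use the active charset), and that control symbols are plain literals.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch, not Tika code: text extraction that honors uc skip counts. */
final class UcSkipSketch {

    static String extract(String rtf) {
        StringBuilder out = new StringBuilder();
        Deque<Integer> ucStack = new ArrayDeque<>(); // saved uc value per open group
        int uc = 1;          // RTF default: one fallback char after each u control word
        int pendingSkip = 0; // fallback chars still to drop
        int n = rtf.length();
        int i = 0;
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '{') {                       // group start: remember current uc
                ucStack.push(uc);
                pendingSkip = 0;
                i++;
            } else if (c == '}') {                // group end: restore uc, cancel skips
                if (!ucStack.isEmpty()) {
                    uc = ucStack.pop();
                }
                pendingSkip = 0;
                i++;
            } else if (c == '\r' || c == '\n') {  // raw line breaks carry no text
                i++;
            } else if (c != '\\') {               // plain ansi char
                if (pendingSkip > 0) pendingSkip--; else out.append(c);
                i++;
            } else if (i + 1 < n && rtf.charAt(i + 1) == '\'') { // hex escape \'XX
                int hi = i + 2 < n ? Character.digit(rtf.charAt(i + 2), 16) : -1;
                int lo = i + 3 < n ? Character.digit(rtf.charAt(i + 3), 16) : -1;
                i = Math.min(i + 4, n);
                if (pendingSkip > 0) pendingSkip--;
                else if (hi >= 0 && lo >= 0) out.append((char) (hi * 16 + lo)); // Latin-1 assumption
            } else {
                i++;                              // consume the backslash
                int start = i;
                while (i < n && isLetter(rtf.charAt(i))) i++;
                String word = rtf.substring(start, i);
                if (word.isEmpty()) {             // control symbol, treated as a literal here
                    if (i < n) {
                        if (pendingSkip > 0) pendingSkip--; else out.append(rtf.charAt(i));
                        i++;
                    }
                    continue;
                }
                int numStart = i;
                if (i + 1 < n && rtf.charAt(i) == '-' && isDigit(rtf.charAt(i + 1))) i++;
                while (i < n && isDigit(rtf.charAt(i))) i++;
                boolean hasParam = i > numStart;
                int param = hasParam ? Integer.parseInt(rtf.substring(numStart, i)) : 0;
                if (i < n && rtf.charAt(i) == ' ') i++; // one delimiter space, not counted as text
                if (word.equals("uc") && hasParam) {
                    uc = Math.max(param, 0);
                } else if (word.equals("u") && hasParam) {
                    out.append((char) (param & 0xFFFF)); // parameter is a signed 16-bit value
                    pendingSkip = uc;
                }
                // every other control word is ignored in this sketch
            }
        }
        return out.toString();
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

With this sketch, the fallback '?' after a \u8364 under \uc1 is dropped, and a group end cancels any skips still pending, which is the property the {\u20}\cell suggestion below relies on.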

Some other things I noticed in RTFParser.java; I'm not sure if they
are really a problem in practice:

  * I'm worried about how we replace \cell with \u0020\cell --
    depending on the last \ucN control word, this could mean we
    incorrectly skip some number of ansi chars?  Changing to
    {\u20}\cell would be safer since on group end the pending skip
    chars are reset to 0.

  * But then I also wonder if all the additional groups we are
    creating (because we surround each \uXXXX with { }) are somehow
    costly, eg if it causes RTFEditorKit to use more RAM / be slower /
    something.

  * When we look for the \ansicpgNNNN control word, I noticed we then
    look up the NNNN in the FONTSET_MAP -- is that wrong?  EG when I
    look at the possible values for NNNN (at
    http://latex2rtf.sourceforge.net/rtfspec_6.html) I see a bunch of
    numbers that aren't in the FONTSET_MAP.  We also use FONTSET_MAP
    for \fcharsetNNN but the values for that control word look
    correct.

  * We don't seem to handle the opening charset in the RTF header (ie,
    \ansi, \mac, \pc, \pca)?
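
To illustrate the \ansicpgNNNN versus \fcharsetNNN point above: the two control words use different number spaces, so a single lookup table cannot serve both. The sketch below is hypothetical (RtfCharsetSketch and its methods are not Tika's FONTSET_MAP) and covers only a handful of well-known values; the charset names are returned as plain strings and unknown values yield null.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not Tika's FONTSET_MAP: separate lookups for the two control words. */
final class RtfCharsetSketch {

    // fcharset values are Windows font charset identifiers (partial list)
    private static final Map<Integer, String> FONT_CHARSETS = new HashMap<>();
    static {
        FONT_CHARSETS.put(0, "windows-1252");   // ANSI
        FONT_CHARSETS.put(128, "windows-31j");  // Shift-JIS
        FONT_CHARSETS.put(161, "windows-1253"); // Greek
        FONT_CHARSETS.put(162, "windows-1254"); // Turkish
        FONT_CHARSETS.put(204, "windows-1251"); // Russian
    }

    /** Charset name for an fcharset parameter, or null if this sketch does not know it. */
    static String forFontCharset(int fcharset) {
        return FONT_CHARSETS.get(fcharset);
    }

    /** Charset name for an ansicpg parameter (a Windows code page number), or null. */
    static String forAnsiCodePage(int codePage) {
        if (codePage >= 1250 && codePage <= 1258) {
            return "windows-" + codePage;
        }
        if (codePage == 932) {
            return "windows-31j";
        }
        return null;
    }
}
```

Here forAnsiCodePage(1251) and forFontCharset(204) both give "windows-1251", while forFontCharset(1251) gives null, which is the kind of miss a shared table would produce.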


> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087694#comment-13087694 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------

> Right now I'm trying to figure out if I can add that behavior by subclassing RTFEditorKit/RTFReader.

Ooh that sounds interesting!  Does it have enough hooks so a subclass
can "tag along" to know what font is in-use and then intercept the
\'XX hex escapes?

Poaching either Harmony's parser or maybe OpenOffice's (C, but we
could port the parts we poach to Java) seems like a good way to go?

Either that or we make our own simple tokenizer?  The RTF spec looks
[relatively] simple enough, and Tika only needs to get the text out
(at least for today?), so we need not do heavy parsing of all
formatting / document structure.  A simple tokenizer that just decoded
the control words we care about (charset, font default, charset,
table) should work well and be robust to parser bugs / small errors in
the doc.
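
As a rough idea of what such a simple tokenizer could look like, here is a sketch; it is not the patch that was later attached. RtfTokenizerSketch, Token and Kind are hypothetical names. The sketch only splits the input into group markers, control words (with an optional numeric parameter) and text runs; deciding which control words matter is left to the caller, and unknown ones can simply be ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Tika code: splits RTF into groups, control words and text. */
final class RtfTokenizerSketch {

    enum Kind { GROUP_START, GROUP_END, CONTROL, TEXT }

    static final class Token {
        final Kind kind;
        final String text;   // control word name, control symbol, or literal text
        final Integer param; // numeric parameter, or null when absent

        Token(Kind kind, String text, Integer param) {
            this.kind = kind;
            this.text = text;
            this.param = param;
        }
    }

    static List<Token> tokenize(String rtf) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int n = rtf.length();
        int i = 0;
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '\r' || c == '\n') {          // raw line breaks carry no text
                i++;
            } else if (c == '{' || c == '}') {
                flush(text, tokens);
                tokens.add(new Token(c == '{' ? Kind.GROUP_START : Kind.GROUP_END, "", null));
                i++;
            } else if (c == '\\') {
                flush(text, tokens);
                i = readControl(rtf, i + 1, tokens);
            } else {
                text.append(c);
                i++;
            }
        }
        flush(text, tokens);
        return tokens;
    }

    /** Reads one control word or symbol starting just after the backslash; returns the next index. */
    private static int readControl(String rtf, int i, List<Token> tokens) {
        int n = rtf.length();
        int start = i;
        while (i < n && isLetter(rtf.charAt(i))) i++;
        if (i == start) {                          // control symbol
            if (i >= n) return i;                  // lone trailing backslash
            char sym = rtf.charAt(i++);
            if (sym == '\'' && i + 1 < n
                    && Character.digit(rtf.charAt(i), 16) >= 0
                    && Character.digit(rtf.charAt(i + 1), 16) >= 0) {
                int value = Character.digit(rtf.charAt(i), 16) * 16
                        + Character.digit(rtf.charAt(i + 1), 16);
                tokens.add(new Token(Kind.CONTROL, "'", value)); // hex escape, value is the raw byte
                return i + 2;
            }
            tokens.add(new Token(Kind.CONTROL, String.valueOf(sym), null));
            return i;
        }
        String word = rtf.substring(start, i);
        int numStart = i;
        if (i + 1 < n && rtf.charAt(i) == '-' && isDigit(rtf.charAt(i + 1))) i++;
        while (i < n && isDigit(rtf.charAt(i))) i++;
        Integer param = i > numStart ? Integer.valueOf(rtf.substring(numStart, i)) : null;
        if (i < n && rtf.charAt(i) == ' ') i++;    // one delimiter space belongs to the control word
        tokens.add(new Token(Kind.CONTROL, word, param));
        return i;
    }

    private static void flush(StringBuilder text, List<Token> tokens) {
        if (text.length() > 0) {
            tokens.add(new Token(Kind.TEXT, text.toString(), null));
            text.setLength(0);
        }
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

A text extractor built on top of this would walk the token list, track the active charset and font per group, and emit only the TEXT tokens and decoded escapes.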

I'm also worried about the test coverage of our RTF
parsing... would be nice to find (or somehow randomly generate) some
biggish collection of RTF + "expected text" test cases.  Maybe we can
poach tests from OpenOffice....

I noticed some tests allow for / expect extra whitespace to be
inserted in the returned text, but that makes me nervous... I think
(ideally) Tika shouldn't insert extra whitespace if we can help it.
Though, some cases likely need it, eg text from adjacent table cells.


> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Assigned] (TIKA-683) RTF Parser issues with non european characters

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-683:
--------------------------------------

    Assignee: Chris A. Mattmann

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-683:
----------------------------

    Attachment: testRTFJapanese.rtf

Add test file. Based on Jp_euc-jp_rtf1.rtf from http://mail-archives.apache.org/mod_mbox/tika-user/201106.mbox/%3COF03CF5CF6.40C9789F-ONC22578BC.0035A24F-C22578BC.0036C220@il.ibm.com%3E but with images removed to keep the size sane

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: testRTFJapanese.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

Posted by "Cristian Vat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cristian Vat updated TIKA-683:
------------------------------

    Attachment: TIKA-683.patch

Patch with reduced test file and new test for character doubling in RTFParserTest

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

Posted by "Cristian Vat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cristian Vat updated TIKA-683:
------------------------------

    Attachment: testUnicodeUCNControlWordCharacterDoubling.rtf

Test file for \ucN control word character doubling

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095904#comment-13095904 ] 

Jukka Zitting commented on TIKA-683:
------------------------------------

+1, I'm eager to see us drop the javax.swing dependency with something we can directly fix and improve.

The org.apache.tika.sax.SafeContentHandler class already does some sanitization of SAX events, so that might be a good place to also check that tags are correctly nested. Though as Uwe said, ideally the generator of the SAX events would already take care of producing valid output.

PS. I'd rather use a separate .java file for the ExtractRTFText class than have it as a static inner class inside RTFParser. We can keep it package-private if we don't want to expose it directly to downstream clients.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086384#comment-13086384 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------

Thanks Chris!

Actually both Cristian's patch and mine are test cases.

Cristian's test case fails (showing this issue); we don't yet have a patch to fix this issue (but we know what's wrong -- we have to handle the \ucN control codes).

My test case (TIKA-683-unicode-testcase.patch) passes and can be committed right away -- it's testing another aspect of RTF+Unicode which (happily) seems to be working correctly.

I also attached a new test case, passing, on TIKA-422, so if you could commit that one also that'd be great!

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105362#comment-13105362 ] 

Chris A. Mattmann commented on TIKA-683:
----------------------------------------

Hey Mike, +1 to commit, go for it!

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105499#comment-13105499 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------

I opened TIKA-715 for the mis-matched XHTML events.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086410#comment-13086410 ] 

Chris A. Mattmann commented on TIKA-683:
----------------------------------------

Thanks Mike, I went ahead and committed your patch in TIKA-422 (r1158779) and your unit test patch in TIKA-683 in r1158785.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095644#comment-13095644 ] 

Uwe Schindler commented on TIKA-683:
------------------------------------

SAX handling does not validate element names, for example that each closing element matches its opening element. In most cases the serializer just outputs the elements it gets reported, so some of those serializers will go crazy :-)

The reason is that SAX is seldom used to generate XML documents; it is mostly XML parsers that report the elements they found in an XML document. Those parsers do the validation beforehand, so in theory your parser must do this too. For speed reasons there are no checks in serializers. You can enforce checks by piping everything through the javax.xml.validation API, but that would also check against a schema, which does not really exist for XHTML.
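
For illustration, a nesting check can be layered in front of any ContentHandler without going through schema validation. The sketch below is an assumption about how that might look, not existing Tika code: NestingCheckSketch is a hypothetical name, and it builds on the JDK's org.xml.sax.helpers.XMLFilterImpl, which forwards events to the handler set via setContentHandler().

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical sketch: rejects SAX event streams whose start/end elements do not nest. */
final class NestingCheckSketch extends XMLFilterImpl {

    private final Deque<String> open = new ArrayDeque<>(); // qNames of currently open elements

    NestingCheckSketch(ContentHandler downstream) {
        setContentHandler(downstream); // all events are forwarded here after the check
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        open.push(qName);
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String expected = open.poll(); // null when no element is open
        if (!qName.equals(expected)) {
            throw new SAXException("Mismatched end element </" + qName + ">, innermost open element: "
                    + expected);
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed element <" + open.peek() + "> at end of document");
        }
        super.endDocument();
    }
}
```

The cost is one stack push or pop per element, so a check like this is cheap compared with full validation; whether it belongs in the parser or in a shared handler is the design question discussed in this thread.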

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-683:
------------------------------------

    Attachment: TIKA-683-unicode-testcase.patch

I was curious/nervous whether the RTFParser (and RTF format itself) properly handled non-BMP unicode characters, so with Robert Muir's help I created a basic test case (attached) and indeed at least for these Gothic characters in particular non-BMP is handled fine: the test passes.

It turns out (apparently) each \uXXXX is a UTF-16 code unit, not a Unicode code point.
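
A small sketch of what that means in practice, assuming the \u parameter is a signed 16-bit decimal as in the RTF spec. RtfUtf16Sketch is a hypothetical helper, not part of the attached test case. A non-BMP character such as the Gothic letter U+10330 arrives as two consecutive values that form a surrogate pair once each is cast to a Java char.

```java
/** Hypothetical sketch: turns consecutive u control word parameters into a Java string. */
final class RtfUtf16Sketch {

    /** Each parameter is one UTF-16 code unit, written as a signed 16-bit decimal. */
    static String decode(int... params) {
        StringBuilder sb = new StringBuilder();
        for (int p : params) {
            sb.append((char) (p & 0xFFFF)); // negative values map to 0x8000..0xFFFF
        }
        return sb.toString();
    }
}
```

For example, decode(-10240, -8400) yields the two code units 0xD800 and 0xDF30, which Java treats as the single code point U+10330.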

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066026#comment-13066026 ] 

Nick Burch commented on TIKA-683:
---------------------------------

I couldn't use the test as-is, as it contains raw Japanese characters in an unknown encoding (rather than \uxxxx escape sequences), and the sample file was too large.

I've re-saved the sample file without the images, and tested with that. That does extract exactly as expected - no doubling up occurs. I've added a unit test for this in r1147200.

Are you able to get a small RTF file that does show the problem, along with a suitable unit test similar to the testJapaneseText() method in RTFParserTest?

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: testRTFJapanese.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092320#comment-13092320 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------

I'm now testing the approach of just making our own simple RTF tokenizer, that handles those control words relevant to the text that we need... I'll post a patch once I have something sort of working.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Resolved] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-683.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0

I'll open a follow-on issue for the mis-matched XHTML events from some parsers....

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105413#comment-13105413 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------

Thanks Chris, I'll commit today!

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086418#comment-13086418 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------

Super, thanks Chris!

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087610#comment-13087610 ] 

Jukka Zitting commented on TIKA-683:
------------------------------------

> Just in case it can't be done with subclassing, does anybody know what the licensing
> restrictions on the JDK classes are (mainly RTFEditorKit and RTFReader)?

They should be available under GPLv2 from the OpenJDK project.

And it actually looks like Apache Harmony added an initial ALv2-licensed RTF parser
in HARMONY-5903. I haven't tried that code yet.


> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089072#comment-13089072 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------

Sorry, wrong issue -- that last patch was meant for TIKA-692.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Cristian Vat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087367#comment-13087367 ] 

Cristian Vat commented on TIKA-683:
-----------------------------------

Thanks Mike for looking into the issues. I also know very little about RTF :)

Yes, the skipping basically amounts to skipping N ANSI chars.
The JDK RTFEditorKit/Reader already does this, and does it well as far as I could see.

There are also other flaws in the filtering we currently do. For example, skipping of binary data sequences is not handled correctly...

I went through all the classes in/used-by RTFEditorKit and it appears that it handles most things correctly except the "\'xx" escape where it uses a default translation table not taking into account the current font charset.
Right now I'm trying to figure out if I can add that behavior by subclassing RTFEditorKit/RTFReader. I think that would be the best solution to this issue and other related ones. It might also avoid temporary files and improve performance.
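
A minimal sketch of what charset-aware handling of the "\'xx" escape could look like, assuming the charset of the current font is already known. The class and method names (HexEscapeSketch, decode) are hypothetical and are not part of RTFEditorKit or Tika; the only point is that consecutive escapes are collected as bytes and decoded together with the font's charset rather than one at a time through a fixed table.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class HexEscapeSketch {

    /**
     * Decodes a run of hex escapes (each one: backslash, single quote,
     * two hex digits) using the charset of the current font.
     * Hypothetical helper, for illustration only.
     */
    public static String decode(String escapes, Charset fontCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i + 4 <= escapes.length(); i += 4) {
            if (escapes.charAt(i) != '\\' || escapes.charAt(i + 1) != '\'') {
                throw new IllegalArgumentException("Not a hex escape at offset " + i);
            }
            bytes.write(Integer.parseInt(escapes.substring(i + 2, i + 4), 16));
        }
        // Decode all bytes in one go so multi-byte characters are not split.
        return new String(bytes.toByteArray(), fontCharset);
    }
}
```

For example, with Shift_JIS assumed as the font charset (an illustrative choice, and the charset must be available in the JRE), the two escapes \'94\'4e decode to the single character U+5E74, whereas decoding the same bytes as ISO-8859-1 yields two unrelated characters.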

Just in case it can't be done with subclassing, does anybody know what the licensing restrictions on the JDK classes are (mainly RTFEditorKit and RTFReader)? It may be doable by modifying them a little...

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-683:
------------------------------------

    Attachment: testWORD_bold_character_runs2.docx
                testWORD_bold_character_runs.docx
                TIKA-683.patch

New patch attached, including the last (pretty-print) patch. I also noticed that the OOXML Word parser split up adjacent bold character runs, so I fixed that and added 2 docx files for testing.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Cristian Vat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080488#comment-13080488 ] 

Cristian Vat commented on TIKA-683:
-----------------------------------

I managed to take the original file and slim it down to (possibly) the smallest test case. See "testUnicodeUCNControlWordCharacterDoubling.rtf" (566 bytes).

Test file contains only one character ( \u5E74 ). Checked with latest Tika SVN and it is doubled.

The character is defined both as an RTF Unicode escape ( \uXXXX ) and as two RTF charset/font-specific byte escapes ( \'xx ).
The file is correct, since it does specify a Unicode skip count, but the parser does not take that into account.

Checked only with RTFEditorKit and that parses fine.
This is most likely caused by the changes in TIKA-422, which don't take the \ucN control word into account and thus show both versions of the character.
I'll try to look over the code and see what can be done.
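
A minimal sketch of the skip behaviour described above, not the actual Tika or RTFEditorKit code. The class name UcSkipSketch and its extract method are hypothetical. It assumes the \uN parameter is written in decimal (U+5E74 is 24180), that the skip count defaults to 1 when no \ucN has been seen, and that the count is saved and restored at group boundaries. Bytes from "\'xx" escapes that are not skipped are mapped as Latin-1 purely for simplicity, and control symbols such as escaped braces are not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class UcSkipSketch {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Extracts text, dropping the fallback chars that follow each Unicode escape. */
    public static String extract(String rtf) {
        StringBuilder out = new StringBuilder();
        Deque<Integer> ucStack = new ArrayDeque<>(); // skip count saved per group
        int uc = 1;      // assumed default skip count
        int toSkip = 0;  // fallback chars still to discard
        int i = 0, n = rtf.length();
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '{') {
                ucStack.push(uc);
                i++;
            } else if (c == '}') {
                if (!ucStack.isEmpty()) uc = ucStack.pop();
                toSkip = 0;
                i++;
            } else if (c == '\\' && i + 3 < n && rtf.charAt(i + 1) == '\'') {
                // A hex escape counts as one fallback char.
                if (toSkip > 0) toSkip--;
                else out.append((char) Integer.parseInt(rtf.substring(i + 2, i + 4), 16));
                i += 4;
            } else if (c == '\\' && i + 1 < n && isAsciiLetter(rtf.charAt(i + 1))) {
                int j = i + 1;
                while (j < n && isAsciiLetter(rtf.charAt(j))) j++;
                String word = rtf.substring(i + 1, j);
                int k = j;
                if (k + 1 < n && rtf.charAt(k) == '-' && isAsciiDigit(rtf.charAt(k + 1))) k++;
                while (k < n && isAsciiDigit(rtf.charAt(k))) k++;
                String param = rtf.substring(j, k);
                if (k < n && rtf.charAt(k) == ' ') k++; // delimiter space is not text
                if (word.equals("uc") && !param.isEmpty()) {
                    uc = Integer.parseInt(param);
                } else if (word.equals("u") && !param.isEmpty()) {
                    // The cast keeps the low 16 bits, so negative parameters wrap.
                    out.append((char) Integer.parseInt(param));
                    toSkip = uc;
                }
                i = k;
            } else if (c == '\r' || c == '\n') {
                i++;
            } else {
                if (toSkip > 0) toSkip--;
                else out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

With this sketch, a fragment such as {\rtf1\ansi\uc2 \u24180\'94\'4e} yields the character once, because the two byte escapes after the Unicode escape are discarded instead of being emitted as a second copy.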

Note on issue name: the current name isn't very accurate. The doubling could also occur with European characters; it all depends on how the RTF generator chooses to encode some characters. A better one would be: "RTFParser doubling characters in some RTF files".

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-683:
------------------------------------

    Attachment: TIKA-683.patch

New patch; I think it's ready!  Changes from last patch:

  - Factored out separate source files for the TextExtract, GroupState
    classes

  - Added a few more RTF test cases

  - Added optional loading of ICU4J's Charset impl, if available; I
    did this in CharsetUtils.forName

  - Removed dup test cases from TestParsers (they were already
    previously copied to RTFParserTest)
 
  - Cleaned up confusing interleaved bytes/chars buffering in the
    parser

  - Added balanced tag asserts to SafeContentHandler; this helped me
    fix the RTFParser, however, other parsers seem to trip the assert
    (do not produce balanced start/end elements).  I didn't dig into
    this, and commented out the asserts; I'll open a separate issue to
    pursue that.
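
A minimal sketch of the kind of balanced-tag check mentioned in the last item, assuming a plain SAX handler that keeps a stack of open element names. BalancedTagChecker is a hypothetical class, not the code added to SafeContentHandler, and it throws SAXException where the patch is described as using asserts.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BalancedTagChecker extends DefaultHandler {

    // Names of the elements that have been started but not yet ended.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // The element being closed must be the most recently opened one.
        if (open.isEmpty() || !open.peek().equals(qName)) {
            throw new SAXException("Mismatched end element </" + qName
                    + ">, currently open: " + open);
        }
        open.pop();
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed elements at end of document: " + open);
        }
    }
}
```

A parser that emits a start for "b" inside "p" and then ends "p" first would fail the check at that endElement call, which is the sort of mismatch this kind of validation is meant to surface.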


> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086380#comment-13086380 ] 

Chris A. Mattmann commented on TIKA-683:
----------------------------------------

Guys, I see there is a patch from Cristian (looks like the code update) and one from Mike (the test case). Are we seeing that this resolves the issue? If so, I can commit it, with the test case update from Mike (+Robert), and the sample files, but wanted to check first. I have some free cycles, but by no means am a UTF expert, nor a non-european character expert. I'm just willing to help get these committed, and then let you experts tell me whether it works or not :)

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-683:
------------------------------------

    Attachment: TIKA-683.patch

Attached patch, with a first cut at using a simple (shallow) tokenizer
to interpret the specific RTF control words that determine what text
is rendered.  I built this using the 1.9.1 RTF specification:

  http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=10725

It's still rough (many nocommits) but I think it's close.  All tests
pass, including a few new RTF test cases I've added.

I just created a custom tokenizer (the allowed RTF tokens are very
simple) and shallow parser.  I think later we can/should cut over to a
"real" tokenizer/parser (eg JFlex)...
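
To illustrate how simple the token set is, here is a minimal sketch of a shallow RTF tokenizer. It is not the code in the patch; RtfTokenizerSketch and its string-based token representation are hypothetical. It assumes four token kinds: group start, group end, control word (letters plus an optional signed numeric parameter) or two-character control symbol, and plain text.

```java
import java.util.ArrayList;
import java.util.List;

public class RtfTokenizerSketch {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Splits RTF into "{", "}", control words/symbols, and "TEXT:..." tokens. */
    public static List<String> tokenize(String rtf) {
        List<String> tokens = new ArrayList<>();
        int i = 0, n = rtf.length();
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '{' || c == '}') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '\\') {
                if (i + 1 >= n) {
                    i++; // lone trailing backslash: ignore
                } else if (isAsciiLetter(rtf.charAt(i + 1))) {
                    int j = i + 1;
                    while (j < n && isAsciiLetter(rtf.charAt(j))) j++;
                    if (j + 1 < n && rtf.charAt(j) == '-' && isAsciiDigit(rtf.charAt(j + 1))) j++;
                    while (j < n && isAsciiDigit(rtf.charAt(j))) j++;
                    tokens.add(rtf.substring(i, j));
                    if (j < n && rtf.charAt(j) == ' ') j++; // delimiter space is not text
                    i = j;
                } else {
                    tokens.add(rtf.substring(i, i + 2)); // control symbol
                    i += 2;
                }
            } else if (c == '\r' || c == '\n') {
                i++; // raw line breaks carry no meaning here
            } else {
                int j = i;
                while (j < n && "{}\\\r\n".indexOf(rtf.charAt(j)) < 0) j++;
                tokens.add("TEXT:" + rtf.substring(i, j));
                i = j;
            }
        }
        return tokens;
    }
}
```

A shallow parser can then walk this token list, tracking group depth and reacting only to the handful of control words that affect which text is rendered.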

The new parser does a better job at extracting some doc structure; the
current parser just makes a single paragraph, but the new one makes a
paragraph whenever the doc said there was one.  It doesn't yet give
structure for tables or lists, though it does extract their text.

It finds text that the old parser missed, eg footnotes, hyperlink,
header/footer, text inside a picture, and [generally] does not add
extra whitespace (the old one sometimes breaks a word by inserting a
space).  Finally the new parser fixes the unicode character doubling
(this issue)...

One thing I still have to fix is that it can output mis-matched tags
for i/b styles (spookily nothing failed; maybe we should add simple
validation (under asserts) eg to XHTMLContentHandler?).


> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095978#comment-13095978 ] 

Michael McCandless commented on TIKA-683:
-----------------------------------------

Thanks Jukka!  That's a good idea to move the ExtractRTFText class out; I'll do that.

I'll mull how to assert the sax start/end elements are valid...

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters


[jira] [Assigned] (TIKA-683) RTF Parser issues with non european characters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-683:
---------------------------------------

    Assignee: Michael McCandless  (was: Chris A. Mattmann)

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters
