Posted to dev@tika.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2011/09/02 00:22:09 UTC

[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095644#comment-13095644 ] 

Uwe Schindler commented on TIKA-683:
------------------------------------

SAX handling does not validate element names, for example that the opening and closing elements match. The serializer in most cases just outputs the elements that are reported to it, and some serializers will go crazy on such input :-)

The reason is that SAX is seldom used to generate XML documents; more often it is XML parsers reporting the elements they found in an XML document. Those parsers do the validation beforehand, so in theory your parser must do this as well. For speed reasons there are no checks in serializers. You can enforce checks by piping the whole stream through the javax.xml.validation API, but this would also check against a schema, which does not really exist for XHTML.
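
If only the open/close matching matters and no schema is available, a lighter check is possible. The following is a minimal sketch, not something from Tika or the patch: BalanceCheckingHandler is a hypothetical name, and it assumes every event carries a non-empty qName. It uses the standard org.xml.sax.helpers.XMLFilterImpl to forward events to a downstream handler while keeping a stack of open elements.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical helper (not part of Tika): rejects unbalanced SAX element events. */
final class BalanceCheckingHandler extends XMLFilterImpl {
    // Qualified names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    BalanceCheckingHandler(ContentHandler downstream) {
        // XMLFilterImpl forwards every event to this handler.
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        open.push(qName);
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        // Assumption: elements are identified by qName only.
        String expected = open.poll(); // null if nothing is open
        if (!qName.equals(expected)) {
            throw new SAXException(
                "endElement(" + qName + ") does not match open element " + expected);
        }
        super.endElement(uri, local, qName);
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed element " + open.peek());
        }
        super.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        BalanceCheckingHandler h = new BalanceCheckingHandler(new DefaultHandler());
        h.startDocument();
        h.startElement("", "b", "b", new AttributesImpl());
        try {
            h.endElement("", "i", "i"); // mismatched close
        } catch (SAXException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In a test, the wrapper could sit between the parser and the serializer so that a mismatched close fails fast instead of producing broken output. The per-event stack operation costs something, which is presumably why serializers skip it.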

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters
