Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2011/02/16 17:53:24 UTC

[jira] Commented: (TIKA-469) The Parser is not correctly outputting Arabic text documents

    [ https://issues.apache.org/jira/browse/TIKA-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995382#comment-12995382 ] 

Ken Krugler commented on TIKA-469:
----------------------------------

Hi Robert - do you have an example of an HTML file?

I'm asking because if an HTML document is encoded as UTF-8, the only reasons I can think of for the character encoding to be messed up are that (a) the HTML meta tag uses an encoding name that isn't supported by Java, or (b) there is no charset specified in the response header or the HTML meta tags, and the algorithmic detection of the character encoding is also failing.
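
For case (a), a quick way to check is to ask the JVM whether it recognizes the declared charset name. The following is only a rough sketch using the standard java.nio.charset API, not Tika's own detection code; the class name CharsetCheck, the resolve helper, and the fallback choice are hypothetical names made up for illustration. The main method also shows how a wrong charset can plausibly produce the two symptoms in the report (garbled characters and question marks).

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; this is not part of Tika.
public class CharsetCheck {

    /**
     * Returns the charset named by declaredName if this JVM supports it,
     * otherwise the given fallback (also used for null, empty, or
     * syntactically illegal names).
     */
    static Charset resolve(String declaredName, Charset fallback) {
        if (declaredName == null || declaredName.trim().isEmpty()) {
            return fallback; // case (b): nothing was declared
        }
        String name = declaredName.trim();
        try {
            // isSupported throws for illegal names, returns false for unknown ones
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback; // case (a): the name is not even well-formed
        }
    }

    public static void main(String[] args) {
        // Three Arabic letters, written as escapes so the source file encoding does not matter
        String arabic = "\u062D\u0645\u0649";

        // Decoding UTF-8 bytes with the wrong charset gives unreadable characters
        byte[] utf8 = arabic.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("Wrong decoder: " + garbled.length() + " chars instead of " + arabic.length());

        // Encoding to a charset that cannot represent Arabic substitutes '?' for each character
        String questionMarks = new String(arabic.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);
        System.out.println("Lossy encoder: " + questionMarks);

        // A declared name the JVM does not know falls back to the default we pass in
        System.out.println(resolve("no-such-charset-x", StandardCharsets.UTF_8));
    }
}
```

If resolve returns the fallback for the name found in the meta tag, that would point at case (a); if there is no name at all, the problem is more likely in the detection step of case (b).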

Thanks,

-- Ken

> The Parser is not correctly outputting Arabic text documents
> ------------------------------------------------------------
>
>                 Key: TIKA-469
>                 URL: https://issues.apache.org/jira/browse/TIKA-469
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows XP
>            Reporter: Robert Cullen
>         Attachments: TEST_WORD.doc, fever_factsheet_arabic.pdf
>
>
> The parser is not preserving the character encoding when parsing documents in Arabic UTF-8, specifically with .pdf and .doc.  The resulting character output is undecipherable or just question-mark symbols.
