Posted to dev@tika.apache.org by "Christopher Creutzig (JIRA)" <ji...@apache.org> on 2017/02/02 08:04:51 UTC
[jira] [Commented] (TIKA-2257) Arabic vowel marks displaced when reading from PDF
[ https://issues.apache.org/jira/browse/TIKA-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849613#comment-15849613 ]
Christopher Creutzig commented on TIKA-2257:
--------------------------------------------
[~tallison@mitre.org], thanks for creating PDFBOX-3674 on my behalf!
> Arabic vowel marks displaced when reading from PDF
> --------------------------------------------------
>
> Key: TIKA-2257
> URL: https://issues.apache.org/jira/browse/TIKA-2257
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Reporter: Christopher Creutzig
> Attachments: mixed-fonts.pdf
>
>
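> The second line of the attached file contains “العَرَبِيَّة”. The file was created on Mac OS X 10.11.6 by printing from Microsoft Word and selecting “Save As PDF” in the system print dialog.
> When the text is read from the PDF file, the short-a vowel marks (fatha, U+064E, marked with underscores below) are displaced, and the yeh (marked with asterisks below) comes back as a different code point. The extracted text is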
> U+0627 U+0644 _U+064E_ U+0639 _U+064E_ U+0631 U+0628 U+0650 *U+06CC* U+0651 _U+064E_ U+0629 instead of the expected
> U+0627 U+0644 U+0639 _U+064E_ U+0631 _U+064E_ U+0628 U+0650 *U+064A* _U+064E_ U+0651 U+0629 (الَعَربِیَّة instead of العَرَبِيَّة).
> Here is the (MATLAB) code used for reading:
> % fullname holds the path of the PDF file to read
> handler = org.apache.tika.sax.ToXMLContentHandler;
> parser = org.apache.tika.parser.AutoDetectParser;
> metadata = org.apache.tika.metadata.Metadata;
> fh = java.io.FileInputStream(fullname);
> parser.parse(fh, handler, metadata); % writes the extracted content to handler
> s = string(handler.toString);
> fh.close;
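The two code point sequences quoted in the description can be compared mechanically. The following is only a minimal sketch in plain Java, using the standard library and no Tika classes; the class name `CodePointDump` and its helper methods are hypothetical, and the two string literals are copied from the sequences listed in the description.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Render each code point of s as U+XXXX, separated by single spaces.
    public static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Index of the first code point at which a and b differ, or -1 if they are equal.
    public static int firstDifference(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            if (x[i] != y[i]) {
                return i;
            }
        }
        return x.length == y.length ? -1 : n;
    }

    public static void main(String[] args) {
        // Sequence reported as returned by the extraction (from the issue description).
        String returned = "\u0627\u0644\u064E\u0639\u064E\u0631\u0628\u0650\u06CC\u0651\u064E\u0629";
        // Sequence the reporter expected (from the issue description).
        String expected = "\u0627\u0644\u0639\u064E\u0631\u064E\u0628\u0650\u064A\u064E\u0651\u0629";

        System.out.println("returned: " + dump(returned));
        System.out.println("expected: " + dump(expected));
        System.out.println("first difference at index " + firstDifference(returned, expected));
    }
}
```

For the two sequences above, the first mismatch is at index 2, where the returned text has U+064E and the expected text has U+0639. In practice, `returned` would be replaced by the text actually extracted from the PDF.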
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)