Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/01/31 20:27:51 UTC

[jira] (TIKA-2257) Arabic vowel marks displaced when reading from PDF

    [ https://issues.apache.org/jira/browse/TIKA-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847451#comment-15847451 ] 

Tim Allison commented on TIKA-2257:
-----------------------------------

This may be an issue with PDFBox.  Please open an issue on PDFBox's Jira and link to this issue.

This is what I see:

||Character in PDF||ASCII (hex)||Unicode Mapping||Description||
|!|21|fe94|...|
|"|22|fc60|...|
|#|23|fbff|...|
|$|24|0650|...|
|%|25|fe91|...|
|&|26|064e|fatha|
|'|27|feae|reh final|
|&|26|064e|fatha|
|(|28|fecc|ain (medial)|
|)|29|fedf|lam (initial)|
|*|2a|0627|alef|

Read the table from bottom to top. In the PDF, the fatha is positioned after the ain, which is the correct order.
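
As a quick check of the bottom-to-top reading, here is a minimal Java sketch that reverses the Unicode mappings from the table above. The class and method names are hypothetical, the array is copied by hand from the table, and nothing here calls PDFBox or Tika.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TableOrder {
    // Unicode mappings copied from the table above, top row first.
    static final int[] TABLE_ORDER = {
        0xFE94, 0xFC60, 0xFBFF, 0x0650, 0xFE91,
        0x064E, 0xFEAE, 0x064E, 0xFECC, 0xFEDF, 0x0627
    };

    // Returns the mappings formatted as U+XXXX, last table row first.
    static List<String> bottomToTop() {
        List<String> out = new ArrayList<>();
        for (int cp : TABLE_ORDER) {
            out.add(String.format("U+%04X", cp));
        }
        Collections.reverse(out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", bottomToTop()));
    }
}
```

The printed sequence starts with U+0627 U+FEDF U+FECC U+064E, that is alef, lam, ain, fatha, so the fatha follows the ain.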

In PDFTextStripper's {{handleDirection()}}, the flip has already happened:
{noformat}
...
0639
064e
0644
0627
{noformat}

In fact, the flip has already happened by the time {{normalizeWord}} runs. More digging by someone more familiar with the PDFBox code is necessary.
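
When tracing where the order changes, it may help to dump code points at each stage and compare them with the expected text. The sketch below does this for the two sequences given in the issue description quoted below. The class and helper names are hypothetical and the code uses only the Java standard library, so the strings would have to be captured from PDFBox separately.

```java
class CodePointDiff {
    // Formats each code point of s as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Returns the code point index of the first mismatch, or -1 if equal.
    static int firstDifference(String expected, String actual) {
        int[] e = expected.codePoints().toArray();
        int[] a = actual.codePoints().toArray();
        int n = Math.min(e.length, a.length);
        for (int i = 0; i < n; i++) {
            if (e[i] != a[i]) return i;
        }
        return e.length == a.length ? -1 : n;
    }

    public static void main(String[] args) {
        // Sequences taken from the issue description.
        String expected = "\u0627\u0644\u0639\u064E\u0631\u064E\u0628\u0650\u064A\u064E\u0651\u0629";
        String actual = "\u0627\u0644\u064E\u0639\u064E\u0631\u0628\u0650\u06CC\u0651\u064E\u0629";
        System.out.println("expected: " + dump(expected));
        System.out.println("actual:   " + dump(actual));
        System.out.println("first difference at index " + firstDifference(expected, actual));
    }
}
```

For these two strings the first mismatch is at index 2 (counting from 0), where the fatha appears in place of the ain.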


> Arabic vowel marks displaced when reading from PDF
> --------------------------------------------------
>
>                 Key: TIKA-2257
>                 URL: https://issues.apache.org/jira/browse/TIKA-2257
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Christopher Creutzig
>         Attachments: mixed-fonts.pdf
>
>
> The attached file, in its second line, contains “العَرَبِيَّة”. It was created on Mac OS X 10.11.6 by selecting “Save As PDF” from the system print dialog started from Microsoft Word.
> Reading the text from the PDF file, the short a vowel marks are displaced, returning
> U+0627 U+0644 _U+064E_ U+0639 _U+064E_ U+0631 U+0628 U+0650 *U+06CC* U+0651 _U+064E_ U+0629 instead of the expected
> U+0627 U+0644 U+0639 _U+064E_ U+0631 _U+064E_ U+0628 U+0650 *U+064A* _U+064E_ U+0651 U+0629 (الَعَربِیَّة instead of العَرَبِيَّة).
> Here is the (MATLAB) code used for reading:
>   handler = org.apache.tika.sax.ToXMLContentHandler;
>   parser = org.apache.tika.parser.AutoDetectParser;
>   metadata = org.apache.tika.metadata.Metadata;
>   fh = java.io.FileInputStream(fullname);
>   parser.parse(fh, handler, metadata);
>   s = string(handler.toString);
>   fh.close;


