Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2008/09/19 22:08:44 UTC

[jira] Updated: (PDFBOX-377) Incorrect direction of extracted Arabic Text

     [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier updated PDFBOX-377:
---------------------------------

    Attachment: PDFTextStripper.diff
                hello3.pdf

Example file and diff against trunk.

> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>         Attachments: hello3.pdf, PDFTextStripper.diff
>
>
> Arabic text (like text in other right-to-left languages) is stored in presentation (visual) order in PDF files, which is the opposite of the logical order in which Arabic text is typically stored. In logical order, the first character stored is the right-most character on the line, but in the output of PDFBox the first character is always the left-most one.
> Further, PDF files typically store the presentation forms of Arabic characters instead of the more general forms, for example U+FB50 instead of U+0671. Presentation forms are not supposed to appear in logically ordered text, but PDFBox does not normalize them away.
> The attached patch solves both of these problems using the ICU4J library (http://www.icu-project.org/). It identifies the dominant text direction of each page and reverses the order of each line (only if any right-to-left text exists). It then normalizes the text to remove the presentation forms.
> An example file is attached.  Without the patch, the following is (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World. 
> With the patch, the following is (correctly) produced:
> Hello محمد World. 
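
For illustration only, the two steps described above (reordering, then normalizing away presentation forms) can be sketched as follows. This is not the attached PDFTextStripper.diff. It is a simplified stand-in under stated assumptions: it uses only the JDK (java.text.Normalizer and Character.getDirectionality) instead of ICU4J, it reverses each contiguous run of right-to-left characters rather than determining a dominant direction per page, and it only handles characters in the Basic Multilingual Plane. The class and method names (VisualToLogical, toLogical, isRtl) are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or of the attached patch.
final class VisualToLogical {

    // True if the character has strong right-to-left directionality
    // (Hebrew-style R or Arabic-style AL). BMP characters only.
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Simplification: reverse each maximal run of RTL characters so the run
    // goes from visual to logical order, then apply NFKC so that Arabic
    // presentation forms (e.g. U+FB50) map to their general forms (U+0671).
    static String toLogical(String visual) {
        StringBuilder out = new StringBuilder(visual.length());
        int i = 0;
        while (i < visual.length()) {
            if (isRtl(visual.charAt(i))) {
                int start = i;
                while (i < visual.length() && isRtl(visual.charAt(i))) {
                    i++;
                }
                out.append(new StringBuilder(visual.substring(start, i)).reverse());
            } else {
                out.append(visual.charAt(i));
                i++;
            }
        }
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The Arabic word from hello3.pdf as extracted: presentation forms,
        // left-to-right visual order (written with escapes for clarity).
        String extracted = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        System.out.println(toLogical(extracted));
    }
}
```

Because spaces are direction-neutral, this sketch reverses each Arabic word separately and would leave the word order of a multi-word Arabic phrase unchanged. A full solution needs a real bidirectional reordering step, which is presumably why the patch relies on ICU4J.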
