
Posted to dev@pdfbox.apache.org by "Ahmet (JIRA)" <ji...@apache.org> on 2014/09/26 09:45:33 UTC

[jira] [Updated] (PDFBOX-2382) Arabic compound words are displayed incorrectly

     [ https://issues.apache.org/jira/browse/PDFBOX-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmet updated PDFBOX-2382:
--------------------------
    Attachment: arabicDoc2.pdf
                arabicDoc2.doc

> Arabic compound words are displayed incorrectly
> -----------------------------------------------
>
>                 Key: PDFBOX-2382
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2382
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>         Environment: Windows 7, NetBeans 8.0, Java 8
>            Reporter: Ahmet
>         Attachments: arabicDoc2.doc, arabicDoc2.pdf
>
>
> Hi, I am using PDFBox 1.8.6 to convert Arabic PDF files (real text, not scanned images of text) to HTML. PDFBox works really well in most cases; however, it has problems recognizing compound characters. I am attaching a sample PDF file. In it, for example, I get &#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610; but I should be getting &#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610; (الأفغاني). PDFBox misses the part highlighted in red (the character &#1571;, أ). The same applies to: &#1575; (PDFBox output) versus &#1575;&#1604;&#1604;&#1607; (الله). Could this have to do with the encodings? I hope you can help me with this matter.
> I know something similar has been reported before, and the conclusion was that the issue is caused by how the PDF file was generated. Is there a way to generate a "correct" PDF file so that PDFBox extracts the text correctly? I created the attached file using OpenOffice 4.0. The original document is in MS Word format and was converted with OpenOffice.
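
For readers who want to see exactly which character is lost, the two entity strings quoted in the report can be decoded and compared code point by code point. The following is only a minimal sketch in plain Java with no PDFBox dependency; the class name EntityDiff and its two helper methods are hypothetical names introduced here for illustration and are not part of PDFBox or of the reporter's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
public class EntityDiff {
    // Matches decimal numeric character references such as &#1575;
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d+);");

    /** Replaces decimal numeric character references with the characters they denote. */
    public static String decode(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());                  // text before the reference
            out.appendCodePoint(Integer.parseInt(m.group(1)));  // the referenced code point
            last = m.end();
        }
        out.append(html, last, html.length());                  // trailing text
        return out.toString();
    }

    /** Returns the index of the first position where the strings differ, or -1 if they are equal. */
    public static int firstDifference(String actual, String expected) {
        int n = Math.min(actual.length(), expected.length());
        for (int i = 0; i < n; i++) {
            if (actual.charAt(i) != expected.charAt(i)) {
                return i;
            }
        }
        return actual.length() == expected.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // The two strings quoted in the report: extracted output vs. expected text.
        String actual = decode("&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;");
        String expected = decode("&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610;");
        int i = firstDifference(actual, expected);
        if (i >= 0 && i < expected.length()) {
            // Prints: first mismatch at index 2: expected U+0623
            System.out.printf("first mismatch at index %d: expected U+%04X%n",
                    i, (int) expected.charAt(i));
        }
    }
}
```

For the strings in the report, the mismatch is at index 2, where the expected text contains U+0623 (ARABIC LETTER ALEF WITH HAMZA ABOVE) and the extracted text does not. This only localizes the missing character; it does not show whether the cause lies in the PDF's font encoding or in the extraction step.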


