Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/09/26 10:40:33 UTC

[jira] [Closed] (PDFBOX-2382) Arabic compound words are displayed incorrectly

     [ https://issues.apache.org/jira/browse/PDFBOX-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-2382.
-------------------------------
    Resolution: Not a Problem

This is a problem with how the PDF file has been generated: the Unicode text has not been correctly embedded. Adobe Acrobat cannot extract this text either.

This is caused by a bug and/or limitation in OpenOffice's PDF exporter; you might want to report it on the OpenOffice Bugzilla. I don't see any relevant open issues, so I can't offer any workarounds, sorry.

> Arabic compound words are displayed incorrectly
> -----------------------------------------------
>
>                 Key: PDFBOX-2382
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2382
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>         Environment: Windows 7, NetBeans 8.0, Java 8
>            Reporter: Ahmet
>         Attachments: arabicDoc2.doc, arabicDoc2.pdf
>
>
> Hi, I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not images of text) to HTML. PDFBox works really well in most cases; however, it has problems recognizing compound characters. I am attaching a sample PDF file. In it, for example, I get &#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610; but I should be getting &#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610; (الأفغاني). PDFBox misses the part highlighted in red. The same applies to: &#1575; (PDFBox output) --- &#1575;&#1604;&#1604;&#1607; (الله). Could this have to do with the encodings? I hope you can help me with this matter.
> I know something similar was reported before, and the conclusion was that the issue is due to how the PDF file is generated. Is there a way to generate a "correct" PDF file so that PDFBox performs correct text extraction? I created the attached file using OpenOffice 4.0. The original document is in MS Word format and was converted to PDF with OpenOffice.
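
To make the difference between the extracted and the expected strings in the report concrete, here is a minimal sketch in plain Java (it does not use PDFBox). It decodes the decimal character references quoted above and lists the code points that are present in the expected text but absent from the extracted text. The class and method names (EntityDiff, decode, missing) are hypothetical, and the comparison assumes the extracted text is simply the expected text with some characters dropped, which matches the two samples in the report but is not a general diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
class EntityDiff {
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    // Decode decimal numeric character references such as "&#1575;" into text.
    static String decode(String html) {
        Matcher m = NUMERIC_ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    // List the code points of `expected` that are skipped in `actual`.
    // Assumption: `actual` is `expected` with some code points dropped.
    static List<String> missing(String expected, String actual) {
        int[] act = actual.codePoints().toArray();
        List<String> result = new ArrayList<>();
        int j = 0;
        for (int cp : expected.codePoints().toArray()) {
            if (j < act.length && act[j] == cp) {
                j++; // present in both strings
            } else {
                result.add(String.format("U+%04X", cp));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Strings copied from the report above.
        String got = decode("&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;");
        String want = decode("&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610;");
        System.out.println(missing(want, got)); // prints [U+0623]

        String got2 = decode("&#1575;");
        String want2 = decode("&#1575;&#1604;&#1604;&#1607;");
        System.out.println(missing(want2, got2)); // prints [U+0644, U+0644, U+0647]
    }
}
```

For the first sample the only missing code point is U+0623 (ARABIC LETTER ALEF WITH HAMZA ABOVE); for the second, everything after the initial alef is missing.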



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)