
Posted to dev@pdfbox.apache.org by Ahmet Aker <ah...@sheffield.ac.uk> on 2014/09/25 16:26:43 UTC

Arabic compound characters not recognized by pdfbox

Hi,

I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not
scanned images) to HTML. PDFBox works really well in most cases; however,
it has problems recognizing compound characters. I am attaching a sample
PDF file. In it, for example, I get
&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610; but I should be getting
&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610; (الأفغاني).
PDFBox drops the character &#1571; (أ), which was highlighted in red in the
original message. The same applies to:

&#1575; (PDFBox output) --- &#1575;&#1604;&#1604;&#1607; (expected: الله)

Could this have to do with the encodings? I hope you can help me with this
matter.

 

Many thanks,

ahmet
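
To make the report above easier to reproduce, the two entity strings can be compared programmatically. The following is a minimal sketch, not part of PDFBox: the class EntityDiff and its methods codePoints, missing and describe are hypothetical names introduced here for illustration. It assumes the HTML output contains only decimal numeric character references (the &#NNNN; form shown in the message) and uses a simple in-order walk rather than a full diff algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it does not use or belong to PDFBox.
public class EntityDiff {
    // Matches decimal numeric character references such as &#1575;
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d+);");

    /** Extracts the code points of all decimal entities in the given HTML fragment. */
    public static List<Integer> codePoints(String html) {
        List<Integer> result = new ArrayList<>();
        Matcher m = DECIMAL_ENTITY.matcher(html);
        while (m.find()) {
            result.add(Integer.parseInt(m.group(1)));
        }
        return result;
    }

    /**
     * Walks both sequences in order and collects the code points of
     * 'expected' that have no counterpart at the current position of 'actual'.
     * This is a greedy walk, adequate when 'actual' only drops characters.
     */
    public static List<Integer> missing(List<Integer> actual, List<Integer> expected) {
        List<Integer> dropped = new ArrayList<>();
        int i = 0;
        for (int cp : expected) {
            if (i < actual.size() && actual.get(i) == cp) {
                i++;
            } else {
                dropped.add(cp);
            }
        }
        return dropped;
    }

    /** Formats a code point as U+XXXX followed by its Unicode name. */
    public static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // Strings copied from the message above.
        String actual = "&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;";
        String expected = "&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610;";
        for (int cp : missing(codePoints(actual), codePoints(expected))) {
            System.out.println("dropped: " + describe(cp));
        }
    }
}
```

For the first pair of strings, the only dropped code point is 1571, that is U+0623. For the second pair, everything after the initial &#1575; is dropped (1604, 1604, 1607). Listing the dropped code points this way may help when describing the problem in a JIRA issue, since the red highlighting does not survive in plain text.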

RE: Arabic compound characters not recognized by pdfbox

Posted by Ahmet Aker <ah...@sheffield.ac.uk>.

Hi John,
Thanks for coming back to me. I will attach my pdf document to the JIRA.

Thanks,
ahmet


Re: Arabic compound characters not recognized by pdfbox

Posted by John Hewson <jo...@jahewson.com>.

Hi Ahmet

We’re currently looking into a similar problem: https://issues.apache.org/jira/browse/PDFBOX-2259

If you think this is the *exact* same problem that you’re seeing, please attach your PDF file to that JIRA issue. If not, please open a new JIRA issue and attach your file there. (You can attach files in JIRA using More > Attach Files.)

The mailing list does not support file attachments, so we can’t see your file unless it is on JIRA.

Thanks

-- John
