Posted to users@pdfbox.apache.org by Matthew Aguirre <ma...@artistech.com> on 2010/01/26 22:22:58 UTC

Arabic Text

Sorry if you get this twice; I accidentally sent this to the wrong list 
first.

I have been looking around and I saw that the issue with extracted 
Arabic words being written in reverse was fixed, but I'm seeing an issue 
where the words of an extracted Arabic sentence come out in reverse 
order. I assume this is due to Arabic being a right-to-left language. Is 
there any way to detect this and have pdfbox extract the text in the 
correct order?

Expected Arabic Text:
??????? ?????? ?????? ??????? ??????? ??????

Returned Arabic Text:
?????? ?????? ??????? ?????? ????? ???????

I am using the latest version (0.8.0-incubating).
Is there something else that I am missing?
-- 
Matt


Re: Arabic Text

Posted by "Erik Scholtz, ArgonSoft GmbH" <es...@argonsoft.de>.
Matt,

I hope this is the information you need (from the README):

You get text that has the correct characters, but in the wrong order. 
This might be because you have not enabled sorting.  The text in PDF 
files is stored in chunks and the chunks do not need to be stored in the 
order that they are displayed on a page.  By default, PDFBox does not 
sort the text.  Also, if you have text in a language that reads right to 
left (such as Arabic or Hebrew), make sure you have the ICU4J jar file 
in your classpath.  This library is needed to properly handle right to 
left text.
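
As a rough illustration of the right-to-left point above, and not something 
PDFBox itself provides: the JDK class java.text.Bidi can report whether a 
string contains right-to-left characters, which gives one way to detect 
affected text after extraction. The sketch below assumes the extracted line 
arrives with its words in reversed order, as in the report above. The class 
RtlCheck and both helper methods are hypothetical names made up for this 
example, and reversing word order is only a crude workaround, not a 
replacement for enabling sorting and putting ICU4J on the classpath.

```java
import java.text.Bidi;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class for this example; not part of PDFBox.
class RtlCheck {

    // Returns true if the text contains characters that need bidirectional
    // handling (for example Arabic or Hebrew letters).
    static boolean containsRtl(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    // Crude workaround (an assumption, not PDFBox behaviour): reverse the
    // order of whitespace-separated words in a single extracted line.
    static String reverseWordOrder(String line) {
        List<String> words = Arrays.asList(line.trim().split("\\s+"));
        Collections.reverse(words);
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Two Arabic words written with Unicode escapes to avoid
        // source-encoding problems.
        String extracted = "\u0627\u0644\u0639\u0627\u0644\u0645 \u0645\u0631\u062D\u0628\u0627";
        String fixed = containsRtl(extracted) ? reverseWordOrder(extracted) : extracted;
        System.out.println(fixed);
    }
}
```

In this sketch, containsRtl would be called on each line returned by the text 
stripper, and reverseWordOrder applied only when it returns true. Lines that 
mix Arabic with numbers or Latin words would need real bidirectional 
reordering rather than a plain word reversal.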

Cheers,
Erik

