Posted to users@pdfbox.apache.org by Matthew Aguirre <ma...@artistech.com> on 2010/01/26 22:22:58 UTC
Arabic Text
Sorry if you get this twice; I accidentally sent it to the wrong list
first.
I have been looking around and saw that the issue with extracted
Arabic words being written in reverse was fixed, but I'm seeing an issue
where the extracted Arabic text of a whole sentence is in reverse. I assume
this is due to Arabic being a right-to-left language. Is there any way to
detect this and have PDFBox extract the text in the correct order?
Expected Arabic Text:
??????? ?????? ?????? ??????? ??????? ??????
Returned Arabic Text:
?????? ?????? ??????? ?????? ????? ???????
I am using the latest version (0.8.0-incubating).
Is there something else that I am missing?
--
Matt
Re: Arabic Text
Posted by "Erik Scholtz, ArgonSoft GmbH" <es...@argonsoft.de>.
Matt,
I hope this is the information you need (from the README):
You get text that has the correct characters, but in the wrong order.
This might be because you have not enabled sorting. The text in PDF
files is stored in chunks and the chunks do not need to be stored in the
order that they are displayed on a page. By default, PDFBox does not
sort the text. Also, if you have text in a language that reads right to
left (such as Arabic or Hebrew), make sure you have the ICU4J jar file
in your classpath. This library is needed to properly handle right to
left text.
Cheers,
Erik
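
On the "is there any way to detect this" part of the question: the sketch below is not PDFBox code and not taken from the README. It only uses java.text.Bidi from the JDK to check whether a string that has already been extracted contains right-to-left characters, and whether its base direction resolves to right-to-left. The class and method names (RtlCheck, containsRtl, isBaseRtl) are made up for illustration. Inside PDFBox, the sorting mentioned in the README is, as far as I know, switched on with setSortByPosition(true) on PDFTextStripper.

```java
import java.text.Bidi;

// Hypothetical helper, not part of PDFBox: inspects already-extracted text.
public class RtlCheck {

    // True if the text contains any character that requires
    // bidirectional processing (e.g. Arabic or Hebrew letters).
    public static boolean containsRtl(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    // True if the base direction of the text resolves to right-to-left.
    // The base direction is taken from the first strongly directional
    // character; left-to-right is assumed if there is none.
    public static boolean isBaseRtl(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return !bidi.baseIsLeftToRight();
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // an Arabic word
        String latin = "Hello";
        String mixed = "PDF \u0645\u0644\u0641";          // Latin first, then Arabic

        System.out.println("arabic: containsRtl=" + containsRtl(arabic)
                + " isBaseRtl=" + isBaseRtl(arabic));
        System.out.println("latin:  containsRtl=" + containsRtl(latin)
                + " isBaseRtl=" + isBaseRtl(latin));
        System.out.println("mixed:  containsRtl=" + containsRtl(mixed)
                + " isBaseRtl=" + isBaseRtl(mixed));
    }
}
```

A check like this only tells you that right-to-left text is present in the output; it does not reorder anything, so sorting and ICU4J on the classpath are still what the README points to for getting the order right.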