Posted to users@pdfbox.apache.org by Trasca Virgil <vi...@yahoo.com> on 2011/02/11 14:33:57 UTC

Word/character order is not preserved during text extraction

Hi,

 
Has anybody run into this issue before? As the attached screenshot shows, the
original text in the document is

<0>652.5</0> while the extracted text is 652.5<0> </0>. I am using PDFBox 1.4.0.

I get this behavior with both the ExtractText application and the
PDFTextStripper class.

What could be the cause of this? Is there a solution or workaround?

Thanks,
Virgil