Posted to users@pdfbox.apache.org by "Hesham G." <he...@gmail.com> on 2011/04/02 15:24:41 UTC

Wrong extracted text order from a PDF

Hello,

I have a PDF file that I am extracting data from using PDFBox v1.5. If I copy text from it manually, for example "SUPPLY FAN | G0320 B11-14998", and paste it into Notepad, it is copied correctly. But PDFBox reads the same text as "SUPPLY FAN | B11-14998G0320". Many other pieces of text in the file come out in the wrong order in the same way. You can test a 1-page sample PDF here: http://www.4shared.com/document/XDzWQFyY/wrong_extracted_text_sample.html


Best regards,
Hesham

Re: Wrong extracted text order from a PDF

Posted by "Hesham G." <he...@gmail.com>.
Jukka,

As always, sorry for the late reply.
I have just tested this, and with sortByPosition enabled the text is extracted in the correct order.


Best regards,
Hesham

---------------------------------------------
Included message :

> Hi,
> 
> On 04/02/2011 03:24 PM, Hesham G. wrote:
>> I have a PDF file that I am extracting data from it using PDFBox
>> v1.5. If i copy text from it manually like: "SUPPLY FAN | G0320
>> B11-14998" to Notepad, it is copied fine ... But in PDFBox it is read
>> like this: "SUPPLY FAN | B11-14998G0320" ... Many other text does the
>> same thing. You can test a 1 page sample PDF here :
>> http://www.4shared.com/document/XDzWQFyY/wrong_extracted_text_sample.html
> 
> Enabling the sortByPosition option [1] in the text extraction typically 
> helps solve problems like this. See also the equivalent -sort option of 
> the ExtractText command [2].
> 
> [1] 
> http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html#setSortByPosition(boolean)
> [2] http://pdfbox.apache.org/commandlineutilities/ExtractText.html
> 
> --
> Jukka Zitting
>

Re: Wrong extracted text order from a PDF

Posted by Jukka Zitting <jz...@adobe.com>.
Hi,

On 04/02/2011 03:24 PM, Hesham G. wrote:
> I have a PDF file that I am extracting data from it using PDFBox
> v1.5. If i copy text from it manually like: "SUPPLY FAN | G0320
> B11-14998" to Notepad, it is copied fine ... But in PDFBox it is read
> like this: "SUPPLY FAN | B11-14998G0320" ... Many other text does the
> same thing. You can test a 1 page sample PDF here :
> http://www.4shared.com/document/XDzWQFyY/wrong_extracted_text_sample.html

Enabling the sortByPosition option [1] in the text extraction typically 
helps solve problems like this. See also the equivalent -sort option of 
the ExtractText command [2].

[1] 
http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html#setSortByPosition(boolean)
[2] http://pdfbox.apache.org/commandlineutilities/ExtractText.html

--
Jukka Zitting
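
To illustrate why sorting by position can fix the order reported in this thread, here is a minimal Java sketch using only the standard library. It is not PDFBox code and does not reproduce PDFBox's actual algorithm. The Fragment class, the coordinates, and the method names are all hypothetical, and it assumes that fragments on the same line share exactly the same y value and that y grows down the page. The idea is that a PDF may store text fragments in an order that differs from their visual layout, so reading them in stored order can swap neighbouring words, while ordering them by (y, x) restores the reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical text fragment: a string plus the coordinates where it is drawn. */
    static final class Fragment {
        final String text;
        final double x; // horizontal position, growing to the right
        final double y; // vertical position, assumed to grow down the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Made-up sample data: "G0320" is drawn left of "B11-14998" but stored after it. */
    static List<Fragment> sampleFragments() {
        return Arrays.asList(
                new Fragment("SUPPLY FAN |", 50.0, 100.0),
                new Fragment("B11-14998", 200.0, 100.0),
                new Fragment("G0320", 150.0, 100.0));
    }

    /** Joins fragments in the order they are stored, ignoring their positions. */
    static String joinInStreamOrder(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    /** Sorts a copy of the fragments top to bottom, then left to right, and joins them. */
    static String joinSortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return joinInStreamOrder(sorted);
    }

    public static void main(String[] args) {
        List<Fragment> fragments = sampleFragments();
        System.out.println(joinInStreamOrder(fragments));
        System.out.println(joinSortedByPosition(fragments));
    }
}
```

With the made-up sample data, the stored order places "B11-14998" before "G0320", and the position sort puts "G0320" first because its x value is smaller. In PDFBox itself, the corresponding switch is the setSortByPosition(boolean) method on PDFTextStripper linked in [1], or the -sort option of ExtractText [2].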