Posted to users@pdfbox.apache.org by Robert Rodini <rr...@hotmail.com> on 2023/05/19 13:17:13 UTC

Text sequence of ExtractText utility

Hi,
I have successfully used the PDFBox ExtractText utility to process PDFs produced by a third party.  The text of a multicolumn PDF comes out column by column, from the left column to the right, with each column read from top to bottom.

I now have to process PDFs produced by another third party, which also produces multicolumn PDFs.  This time the text comes out in an unpredictable order.

I've read the FAQ https://pdfbox.apache.org/2.0/faq.html regarding "Why does the extracted text appear in the wrong sequence?"

Is there a command-line switch (or some other way) to get the text extracted in the right order?  If not, can I request a CLI switch for the ExtractText utility, and how would I do that?

Thanks,
Bob Rodini

Re: Text sequence of ExtractText utility

Posted by Robert Rodini <rr...@hotmail.com>.
The -sort option did not solve the problem.

I tried the alpha release of PDFBox 3.0 and it produced the same results as the 2.0 version.

Note: Command line parameters are different in PDFBox 3.0.

Bob
________________________________
From: Tilman Hausherr <TH...@t-online.de>
Sent: Friday, May 19, 2023 9:22 AM
To: users@pdfbox.apache.org <us...@pdfbox.apache.org>
Subject: Re: Text sequence of ExtractText utility

Hi,

You can try the "-sort" option. Sometimes this helps.

Tilman

Re: Text sequence of ExtractText utility

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

You can try the "-sort" option. Sometimes this helps.

Tilman
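
A note on why sorting only "sometimes" helps: the sketch below is an illustration, not PDFBox's implementation. It assumes each line of text is reduced to a fragment with an (x, y) position, and it compares a plain position sort (top to bottom, then left to right) with a column-aware ordering. The `Fragment` class and the `columnBoundaryX` parameter are hypothetical names introduced for this example only.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ColumnOrderSketch {

    /** Hypothetical text fragment: one line of text and its position (y grows downward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Plain position sort: top to bottom, then left to right. */
    static List<String> positionOrder(List<Fragment> fragments) {
        Comparator<Fragment> byYThenX = Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x);
        return fragments.stream()
                .sorted(byYThenX)
                .map(f -> f.text)
                .collect(Collectors.toList());
    }

    /** Column-aware order: everything left of the boundary first, each column top to bottom. */
    static List<String> columnOrder(List<Fragment> fragments, double columnBoundaryX) {
        Comparator<Fragment> byColumnThenY = Comparator
                .comparingInt((Fragment f) -> f.x < columnBoundaryX ? 0 : 1)
                .thenComparingDouble(f -> f.y);
        return fragments.stream()
                .sorted(byColumnThenY)
                .map(f -> f.text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two columns with two lines each, listed in a scrambled order.
        List<Fragment> page = List.of(
                new Fragment("R1", 300, 100),
                new Fragment("L2", 50, 120),
                new Fragment("L1", 50, 100),
                new Fragment("R2", 300, 120));

        System.out.println(positionOrder(page));    // [L1, R1, L2, R2]
        System.out.println(columnOrder(page, 200)); // [L1, L2, R1, R2]
    }
}
```

In this toy model the position sort interleaves the two columns line by line, which is one plausible reason a sort switch does not restore reading order on a multicolumn page. In the PDFBox Java API, position sorting is controlled by `PDFTextStripper.setSortByPosition(true)`; extracting each column separately with `PDFTextStripperByArea` is a possible workaround, assuming the column boundaries of the documents are known in advance.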


On 19.05.2023 15:17, Robert Rodini wrote:
> Hi,
> I have successfully used PDFBox ExtractText utility to process PDFs produced by a third-party.  The text comes out of a multicolumn PDF in the left to right order of the columns from top to bottom.
>
> I now have to process PDFs produced by another third-party which also produces a multicolumn PDF.  This time the text comes out in an unpredictable order.
>
> I've read the FAQhttps://pdfbox.apache.org/2.0/faq.html  regarding "Why does the extracted text appear in the wrong sequence?"
>
> I'd like to know if there is a command line switch (or something) that I can do to get the text extracted in the right order?  Can I request an CLI switch to the ExtractText utility?  How to do this?
>
> Thanks,
> Bob Rodini
>