Posted to users@pdfbox.apache.org by Kunal Kashyap <ku...@gmail.com> on 2017/05/29 06:56:48 UTC

Issues regarding PDFBOX

Hi all,
I am trying to read text data from a PDF file using the PDFBox API. I want
to skip all chart data and images in the output .txt file. Can anyone
help me with this? I also want to extract the data with proper alignment.
Attached are the sample PDF file and a sample .txt file (the latter is my
desired output).

Re: Issues regarding PDFBOX

Posted by Tilman Hausherr <TH...@t-online.de>.

On 29.05.2017 at 08:56, Kunal Kashyap wrote:
> I am trying to read text data from a PDF file using the PDFBox API. I
> want to skip all chart data and images in the output .txt file.
> Can anyone help me with this? I also want to extract the data with
> proper alignment.
> Attached are the sample PDF file and a sample .txt file (the latter is
> my desired output).

Please have a look at the ExtractTextByArea.java example in the source 
code download. It shows how to extract text from a predefined area.
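
For illustration, here is a minimal sketch of area-based extraction. It is 
not the shipped example itself. It assumes the PDFBox 2.0.x API 
(PDDocument.load, PDFTextStripperByArea); the class name AreaTextExtractor, 
the region name "body", and the rectangle in main are made up for this 
sketch. Region coordinates are measured from the top-left corner of the 
page, in points.

```java
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// Hypothetical helper class; assumes PDFBox 2.0.x on the classpath.
public class AreaTextExtractor {

    /** Returns the text found inside the given region of one page. */
    public static String extract(PDDocument document, int pageIndex, Rectangle2D region)
            throws IOException {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        // "body" is an arbitrary label used to look the region up again.
        stripper.addRegion("body", region);
        PDPage page = document.getPage(pageIndex);
        stripper.extractRegions(page);
        return stripper.getTextForRegion("body");
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            // Assumed region: the top 300 points of a US Letter page.
            Rectangle2D region = new Rectangle2D.Float(0, 0, 612, 300);
            System.out.println(extract(document, 0, region));
        }
    }
}
```

Choosing a region that leaves out the part of the page where a chart sits 
is one way to keep its labels out of the output, provided the layout is 
the same on every page.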

There is no way in PDF to "exclude tables" because there is no table 
concept in PDF like in HTML. It's just a bunch of lines with text. You 
would need heuristics to guess what's a table and what isn't.
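
To make "heuristics" a bit more concrete, here is one possible sketch in 
plain Java with no PDFBox dependency. It assumes you have already grouped 
the text into lines of fragments with known horizontal extents, and it 
guesses that a line is a table row when it has at least two wide gaps, 
i.e. three or more columns. The Fragment type, the method names, and the 
gap threshold are all invented for this example; real documents will 
likely need something more robust.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical heuristic, not part of PDFBox.
public class TableRowHeuristic {

    /** A piece of text on one line with its horizontal extent in page units. */
    public static final class Fragment {
        final double startX;
        final double endX;
        final String text;

        public Fragment(double startX, double endX, String text) {
            this.startX = startX;
            this.endX = endX;
            this.text = text;
        }
    }

    /** Counts gaps wider than minGap between neighbours; the line must be sorted left to right. */
    static int countWideGaps(List<Fragment> line, double minGap) {
        int gaps = 0;
        for (int i = 1; i < line.size(); i++) {
            if (line.get(i).startX - line.get(i - 1).endX > minGap) {
                gaps++;
            }
        }
        return gaps;
    }

    /** Guesses that a line with two or more wide gaps is a table row. */
    public static boolean looksLikeTableRow(List<Fragment> line, double minGap) {
        return countWideGaps(line, minGap) >= 2;
    }

    /** Keeps only the lines that do not look like table rows. */
    public static List<List<Fragment>> dropTableRows(List<List<Fragment>> lines, double minGap) {
        List<List<Fragment>> kept = new ArrayList<>();
        for (List<Fragment> line : lines) {
            if (!looksLikeTableRow(line, minGap)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

A check like this will misfire on justified text with wide word spacing 
and on two-column tables, which is why it can only ever be a guess.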

As for the text order, use the setSortByPosition() method.
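
A minimal sketch of that call, assuming the PDFBox 2.0.x API. The class 
name SortedTextExtractor is made up for this example. With sorting 
enabled, the stripper orders text by its position on the page rather than 
by the order in which it appears in the content stream.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class; assumes PDFBox 2.0.x on the classpath.
public class SortedTextExtractor {

    /** Extracts all text from the document, ordered by page position. */
    public static String extract(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        // Off by default; turn it on so the output follows the visual layout.
        stripper.setSortByPosition(true);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            System.out.print(extract(document));
        }
    }
}
```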

If you want exact positions of everything, have a look at the 
PrintTextLocations.java example.
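
In the same spirit as that example, here is a rough sketch of collecting 
positions by overriding writeString() in a PDFTextStripper subclass. It 
assumes the PDFBox 2.0.x API; the class name GlyphLocationCollector and 
the output format are invented for this sketch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical subclass; assumes PDFBox 2.0.x on the classpath.
public class GlyphLocationCollector extends PDFTextStripper {

    private final List<String> locations = new ArrayList<>();

    public GlyphLocationCollector() throws IOException {
        super();
    }

    /** Called by the stripper for each run of text; records every glyph's position. */
    @Override
    protected void writeString(String text, List<TextPosition> textPositions)
            throws IOException {
        for (TextPosition position : textPositions) {
            locations.add(String.format(Locale.ROOT, "%s x=%.1f y=%.1f",
                    position.getUnicode(),
                    position.getXDirAdj(),
                    position.getYDirAdj()));
        }
    }

    /** Runs the stripper over the document and returns one entry per glyph. */
    public static List<String> collect(PDDocument document) throws IOException {
        GlyphLocationCollector collector = new GlyphLocationCollector();
        // The returned string is ignored; writeString() above does the work.
        collector.getText(document);
        return collector.locations;
    }
}
```

With the coordinates in hand, you can rebuild column alignment yourself, 
for example by padding each line with spaces in proportion to x.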

Tilman