Posted to users@pdfbox.apache.org by Kunal Kashyap <ku...@gmail.com> on 2017/05/29 06:56:48 UTC
Issues regarding PDFBOX
Hi All,
I am trying to read text data from a PDF file using the PDFBox API. I want
to skip all the chart data and images in the output .txt file. Can anyone
help me with this? I also want to extract the data with proper alignment.
Attached are the sample PDF file and a sample .txt file (this is my desired
output).
Re: Issues regarding PDFBOX
Posted by Tilman Hausherr <TH...@t-online.de>.
On 29.05.2017 at 08:56, Kunal Kashyap wrote:
> I am trying to read text data from a PDF file using the PDFBox API. I
> want to skip all the chart data and images in the output .txt file.
> Can anyone help me with this? I also want to extract the data with
> proper alignment.
> Attached are the sample PDF file and a sample .txt file (this is my
> desired output).
Please have a look at the ExtractTextByArea.java example in the source
code download. It shows how to extract text from a predefined area.
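A minimal sketch of area-based extraction, assuming PDFBox 2.0.x (in 3.x,
PDDocument.load was replaced by Loader.loadPDF). The class name, the region
label "body" and the rectangle values are made up for illustration; the
rectangle is in points and, as far as the shipped example suggests, measured
from the top-left corner of the page. The ExtractTextByArea.java example
remains the authoritative version.

```java
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

final class AreaTextSketch {

    /** Returns the text found inside the given rectangle on one page. */
    static String textInArea(PDDocument document, int pageIndex, Rectangle2D area)
            throws IOException {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        // "body" is an arbitrary label chosen for this sketch.
        stripper.addRegion("body", area);
        PDPage page = document.getPage(pageIndex);
        stripper.extractRegions(page);
        return stripper.getTextForRegion("body");
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF; the rectangle is a placeholder
        // that would have to be tuned to skip the charts in a real file.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            Rectangle2D area = new Rectangle2D.Double(50, 50, 500, 300);
            System.out.print(textInArea(document, 0, area));
        }
    }
}
```

Text drawn outside the rectangle, such as chart labels, is simply not
returned, so the approach only works when the wanted text sits in a known
part of the page.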
There is no way in PDF to "exclude tables" because there is no table
concept in PDF like in HTML. It's just a bunch of lines with text. You
would need heuristics to guess what's a table and what isn't.
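One possible heuristic, sketched in plain Java with no PDFBox dependency:
treat two adjacent text lines as table rows when they share at least two
column start positions. This is an illustrative assumption, not something
PDFBox provides. The input is assumed to be, for each line, the x
coordinates where its separate text chunks begin; how those are obtained
from the PDF is left open, and the class and method names are hypothetical.

```java
import java.util.List;

final class TableHeuristicSketch {

    /** Counts how many values in a have a partner in b within the tolerance. */
    static int sharedStarts(double[] a, double[] b, double tolerance) {
        int shared = 0;
        for (double x : a) {
            for (double y : b) {
                if (Math.abs(x - y) <= tolerance) {
                    shared++;
                    break;
                }
            }
        }
        return shared;
    }

    /**
     * Flags lines that look like table rows: a line is flagged when it and
     * a neighbouring line share two or more chunk start positions.
     */
    static boolean[] flagTableRows(List<double[]> chunkStartsPerLine, double tolerance) {
        boolean[] tableLike = new boolean[chunkStartsPerLine.size()];
        for (int i = 0; i + 1 < tableLike.length; i++) {
            double[] upper = chunkStartsPerLine.get(i);
            double[] lower = chunkStartsPerLine.get(i + 1);
            if (sharedStarts(upper, lower, tolerance) >= 2) {
                tableLike[i] = true;
                tableLike[i + 1] = true;
            }
        }
        return tableLike;
    }
}
```

Ordinary paragraph lines share only the left margin, so they are not
flagged. Multi-column page layouts would be misclassified, which is the
kind of guesswork any such heuristic involves.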
Regarding order, call setSortByPosition(true) on the PDFTextStripper.
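A short sketch of what that looks like, assuming PDFBox 2.0.x (the class
name is made up for illustration). With sorting enabled, text is emitted
by its position on the page rather than in the order it appears in the
content stream.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

final class SortedTextSketch {

    /** Extracts all text, ordered by position on the page. */
    static String extractSorted(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF file.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            System.out.print(extractSorted(document));
        }
    }
}
```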
If you want exact positions of everything, have a look at the
PrintTextLocations.java example.
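A reduced sketch in the spirit of that example, assuming PDFBox 2.0.x: it
subclasses PDFTextStripper and overrides writeString to record where each
piece of text starts. The class name and the tab-separated row format are
choices made for this sketch only; the shipped PrintTextLocations.java
prints considerably more detail per character.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

final class TextLocationSketch extends PDFTextStripper {

    /** One row per piece of text: x, y, and the text itself. */
    final List<String> rows = new ArrayList<>();

    TextLocationSketch() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions)
            throws IOException {
        if (!textPositions.isEmpty()) {
            // Position of the first character of this piece of text.
            TextPosition first = textPositions.get(0);
            rows.add(String.format(Locale.ROOT, "%.1f\t%.1f\t%s",
                    first.getXDirAdj(), first.getYDirAdj(), text));
        }
        super.writeString(text, textPositions);
    }

    /** Runs the stripper over the document and returns the recorded rows. */
    static List<String> locate(PDDocument document) throws IOException {
        TextLocationSketch stripper = new TextLocationSketch();
        stripper.setSortByPosition(true);
        // getText drives the writeString callbacks; the plain text is ignored here.
        stripper.getText(document);
        return stripper.rows;
    }
}
```

With the coordinates available, the text can be laid out in columns or
filtered by region to approximate the alignment of the original page.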
Tilman