Posted to users@pdfbox.apache.org by David Hoffer <dh...@gmail.com> on 2010/12/29 05:26:01 UTC

Context aware text extraction

Hi, I'm new to PDFBox and need to do PDF text extraction, but the standard
PDFTextStripper behavior isn't what I need.  The problem with
PDFTextStripper is that it left-aligns all the output, so you have no way of
knowing which horizontal position the text came from.

I have to extract text from (small) tables within the document and I need to
know which table the data came from.  A simple example might be:

Table 1        Table2
1 2 3 4         1 2 3 4
                   5 6 7 8

PDFTextStripper can output all 3 rows of this document, and rows 1 & 2 are
fine, but it will left-align row 3, so there is no way of knowing that it was
part of Table 2 and not Table 1.

What I can't show here is that there is table formatting (rectangles) around
all the tables.

How can I use PDFBox to extract the data while keeping it context aware?
Ideally I would get each table (I know the text at the top of each one) and
then extract its text the way PDFTextStripper does.

What's the best way to do this?

-Dave

Re: Context aware text extraction

Posted by David Hoffer <dh...@gmail.com>.
Yeah, I'm having some luck... it's not elegant but it's working.

What I'm doing is looking for the table header text and finding its
starting and ending X positions.  Then, because I know roughly how wide my
table is, I extract all the subsequent rows that are within this X range.
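
A minimal sketch of that X-range filter, for illustration only.  It assumes
each line of text has already been captured together with its position; the
Line class, the rowsUnderHeader method and the sample coordinates are all
hypothetical and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class TableColumnFilter {

    // Hypothetical holder for one line of text and where it starts on the page.
    // This is an illustration type, not a PDFBox class.
    static final class Line {
        final String text;
        final float x; // left edge of the line
        final float y; // distance from the top of the page

        Line(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Collects lines below the header whose left edge falls inside
    // [header.x, header.x + tableWidth], stopping after maxRows matches.
    // Assumes the lines are supplied in top-to-bottom order.
    static List<String> rowsUnderHeader(List<Line> lines, Line header,
                                        float tableWidth, int maxRows) {
        float minX = header.x;
        float maxX = header.x + tableWidth;
        List<String> rows = new ArrayList<>();
        for (Line line : lines) {
            if (rows.size() >= maxRows) {
                break; // caller-supplied limit, since the table's end is unknown
            }
            if (line.y <= header.y) {
                continue; // the header itself, or something above it
            }
            if (line.x >= minX && line.x <= maxX) {
                rows.add(line.text);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up coordinates that mirror the two-table example.
        Line table1 = new Line("Table 1", 50f, 100f);
        Line table2 = new Line("Table2", 200f, 100f);
        List<Line> lines = List.of(
                table1, table2,
                new Line("1 2 3 4", 50f, 115f),
                new Line("1 2 3 4", 200f, 115f),
                new Line("5 6 7 8", 200f, 130f));
        System.out.println(rowsUnderHeader(lines, table1, 100f, 10));
        System.out.println(rowsUnderHeader(lines, table2, 100f, 10));
    }
}
```

With these made-up coordinates the "5 6 7 8" row is attributed to the second
table only, which is the information the left-aligned output loses.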

It still has several issues that are not ideal:
- Sometimes the table header (something that makes a unique string to look
for) spans two or three rows, and I can't handle that.  Btw, I use a regex for
the header text because you can't be certain how many spaces will be in the
string.
- It would be nice if it could figure out how wide the table is.  The PDF has
the boundary/rectangle info, but I don't know how to get at it, so I am
telling it how wide the table is.
- I have to tell it the maximum number of table rows because, again, I
don't know how to get the boundary/rectangle info that marks where the
table ends.
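
For the regex point above, here is a rough sketch of a whitespace-tolerant
header match.  It uses only java.util.regex; the HeaderMatcher class and its
method names are hypothetical, and it handles a single-line header only, so
the multi-row header case remains unsolved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderMatcher {

    // Builds a pattern from the header's words, allowing any amount of
    // whitespace (including none) between consecutive words.
    static Pattern headerPattern(String... words) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(words[i])); // treat each word literally
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the character offset of the first match in the line, or -1.
    static int findHeader(String line, Pattern header) {
        Matcher m = header.matcher(line);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        Pattern table2 = headerPattern("Table", "2");
        System.out.println(findHeader("Table   2", table2));
        System.out.println(findHeader("Table 1", table2));
    }
}
```

Note that the offset returned is a character index in the extracted string,
not a page coordinate; mapping it back to an X position is a separate step.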

Other than this...it's working.

-Dave

P.S. The newer iText has a context-aware parsing strategy, but it costs
thousands of dollars... too rich for me.


On Thu, Dec 30, 2010 at 8:37 AM, Kevin Brown <kb...@gmail.com> wrote:

> Any luck with this? I couldn't figure a way to do this with PDFBox, or
> anything else.
>
> The best tool I've ever seen is something called BCL Jade which allows you
> to extract zones by selecting them. It's non programmable and not supported
> or sold any more!

Re: Context aware text extraction

Posted by Kevin Brown <kb...@gmail.com>.
Any luck with this? I couldn't figure out a way to do this with PDFBox, or
anything else.

The best tool I've ever seen is something called BCL Jade, which allows you
to extract zones by selecting them. It's not programmable, and it is no
longer supported or sold!
