Posted to users@pdfbox.apache.org by Ray Weidner <ra...@gmail.com> on 2011/08/22 23:13:02 UTC

extracting vector graphics

Hi,

I'm currently using PDFBox to provide me with text/location information in
order to heuristically detect table structures in a document.  One way I'd
like to enhance this is by making use of actual grid lines, when they are
present.  To do this, I believe I need to extract the vector graphics
commands from the document.

I found one helpful post on this matter in the mail archives (
http://mail-archives.apache.org/mod_mbox/pdfbox-users/200902.mbox/browser).
The recommendation was simply to override PageDrawer in order to intercept
graphics commands.  This sounds like a good idea, but I'm totally unsure of
how to interpret the calls that I should be intercepting.  Can anyone give
me some advice here, or point me to a document that should make things
clearer?

Please be aware that I am a newbie to both PDFBox and the PDF document
standard, so don't assume too much about what I already know.
Thanks in advance.

Ray Weidner

Re: extracting vector graphics

Posted by Ray Weidner <ra...@gmail.com>.

Thanks for the advice, Daniel.  This helps to point me in the right
direction.  However, I'm still a bit confused, so please bear with me.

Looking at the code for PageDrawer#drawPage, I see that it is iterating
through the PDAnnotations that belong to the PDPage object.  This leads me
to two questions:

1) If I already have the PDPage object, then isn't it unnecessary to call
PageDrawer, i.e., can't I just read the PDAnnotations from wherever I'd be
calling #drawPage?

2) I'm still not sure how to interpret these objects to obtain the vector
graphics information.  I'm guessing that it's a matter of iterating through
the PDAnnotations and looking for vector graphics commands.  But is this
correct or am I missing something?

Sorry for the total newbieness, and thanks for any help you (or anyone else)
can provide.

Ray Weidner

On Mon, Aug 22, 2011 at 6:41 PM, Daniel Wilson <
williamstonconsulting@gmail.com> wrote:

> The big one to override, IMO, is drawPage.
>
> For a different application, I also override:
>
>   - processTextPosition
>   - fillPath
>   - setStroke
>   - getStroke
>   - strokePath -- this might be key for your application ...
>   - drawImage
>
> Hope this helps.
>
> Daniel

Re: extracting vector graphics

Posted by Daniel Wilson <wi...@gmail.com>.

The big one to override, IMO, is drawPage.

For a different application, I also override:

   - processTextPosition
   - fillPath
   - setStroke
   - getStroke
   - strokePath -- this might be key for your application ...
   - drawImage

Hope this helps.

Daniel
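
To make the strokePath suggestion above a little more concrete, here is a
minimal sketch of what an override might do with the path it intercepts. It
is not PDFBox code: GridLineCollector and its methods are hypothetical names
invented for illustration, and it relies only on java.awt.geom. It assumes
the PageDrawer subclass can get at the current path as a java.awt.geom path
(in the PDFBox releases of this period that appears to be getLinePath(), but
please check your version) and hands it to addPath before calling
super.strokePath(). Coordinates are taken as-is, so any page transform still
has to be applied by the caller.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Path2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): collects axis-aligned line segments. */
final class GridLineCollector {
    // Assumed tolerance, in path units, for treating a segment as horizontal/vertical.
    private static final double TOLERANCE = 0.5;

    private final List<Line2D.Double> horizontal = new ArrayList<>();
    private final List<Line2D.Double> vertical = new ArrayList<>();

    /** Walks one path and records its straight, axis-aligned segments. */
    void addPath(Path2D path) {
        double[] c = new double[6];
        double startX = 0, startY = 0; // start of the current subpath
        double curX = 0, curY = 0;     // current point
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(c)) {
                case PathIterator.SEG_MOVETO:
                    startX = curX = c[0];
                    startY = curY = c[1];
                    break;
                case PathIterator.SEG_LINETO:
                    addSegment(curX, curY, c[0], c[1]);
                    curX = c[0];
                    curY = c[1];
                    break;
                case PathIterator.SEG_CLOSE:
                    // Closing a subpath draws a line back to its starting point.
                    addSegment(curX, curY, startX, startY);
                    curX = startX;
                    curY = startY;
                    break;
                case PathIterator.SEG_QUADTO:
                    // Curves are not grid lines; just track the end point.
                    curX = c[2];
                    curY = c[3];
                    break;
                case PathIterator.SEG_CUBICTO:
                    curX = c[4];
                    curY = c[5];
                    break;
                default:
                    break;
            }
        }
    }

    private void addSegment(double x1, double y1, double x2, double y2) {
        double dx = Math.abs(x2 - x1);
        double dy = Math.abs(y2 - y1);
        if (dy <= TOLERANCE && dx > TOLERANCE) {
            horizontal.add(new Line2D.Double(x1, y1, x2, y2));
        } else if (dx <= TOLERANCE && dy > TOLERANCE) {
            vertical.add(new Line2D.Double(x1, y1, x2, y2));
        }
        // Diagonal and zero-length segments are ignored.
    }

    List<Line2D.Double> horizontalLines() { return horizontal; }

    List<Line2D.Double> verticalLines() { return vertical; }
}
```

The same helper could be fed from a fillPath override as well, since some
documents draw table rules as thin filled rectangles rather than stroked
lines; whether that applies depends on how the PDF was produced.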