Posted to users@pdfbox.apache.org by Ray Weidner <ra...@gmail.com> on 2011/08/22 23:13:02 UTC
extracting vector graphics
Hi,
I'm currently using PDFBox to provide me with text/location information in
order to heuristically detect table structures in a document. One way I'd
like to enhance this is by making use of actual grid lines, when they are
present. To do this, I believe I need to extract the vector graphics
commands from the document.
I found one helpful post on this matter in the mail archives (
http://mail-archives.apache.org/mod_mbox/pdfbox-users/200902.mbox/browser).
The recommendation was simply to override PageDrawer in order to intercept
graphics commands. This sounds like a good idea, but I'm totally unsure of
how to interpret the calls that I should be intercepting. Can anyone give
me some advice here, or point me to a document that should make things
clearer?
Please be aware that I am a newbie to both PDFBox and the PDF document
standard, so don't assume too much about what I already know.
Thanks in advance.
Ray Weidner
Re: extracting vector graphics
Posted by Ray Weidner <ra...@gmail.com>.
Thanks for the advice, Daniel. This helps to point me in the right
direction. However, I'm still a bit confused, so please bear with me.
Looking at the code for PageDrawer#drawPage, I see that it is iterating
through the PDAnnotations that belong to the PDPage object. This leads me
to two questions:
1) If I already have the PDPage object, then isn't it unnecessary to call
PageDrawer, i.e., can't I just read the PDAnnotations from wherever I'd be
calling #drawPage?
2) I'm still not sure how to interpret these objects to obtain the vector
graphics information. I'm guessing that it's a matter of iterating through
the PDAnnotations and looking for vector graphics commands. But is this
correct or am I missing something?
Sorry for the total newbieness, and thanks for any help you (or anyone else)
can provide.
Ray Weidner
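
A note on question 2, offered as a rough sketch rather than as PDFBox's
actual code: annotations (PDAnnotation) describe things like links and form
fields, while ruling lines normally live in the page's content stream as path
operators. In the PDF specification, "x y m" starts a subpath, "x y l" adds a
line, "x y w h re" adds a rectangle, "h" closes the subpath, and "S" strokes
what has been built. The toy class below shows how such a stream turns into
line segments. PathOperatorSketch and strokedSegments are hypothetical names,
and the parser assumes whitespace-separated numbers plus only those
operators. A real content stream also contains strings, names, and arrays,
which is the parsing PDFBox does for you before hooks such as strokePath are
reached.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy parser, not part of PDFBox. It understands only the
// path operators m, l, re, h, S, f and n; everything else is ignored.
class PathOperatorSketch {

    /** Returns stroked segments as {x1, y1, x2, y2} arrays. */
    static List<double[]> strokedSegments(String content) {
        List<double[]> stroked = new ArrayList<double[]>();
        List<double[]> pending = new ArrayList<double[]>(); // current path
        Deque<Double> operands = new ArrayDeque<Double>();
        double startX = 0, startY = 0, curX = 0, curY = 0;

        for (String tok : content.trim().split("\\s+")) {
            if (tok.equals("m")) {                 // x y m: begin subpath
                curY = operands.removeLast();
                curX = operands.removeLast();
                startX = curX;
                startY = curY;
            } else if (tok.equals("l")) {          // x y l: line to (x, y)
                double y = operands.removeLast();
                double x = operands.removeLast();
                pending.add(new double[] {curX, curY, x, y});
                curX = x;
                curY = y;
            } else if (tok.equals("re")) {         // x y w h re: rectangle
                double h = operands.removeLast();
                double w = operands.removeLast();
                double y = operands.removeLast();
                double x = operands.removeLast();
                pending.add(new double[] {x, y, x + w, y});
                pending.add(new double[] {x + w, y, x + w, y + h});
                pending.add(new double[] {x + w, y + h, x, y + h});
                pending.add(new double[] {x, y + h, x, y});
                startX = curX = x;
                startY = curY = y;
            } else if (tok.equals("h")) {          // h: close subpath
                pending.add(new double[] {curX, curY, startX, startY});
                curX = startX;
                curY = startY;
            } else if (tok.equals("S")) {          // S: stroke current path
                stroked.addAll(pending);
                pending.clear();
            } else if (tok.equals("f") || tok.equals("n")) {
                pending.clear();                   // filled or discarded, not stroked
            } else {
                try {
                    operands.addLast(Double.parseDouble(tok));
                } catch (NumberFormatException e) {
                    operands.clear();              // operator this sketch ignores
                }
            }
        }
        return stroked;
    }

    public static void main(String[] args) {
        List<double[]> segs = strokedSegments("10 10 m 110 10 l S 0 0 100 50 re S");
        System.out.println(segs.size() + " stroked segments");
    }
}
```

The coordinates here are in PDF user space, where the origin is normally at
the bottom-left of the page and no transformation matrix has been applied.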
On Mon, Aug 22, 2011 at 6:41 PM, Daniel Wilson <
williamstonconsulting@gmail.com> wrote:
> The big one to override, IMO, is drawPage.
>
> For a different application, I also override:
>
> - processTextPosition
> - fillPath
> - setStroke
> - getStroke
> - strokePath -- this might be key for your application ...
> - drawImage
>
> hope this helps.
>
> Daniel
Re: extracting vector graphics
Posted by Daniel Wilson <wi...@gmail.com>.
The big one to override, IMO, is drawPage.
For a different application, I also override:
- processTextPosition
- fillPath
- setStroke
- getStroke
- strokePath -- this might be key for your application ...
- drawImage
Hope this helps.
Daniel
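
To make the strokePath suggestion a little more concrete, here is a minimal
sketch of what an override could do with the path it intercepts. It assumes
the path being stroked is available as a java.awt.geom.GeneralPath (in the
PDFBox 1.x sources I believe PageDrawer exposes it through getLinePath(), but
please check that against your version). GridLineSketch, Segment, and
axisAlignedSegments are hypothetical names, not PDFBox API. The code uses only
the JDK: it walks the path and keeps the straight segments that are horizontal
or vertical within a tolerance, which are the candidates for table grid lines.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class GridLineSketch {

    /** One straight segment, in the path's own coordinate space. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }

        boolean isHorizontal(double tol) {
            return Math.abs(y1 - y2) <= tol && Math.abs(x1 - x2) > tol;
        }

        boolean isVertical(double tol) {
            return Math.abs(x1 - x2) <= tol && Math.abs(y1 - y2) > tol;
        }
    }

    /** Collects the horizontal and vertical straight segments of a path. */
    static List<Segment> axisAlignedSegments(GeneralPath path, double tol) {
        List<Segment> out = new ArrayList<Segment>();
        double[] c = new double[6];
        double startX = 0, startY = 0, curX = 0, curY = 0;
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(c)) {
                case PathIterator.SEG_MOVETO:
                    startX = curX = c[0];
                    startY = curY = c[1];
                    break;
                case PathIterator.SEG_LINETO:
                    keep(out, new Segment(curX, curY, c[0], c[1]), tol);
                    curX = c[0];
                    curY = c[1];
                    break;
                case PathIterator.SEG_CLOSE:
                    // Closing draws a line back to the subpath start.
                    keep(out, new Segment(curX, curY, startX, startY), tol);
                    curX = startX;
                    curY = startY;
                    break;
                case PathIterator.SEG_QUADTO:
                    // Curves are skipped; only track where they end.
                    curX = c[2];
                    curY = c[3];
                    break;
                case PathIterator.SEG_CUBICTO:
                    curX = c[4];
                    curY = c[5];
                    break;
                default:
                    break;
            }
        }
        return out;
    }

    private static void keep(List<Segment> out, Segment s, double tol) {
        if (s.isHorizontal(tol) || s.isVertical(tol)) {
            out.add(s);
        }
    }

    public static void main(String[] args) {
        GeneralPath cell = new GeneralPath();
        cell.moveTo(10f, 10f);
        cell.lineTo(110f, 10f);
        cell.lineTo(110f, 60f);
        cell.lineTo(10f, 60f);
        cell.closePath();
        List<Segment> lines = axisAlignedSegments(cell, 0.5);
        System.out.println(lines.size() + " axis-aligned segments");
    }
}
```

In an overridden strokePath one would call something like this on the current
path before delegating to the superclass. Some PDFs draw rules as thin filled
rectangles rather than stroked lines, so the same check may be worth applying
in fillPath as well.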