Posted to dev@pdfbox.apache.org by Ray Weidner <ra...@gmail.com> on 2012/01/26 22:31:15 UTC

extracting grid lines for PDF tables

Hi,

I'm currently using PDFBox for an application that detects table structures
in PDF documents.  So far, I do this by extending PDFTextStripper and
using the character position and font data to heuristically detect
table-like text formatting.  This works pretty well, but we would like to
improve it, if possible, by analyzing the vector graphics on the page to
detect table-like grid lines.  That should improve accuracy and make it
easier to parse more complex table structures.
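
To make the goal concrete, here is a rough sketch of the kind of grid
analysis I have in mind.  It assumes the line segments have already been
pulled out of the page somehow; how to get them out of PDFBox is exactly the
open question.  GridSketch and Segment are hypothetical names of my own, not
PDFBox classes, and the tolerance handling is just a guess at something
reasonable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, NOT part of PDFBox: looks for table-like grid lines
// among line segments that were already extracted from a page by other means.
final class GridSketch {

    /** A straight line segment in page coordinates (assumed input format). */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        // Nearly constant y and a real extent in x.
        boolean isHorizontal(double tol) {
            return Math.abs(y1 - y2) <= tol && Math.abs(x1 - x2) > tol;
        }

        // Nearly constant x and a real extent in y.
        boolean isVertical(double tol) {
            return Math.abs(x1 - x2) <= tol && Math.abs(y1 - y2) > tol;
        }
    }

    /**
     * Sorts the values and drops any value that lies within tol of the
     * last value kept, so near-duplicate rules collapse into one boundary.
     */
    static List<Double> cluster(List<Double> values, double tol) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Double> kept = new ArrayList<>();
        for (double v : sorted) {
            if (kept.isEmpty() || v - kept.get(kept.size() - 1) > tol) {
                kept.add(v);
            }
        }
        return kept;
    }

    /** Distinct y positions of the horizontal segments. */
    static List<Double> rowBoundaries(List<Segment> segments, double tol) {
        List<Double> ys = new ArrayList<>();
        for (Segment s : segments) {
            if (s.isHorizontal(tol)) {
                ys.add((s.y1 + s.y2) / 2.0);
            }
        }
        return cluster(ys, tol);
    }

    /** Distinct x positions of the vertical segments. */
    static List<Double> columnBoundaries(List<Segment> segments, double tol) {
        List<Double> xs = new ArrayList<>();
        for (Segment s : segments) {
            if (s.isVertical(tol)) {
                xs.add((s.x1 + s.x2) / 2.0);
            }
        }
        return cluster(xs, tol);
    }

    /** Crude test: at least two row and two column boundaries, i.e. one cell. */
    static boolean looksLikeTable(List<Segment> segments, double tol) {
        return rowBoundaries(segments, tol).size() >= 2
                && columnBoundaries(segments, tol).size() >= 2;
    }
}
```

The row and column boundaries could then be matched against the text
positions I already collect, so each run of characters is assigned to the
cell that encloses it.  This sketch ignores whether the lines actually
intersect and treats thin filled rectangles as out of scope, so it is only a
starting point.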

So how can I do this, and is it even possible?  I'm not at all an expert in
PDFBox or the PDF standard, so I don't yet know if this can be done (for
instance, if table grids are usually formed from background images, this
is probably not feasible within our time frame).  Please bear with my
newbishness.

Thanks in advance!

Ray Weidner