Posted to users@pdfbox.apache.org by Kaushlendra Singh <si...@gmail.com> on 2020/10/13 17:08:11 UTC

Table Extraction

Hi,
I need to extract meaningful text from tables in a PDF document.
PDFBox doesn't provide an API for this directly, but while searching I found
https://gist.github.com/beldaz/8ed6e7473bd228fcee8d4a3e4525be11 which
helped me get meaningful text. Internally it creates regions within the
document and then fetches the text from each region.
Besides the text, I also need the coordinates of each line and of the words
in that line, per row/column. Is there any way to achieve this?
Any guidance would be very helpful.

-- 
Thanks & Regards
Kaushlendra Singh
Email: singh.kaushlendra016@gmail.com
Phone: +91 8377094564

Re: Table Extraction

Posted by Peter Murray-Rust <pe...@googlemail.com.INVALID>.

A word of warning: extracting tables in general is very hard. I spent last
year developing code based on PDFBox to extract data *automatically* from a
very limited subset of tables. It may be easier if you can interact with
each table manually, but that takes time.
(See also Tabula, which pioneered interactive table extraction using
PDFBox.)

If you have very simple rectangular tables where everything is in a (row,
column) cell with consistent gutters and placement, you can achieve
something. However, tables vary widely and wildly in their formatting. If
you are prepared to identify and clip tables manually, that makes it easier.
But there are many problems: multiline text in cells, sub/superscripts,
tree-structured column headings, tables running over multiple pages,
rotated tables, and tables which are actually lists, matrices, layouts, etc.
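
For the simple case of consistent gutters, here is a minimal sketch of how
columns might be found from word extents alone. It is not PDFBox API and not
the ami3 or Tabula method: the Span class, findColumns, columnOf and the
minGutter threshold are hypothetical names made up for illustration, and the
x-extents are assumed to have been collected already from one clipped table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class GutterColumns {

    /** Horizontal extent of one word (hypothetical stand-in, not a PDFBox class). */
    static final class Span {
        final double x0;
        final double x1;

        Span(double x0, double x1) {
            this.x0 = x0;
            this.x1 = x1;
        }
    }

    /**
     * Merges word extents that overlap or lie closer than minGutter.
     * Each merged extent is treated as one column; the empty bands
     * between them are the gutters.
     */
    static List<Span> findColumns(List<Span> words, double minGutter) {
        List<Span> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble(s -> s.x0));

        List<Span> columns = new ArrayList<>();
        boolean open = false;
        double start = 0;
        double end = 0;
        for (Span s : sorted) {
            if (!open) {
                start = s.x0;
                end = s.x1;
                open = true;
            } else if (s.x0 - end < minGutter) {
                // Still inside the current column: extend it.
                end = Math.max(end, s.x1);
            } else {
                // Gap is wide enough to count as a gutter: close the column.
                columns.add(new Span(start, end));
                start = s.x0;
                end = s.x1;
            }
        }
        if (open) {
            columns.add(new Span(start, end));
        }
        return columns;
    }

    /** Returns the index of the column containing the word's midpoint, or -1. */
    static int columnOf(Span word, List<Span> columns) {
        double mid = (word.x0 + word.x1) / 2;
        for (int i = 0; i < columns.size(); i++) {
            Span c = columns.get(i);
            if (mid >= c.x0 && mid <= c.x1) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up extents for words taken from several rows of one table.
        List<Span> words = Arrays.asList(
                new Span(10, 40), new Span(12, 35),
                new Span(100, 130), new Span(105, 150),
                new Span(200, 220));
        for (Span c : findColumns(words, 20.0)) {
            System.out.println("column from " + c.x0 + " to " + c.x1);
        }
    }
}
```

This only works when no cell content spans a gutter. A heading that covers
several columns, or a wide multiline cell, will merge the columns beneath it,
which is one reason the problems listed above are hard.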

If you are mining a *large* number of tables in a consistent format,
then it can probably be automated to quite an extent. I can point you at my
code (http://github.com/petermr/ami3, based on PDFBox). Otherwise try
Tabula, which is manual: https://tabula.technology/
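
On the coordinates part of the question, a rough sketch of turning
per-character positions into word bounding boxes is below. As far as I know,
PDFBox 2.x hands such positions to PDFTextStripper.writeString(String,
List<TextPosition>), and TextPosition has getters such as getUnicode(),
getXDirAdj(), getYDirAdj(), getWidthDirAdj() and getHeightDir(); please check
that against the version you use. The sketch itself does not call PDFBox: the
Glyph and Word classes, the group method and the maxGap threshold are
hypothetical, and the glyphs are assumed to arrive in reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WordBoxes {

    /** One positioned character (hypothetical stand-in, not a PDFBox class). */
    static final class Glyph {
        final String text;
        final double x, y, width, height;

        Glyph(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** A word with its bounding box: left x, line y, width and height. */
    static final class Word {
        final String text;
        final double x, y, width, height;

        Word(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Groups glyphs into words. A word ends at a whitespace glyph, at a
     * horizontal gap larger than maxGap, or when the y position moves by
     * more than half a glyph height (taken here as a change of line).
     */
    static List<Word> group(List<Glyph> glyphs, double maxGap) {
        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        double x0 = 0, x1 = 0, y = 0, h = 0;
        for (Glyph g : glyphs) {
            boolean blank = g.text.trim().isEmpty();
            boolean open = text.length() > 0;
            boolean newLine = Math.abs(g.y - y) > g.height / 2;
            if (open && (blank || g.x - x1 > maxGap || newLine)) {
                words.add(new Word(text.toString(), x0, y, x1 - x0, h));
                text.setLength(0);
            }
            if (blank) {
                continue; // whitespace separates words but is not part of one
            }
            if (text.length() == 0) {
                // First glyph of a new word fixes its origin.
                x0 = g.x;
                x1 = g.x + g.width;
                y = g.y;
                h = g.height;
            } else {
                x1 = Math.max(x1, g.x + g.width);
                h = Math.max(h, g.height);
            }
            text.append(g.text);
        }
        if (text.length() > 0) {
            words.add(new Word(text.toString(), x0, y, x1 - x0, h));
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up glyph positions for the text "Hi a" followed by a distant "b".
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 10, 100, 5, 8), new Glyph("i", 15, 100, 2, 8),
                new Glyph(" ", 17, 100, 3, 8), new Glyph("a", 20, 100, 5, 8),
                new Glyph("b", 60, 100, 5, 8));
        for (Word w : group(glyphs, 3.0)) {
            System.out.println(w.text + " at x=" + w.x + " y=" + w.y
                    + " w=" + w.width + " h=" + w.height);
        }
    }
}
```

Words grouped this way could then be assigned to rows by y and to columns by
x. Sub/superscripts and rotated text will break the simple same-line test.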

P.

-- 
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK