Posted to users@pdfbox.apache.org by Dane Bezuidenhout <da...@sprinthive.com> on 2017/07/18 13:28:35 UTC

How to logically read text from a PDF table?

The available examples are clear on constructing a table, but there is
little information on reading one. I've investigated a few solutions to
this, but feel that they are "hacky" in that they rely on establishing
column and row regions to read text from.

Surely there is a canonical way to traverse the PDDocument table elements
and access table cells by row and column?

Any advice would be appreciated.


Dane Bezuidenhout
SprintHive <https://sprinthive.com/>

M: +27 82 562 7850


vCard <http://www.sprinthive.com/files/dane.vcf>

Re: How to logically read text from a PDF table?

Posted by Dane Bezuidenhout <da...@sprinthive.com>.
Hi Manuel,

Thank you for the fast response, I will investigate Tabula.

Regards,

Dane

Dane Bezuidenhout
SprintHive <https://sprinthive.com/>

M: +27 82 562 7850


vCard <http://www.sprinthive.com/files/dane.vcf>

On Tue, Jul 18, 2017 at 5:31 PM, Manuel Aristarán <ma...@jazzido.com>
wrote:

> Hi Dane,
>
> As you might know, there's no such thing as a table in PDF files. The only
> way to extract one is to try to reconstruct the tabular arrangement from
> the characters' positions, ruling lines, and so on. I'm one of the
> maintainers of Tabula [1], a tool based on PDFBox that implements a
> number of algorithms to attempt that. We have a GUI tool [2] and a Java
> library [3]. Both are open source (MIT license).
>
> Best,
>
> [1] http://tabula.technology
> [2] https://github.com/tabulapdf/tabula
> [3] https://github.com/tabulapdf/tabula-java
>
> --
> Manuel Aristarán
> jazzido.com
>

Re: How to logically read text from a PDF table?

Posted by Manuel Aristarán <ma...@jazzido.com>.
Hi Dane,

As you might know, there's no such thing as a table in PDF files. The only
way to extract one is to try to reconstruct the tabular arrangement from
the characters' positions, ruling lines, and so on. I'm one of the
maintainers of Tabula [1], a tool based on PDFBox that implements a
number of algorithms to attempt that. We have a GUI tool [2] and a Java
library [3]. Both are open source (MIT license).

Best,

[1] http://tabula.technology
[2] https://github.com/tabulapdf/tabula
[3] https://github.com/tabulapdf/tabula-java

--
Manuel Aristarán
jazzido.com
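
To illustrate what "reconstructing the tabular arrangement from the
characters' positions" can look like, here is a minimal sketch using only
the Java standard library. It is not Tabula's algorithm and not a PDFBox
API: the TableSketch class, the Fragment type, and the two tolerance
parameters are hypothetical names made up for this example. It assumes you
have already collected text fragments with top-left coordinates where y
grows downward (in PDFBox these could plausibly come from the TextPosition
objects handed to a PDFTextStripper subclass), and it ignores ruling lines,
spanning cells, and wrapped cell text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: rebuilds a grid from positioned text fragments. */
class TableSketch {

    /** A piece of text plus the coordinates of its top-left corner (illustrative type). */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Sorts the coordinates and keeps one anchor per cluster: a value starts
     * a new cluster when it lies more than `tolerance` past the last anchor.
     */
    static List<Double> cluster(List<Double> values, double tolerance) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Double> anchors = new ArrayList<>();
        for (double v : sorted) {
            if (anchors.isEmpty() || v - anchors.get(anchors.size() - 1) > tolerance) {
                anchors.add(v);
            }
        }
        return anchors;
    }

    /** Returns the index of the anchor closest to v. */
    static int nearest(List<Double> anchors, double v) {
        int best = 0;
        for (int i = 1; i < anchors.size(); i++) {
            if (Math.abs(anchors.get(i) - v) < Math.abs(anchors.get(best) - v)) {
                best = i;
            }
        }
        return best;
    }

    /** Assigns every fragment to a (row, column) cell; cells without text stay "". */
    static List<List<String>> toGrid(List<Fragment> fragments, double rowTol, double colTol) {
        List<Double> xs = new ArrayList<>();
        List<Double> ys = new ArrayList<>();
        for (Fragment f : fragments) {
            xs.add(f.x);
            ys.add(f.y);
        }
        List<Double> cols = cluster(xs, colTol); // left edges that roughly line up
        List<Double> rows = cluster(ys, rowTol); // baselines that roughly line up

        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            grid.add(new ArrayList<>(Collections.nCopies(cols.size(), "")));
        }
        for (Fragment f : fragments) {
            List<String> row = grid.get(nearest(rows, f.y));
            int c = nearest(cols, f.x);
            String existing = row.get(c);
            // Fragments that fall into the same cell are joined with a space.
            row.set(c, existing.isEmpty() ? f.text : existing + " " + f.text);
        }
        return grid;
    }
}
```

Calling TableSketch.toGrid(fragments, 3.0, 5.0) returns a list of rows,
each a list of cell strings. The tolerances are in the same units as the
coordinates and will need tuning per document. Real tables (right-aligned
numbers, merged cells, multi-line cells) defeat this kind of naive
clustering quickly, which is the problem the more elaborate heuristics in
a tool like Tabula set out to handle.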


