
Posted to users@pdfbox.apache.org by Alina Babenko <ha...@ukr.net> on 2017/05/14 11:52:58 UTC

Tabular Data Extracting

Hello, my name is Alina, and I'm a student from Ukraine. For a student project I'm trying to extract data from a table in a PDF using C#. I've used PDFBox 2.0, but it only gave me the data as plain text, with no link to the table cells. Is it possible to get data from specific cells of a table using PDFBox 2.0? Do you have any examples? Could you help me solve this problem?
I would really appreciate your help.
Thank you so much.

-- 
A. Babenko

Re: Tabular Data Extracting

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

We have recently developed a table extraction tool, norma [1], which uses
PDFBox to output intermediate SVG files.

Dense, gridded rectangular tables are relatively easy to extract. But many
tables are not actually simple rectangular tables - they are a messy
product of tree-structured columns and nested subtables (e.g. APS table
format). We are developing tools [2] to create semantic tables based on
Hadley Wickham's "Tidy Data" [3].
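
To make the "tidy" target concrete, here is a minimal Java sketch that
flattens a table with a two-level (tree-structured) column header into one
observation per row. It is only an illustration of the layout described in
[3], not how norma or cm-ucl work internally. The class TidyTable, its
method tidy, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TidyTable {

    /** One observation: row label, top-level header, sub-header, cell value. */
    static final class Observation {
        final String row, group, measure, value;

        Observation(String row, String group, String measure, String value) {
            this.row = row;
            this.group = group;
            this.measure = measure;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + " | " + group + " | " + measure + " | " + value;
        }
    }

    /**
     * Flattens a table with a two-level column header into tidy rows.
     * An empty entry in groups means "same group as the column to the left",
     * which is how a header cell spanning several columns is modelled here.
     */
    static List<Observation> tidy(String[] groups, String[] measures,
                                  String[] rowLabels, String[][] cells) {
        List<Observation> out = new ArrayList<>();
        for (int r = 0; r < rowLabels.length; r++) {
            String group = "";
            for (int c = 0; c < measures.length; c++) {
                if (!groups[c].isEmpty()) {
                    group = groups[c];   // a new spanning header starts here
                }
                out.add(new Observation(rowLabels[r], group, measures[c], cells[r][c]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up example data, not taken from any real paper.
        String[] groups = {"Control", "", "Treated", ""};
        String[] measures = {"mean", "sd", "mean", "sd"};
        String[] rows = {"Week 1", "Week 2"};
        String[][] cells = {
            {"1.0", "0.1", "2.0", "0.2"},
            {"1.5", "0.1", "2.5", "0.3"},
        };
        tidy(groups, measures, rows, cells).forEach(System.out::println);
    }
}
```

Each output row carries its full header path, so the nested header no
longer has to be reconstructed from the visual layout.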


[1] https://github.com/ContentMine/norma
[2] https://github.com/ContentMine/cm-ucl
[3] http://vita.had.co.nz/papers/tidy-data.html

On Sun, May 14, 2017 at 1:15 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> On 14.05.2017 at 13:52, Alina Babenko wrote:
>
>> Hello, my name is Alina, and I'm a student from Ukraine. For a student
>> project I'm trying to extract data from a table in a PDF using C#. I've
>> used PDFBox 2.0, but it only gave me the data as plain text, with no link
>> to the table cells. Is it possible to get data from specific cells of a
>> table using PDFBox 2.0? Do you have any examples? Could you help me solve
>> this problem?
>> I would really appreciate your help.
>> Thank you so much.
>>
>>
> Hello Alina,
>
> PDFBox does not support tabular data extraction because PDF doesn't
> support it (except in tagged PDFs, which don't appear often). You may want
> to have a look at tabula.
>
> http://tabula.technology/
>
> If you know the positions in advance, you can use PDFBox: search the
> source code examples for ExtractTextByArea.
>
> Tilman


-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Tabular Data Extracting

Posted by Tilman Hausherr <TH...@t-online.de>.

On 14.05.2017 at 13:52, Alina Babenko wrote:
> Hello, my name is Alina, and I'm a student from Ukraine. For a student project I'm trying to extract data from a table in a PDF using C#. I've used PDFBox 2.0, but it only gave me the data as plain text, with no link to the table cells. Is it possible to get data from specific cells of a table using PDFBox 2.0? Do you have any examples? Could you help me solve this problem?
> I would really appreciate your help.
> Thank you so much.
>

Hello Alina,

PDFBox does not support tabular data extraction because PDF doesn't 
support it (except in tagged PDFs, which don't appear often). You may 
want to have a look at tabula.

http://tabula.technology/

If you know the positions in advance, you can use PDFBox: search the
source code examples for ExtractTextByArea.
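
As a rough illustration of the "known positions" approach, the Java sketch
below turns known column and row boundaries into one named rectangle per
cell. It uses only the JDK so that it runs on its own; the class
CellRegions, the method grid, the region names and the coordinates are all
made up for this example. The commented lines show how the rectangles could
be handed to PDFBox 2.0's PDFTextStripperByArea, the class used by the
ExtractTextByArea example. The assumption that y grows downward from the
top of the page should be checked against your own documents.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

final class CellRegions {

    /**
     * Builds one named rectangle per table cell from known grid lines.
     * xs holds the column boundaries (left to right) and ys the row
     * boundaries (top to bottom), in the coordinates the extractor expects.
     */
    static Map<String, Rectangle2D> grid(double[] xs, double[] ys) {
        if (xs.length < 2 || ys.length < 2) {
            throw new IllegalArgumentException("need at least two boundaries per axis");
        }
        Map<String, Rectangle2D> cells = new LinkedHashMap<>();
        for (int r = 0; r + 1 < ys.length; r++) {
            for (int c = 0; c + 1 < xs.length; c++) {
                cells.put("r" + r + "c" + c, new Rectangle2D.Double(
                        xs[c], ys[r], xs[c + 1] - xs[c], ys[r + 1] - ys[r]));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        double[] xs = {50, 200, 350};   // hypothetical column edges
        double[] ys = {100, 120, 140};  // hypothetical row edges
        Map<String, Rectangle2D> cells = grid(xs, ys);
        cells.forEach((name, rect) -> System.out.println(name + " -> " + rect));

        // With PDFBox 2.0 on the classpath, the regions could then be used like this
        // (page is a PDPage taken from a loaded PDDocument):
        //   PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        //   cells.forEach(stripper::addRegion);
        //   stripper.extractRegions(page);
        //   String text = stripper.getTextForRegion("r0c1");
    }
}
```

This only works when the table sits at the same position on every page; for
tables whose layout varies, a detection tool such as tabula is the better
starting point.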

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org