Posted to users@pdfbox.apache.org by Matthias Böhm <st...@mail.uni-kiel.de> on 2011/03/23 10:13:26 UTC

Extracting text from tables

Hello everybody!

I'm using PDFBox 1.5 and want to extract text from tables in HTML mode with
PDFText2HTML, but I've run into the following problem. When one column
contains only a single line of text, like here:

<col 1>           <col 2>
one line of text  other content 1
                  other content 2

then pdfbox creates only *one* paragraph:

<p>one line of text other content 1
other content 2
</p>

What I want instead is:

<p>one line of text
</p>
<p>other content 1
other content 2
</p>

Is there any way to tell the PDF stripper to create a separate paragraph
for each column?
Note that this problem only occurs if there's *one* line of text in one
column. In the following case everything works as expected:

<col 1>              <col 2>
first line           first line in col 2
second line          second line in col 2

In that case you get

<p>first line
second line
</p>
<p>first line in col 2
second line in col 2
</p>

and that's just what I want. 
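
To make the grouping I'm after concrete, here is a minimal sketch in plain
Java. It is not PDFBox API: the class ColumnParagraphs, the method toHtml and
the fixed split offset are all hypothetical, and it assumes the column
boundary is already known as a character offset in fixed-width text, which a
real PDF does not give you directly. It only shows the output I would expect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnParagraphs {

    // Hypothetical helper, not part of PDFBox: cuts each fixed-width line at
    // character offset splitAt and emits one <p> per non-empty column.
    public static String toHtml(List<String> lines, int splitAt) {
        List<String> left = new ArrayList<String>();
        List<String> right = new ArrayList<String>();
        for (String line : lines) {
            boolean wide = line.length() > splitAt;
            String l = (wide ? line.substring(0, splitAt) : line).trim();
            String r = (wide ? line.substring(splitAt) : "").trim();
            if (!l.isEmpty()) left.add(l);
            if (!r.isEmpty()) right.add(r);
        }
        StringBuilder html = new StringBuilder();
        appendParagraph(html, left);
        appendParagraph(html, right);
        return html.toString();
    }

    // Writes the collected lines of one column as a single paragraph.
    private static void appendParagraph(StringBuilder html, List<String> column) {
        if (column.isEmpty()) return;
        html.append("<p>");
        for (String s : column) html.append(s).append("\n");
        html.append("</p>\n");
    }

    public static void main(String[] args) {
        // The problematic table from above; column 2 starts at offset 18.
        List<String> lines = Arrays.asList(
            "one line of text  other content 1",
            "                  other content 2");
        System.out.print(toHtml(lines, 18));
    }
}
```

With the first table from this mail and a split offset of 18, this yields one
paragraph for "one line of text" and a second one for the two lines of
column 2, regardless of how many lines each column has.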

Greetings, 
Matthias Böhm