Posted to users@pdfbox.apache.org by Matthias Böhm <st...@mail.uni-kiel.de> on 2011/03/23 10:13:26 UTC
Extracting text from tables
Hello everybody!
I'm using PDFBox 1.5 and want to extract text from tables in HTML mode
with PDFText2HTML, but the following problem occurs:
When there's only one line of text in one column, like here:
<col 1>            <col 2>
one line of text   other content 1
                   other content 2
then pdfbox creates only *one* paragraph:
<p>one line of text other content 1
other content 2
</p>
What I want instead is:
<p>one line of text
</p>
<p>other content 1
other content 2
</p>
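To make the desired grouping concrete, here is a minimal sketch in plain Java. It does not use the PDFBox API: the Fragment class, the coordinates, and the rule "same left edge means same column" are assumptions made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ColumnParagraphs {

    /** Hypothetical stand-in for a positioned line of text; not a PDFBox type. */
    static final class Fragment {
        final float x;     // left edge of the line (assumed identical within a column)
        final float y;     // distance from the top of the page
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Emits one <p> block per distinct left edge, columns ordered left to right. */
    static String toParagraphs(List<Fragment> fragments) {
        // TreeMap keeps the columns sorted by their x coordinate.
        Map<Float, List<Fragment>> columns = new TreeMap<>();
        for (Fragment f : fragments) {
            columns.computeIfAbsent(f.x, k -> new ArrayList<>()).add(f);
        }
        StringBuilder html = new StringBuilder();
        for (List<Fragment> column : columns.values()) {
            // Within a column, order the lines from top to bottom.
            column.sort((a, b) -> Float.compare(a.y, b.y));
            html.append("<p>");
            for (Fragment f : column) {
                html.append(f.text).append('\n');
            }
            html.append("</p>\n");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // The table from the example above, with invented coordinates.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(50f, 100f, "one line of text"));
        fragments.add(new Fragment(300f, 100f, "other content 1"));
        fragments.add(new Fragment(300f, 112f, "other content 2"));
        System.out.print(toParagraphs(fragments));
    }
}
```

With the invented coordinates above, the single line in the first column ends up in its own paragraph, which is the behaviour I would like to get from the stripper.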
Is there any way to tell the PDF stripper to create a separate paragraph
for each column?
Note that this problem only occurs if there's *one* line of text in one
column. In the following case everything works as expected:
<col 1>       <col 2>
first line    first line in col 2
second line   second line in col 2
There you get
<p>first line
second line
</p>
<p>first line in col 2
second line in col 2
</p>
and that's just what I want.
Greetings,
Matthias Böhm