Posted to users@pdfbox.apache.org by spud <sp...@gmail.com> on 2009/11/04 13:51:02 UTC

Extracting formatted text

A few months ago I was trying to extract formatted text from a PDF
and output it in a structured format (ideally XML/HTML). The text
attributes I needed for each line of text were:

- Paragraph (i.e. relative location on page)
- Font
- Font size
- Font weight
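
For illustration, the per-line attributes above could be modelled and
serialized roughly as follows. This is only a sketch of the desired
output shape, not PDFBox code: the FormattedLines and Line names, the
element and attribute names, the use of a paragraph index to stand in
for "relative location on page", and the rule that guesses weight from
the font name are all assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical model: none of these names come from PDFBox.
final class FormattedLines {

    /** One line of text plus the attributes listed above. */
    static final class Line {
        final int paragraph;    // assumed: index of the paragraph on the page
        final String fontName;  // e.g. "Helvetica-Bold"
        final float fontSize;   // in points
        final String text;

        Line(int paragraph, String fontName, float fontSize, String text) {
            this.paragraph = paragraph;
            this.fontName = fontName;
            this.fontSize = fontSize;
            this.text = text;
        }

        /** Heuristic (an assumption): a font is bold if its name says so. */
        String weight() {
            return fontName.toLowerCase(Locale.ROOT).contains("bold") ? "bold" : "normal";
        }
    }

    /** Escape characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Serialize lines, grouping consecutive lines that share a paragraph index. */
    static String toXml(List<Line> lines) {
        StringBuilder out = new StringBuilder("<page>\n");
        boolean open = false;
        int current = -1;
        for (Line l : lines) {
            if (!open || l.paragraph != current) {
                if (open) {
                    out.append("  </paragraph>\n");
                }
                out.append("  <paragraph index=\"").append(l.paragraph).append("\">\n");
                current = l.paragraph;
                open = true;
            }
            out.append("    <line font=\"").append(escape(l.fontName))
               .append("\" size=\"").append(l.fontSize)
               .append("\" weight=\"").append(l.weight()).append("\">")
               .append(escape(l.text)).append("</line>\n");
        }
        if (open) {
            out.append("  </paragraph>\n");
        }
        return out.append("</page>\n").toString();
    }

    public static void main(String[] args) {
        List<Line> lines = List.of(
                new Line(0, "Helvetica-Bold", 18f, "Extracting formatted text"),
                new Line(1, "Times-Roman", 10f, "A few months ago I was trying"),
                new Line(1, "Times-Roman", 10f, "to extract formatted text."));
        System.out.print(toXml(lines));
    }
}
```

How the Line values get filled in from a real PDF is exactly the open
question; the sketch only pins down what the structured output might
look like once those values are available.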

I tried to do this with PDFBox at the time but was unable to. I posted
to the mailing list and was told this functionality was not available
yet, and I would have to implement it myself. I didn't have the time
(and possibly the ability) to do this, so I went with a commercial
tool.

Has PDFBox now moved on enough for it to be able to do the above out
of the box (no pun intended!)?

Thanks.

Re: Extracting formatted text

Posted by Shen Wang <fe...@gmail.com>.
Hi Spud:

At least not for me. The TextPosition object appears to have the relevant
methods for working with the font, and such objects are said to be
available from the PDFTextStripper class. However, I tried both writing
code and reading through the source code, and I am now doubly convinced
that things do not work that way. I have posted my questions about this
here, but so far nobody has been able to give an answer.

Best,

Felix

