Posted to users@pdfbox.apache.org by André Ramos <an...@gmail.com> on 2009/07/07 19:48:40 UTC

Extracting Features From Text

Hello,

I'd like to use PDFBox to extract text with special features like: bold
text, italicized text, text whose font size is above average and so on. The
idea is that any kind of highlighted text or any text formatted out of the
ordinary within a document must contain relevant terms to describe the
document.

How can I do it?

Thank you.

-- 
Best regards,
André Ramos

Re: Extracting Features From Text

Posted by Robert Pesch <rp...@scai.fraunhofer.de>.
Hello André,

Have a look at PDFTextStripper. It collects tokens from a given
document (so-called TextPositions). A TextPosition object has a
method called getFont, which returns the font object encapsulating the
font information for the current token. You can retrieve the base font
name from the font object (the PostScript name of the font) and check
whether it ends with a suffix such as -bold or similar (this is at
least what I did to detect bold text blocks). A TextPosition object
also carries the attribute fontSize. With it you should be able to
detect larger text tokens by (just a suggestion) parsing an entire
page, computing the median font size, then parsing the page again and
checking whether the fontSize of each token is above the median.
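
To make the two heuristics concrete, here is a minimal sketch in plain
Java. It deliberately does not call PDFBox: it assumes you have already
collected the base font name and the font size of each token from the
TextPosition objects as described above. The class FontFeatures and its
methods are hypothetical names for illustration, and matching on
"bold", "italic" or "oblique" anywhere in the name is a slightly looser
assumption than a strict suffix check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox. Inputs are assumed to have
// been read from TextPosition objects beforehand.
final class FontFeatures {

    private FontFeatures() {
    }

    // Heuristic: treat a font as bold if its base name mentions "bold",
    // e.g. "Helvetica-Bold" or "Times-BoldItalic".
    static boolean isBold(String baseFontName) {
        if (baseFontName == null) {
            return false;
        }
        return baseFontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    // Heuristic: italic faces are commonly named "...Italic" or "...Oblique".
    static boolean isItalic(String baseFontName) {
        if (baseFontName == null) {
            return false;
        }
        String name = baseFontName.toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    // First pass: median of all font sizes collected from one page.
    static double median(List<Float> fontSizes) {
        if (fontSizes.isEmpty()) {
            throw new IllegalArgumentException("no font sizes collected");
        }
        List<Float> sorted = new ArrayList<>(fontSizes);
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        if (sorted.size() % 2 == 1) {
            return sorted.get(mid);
        }
        // Even count: average the two middle values.
        return (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }

    // Second pass: a token counts as "large" if it is strictly above the median.
    static boolean isLarge(float fontSize, double medianSize) {
        return fontSize > medianSize;
    }
}
```

A name-based check like this will miss fonts whose names do not follow
the usual convention, so treat the result as a hint rather than a
guarantee.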

I hope this helps.

With kind regards,
Robert