Posted to users@pdfbox.apache.org by André Ramos <an...@gmail.com> on 2009/07/07 19:48:40 UTC
Extracting Features From Text
Hello,
I'd like to use PDFBox to extract text with special features like: bold
text, italicized text, text whose font size is above average and so on. The
idea is that any kind of highlighted text or any text formatted out of the
ordinary within a document must contain relevant terms to describe the
document.
How can I do it?
Thank you.
--
Best regards,
André Ramos
Re: Extracting Features From Text
Posted by Robert Pesch <rp...@scai.fraunhofer.de>.
Hello André,
Have a look at PDFTextStripper. It collects tokens (so-called
TextPositions) from a given document. A TextPosition object has a
method called getFont, which returns the font object encapsulating the
font information for the current token. What you can do is retrieve
the base font name from the font object (the PostScript name of the
font) and check whether it ends with a suffix such as -bold (this is
at least what I did to detect bold text blocks). Note that this only
works when the font name actually encodes the style. A TextPosition
object also has a fontSize attribute. With it you should be able to
detect larger text tokens by (just a suggestion) parsing an entire
page, computing the median font size, then parsing the page again and
checking whether the fontSize of each token is above the median.
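
As a rough illustration of the two heuristics above, here is a minimal
sketch in plain Java. It does not call PDFBox at all: the Token class is
a hypothetical stand-in for the values you would read from each
TextPosition (the base font name via getFont, and the fontSize), and the
names FontFeatures, isBold, isItalic, medianFontSize and emphasized are
made up for this example. The name check uses "contains" rather than a
strict suffix test so that names like Helvetica-BoldOblique also match;
that is an assumption about how the fonts are named, not a guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class FontFeatures {

    // Hypothetical stand-in for one TextPosition: its text, the base
    // (PostScript) font name, and the font size.
    static final class Token {
        final String text;
        final String fontName;
        final float fontSize;

        Token(String text, String fontName, float fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    // Name-based heuristic: true if the font name mentions "bold".
    static boolean isBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    // Name-based heuristic: true if the font name mentions "italic" or "oblique".
    static boolean isItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String name = fontName.toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    // First pass: median font size over all tokens of a page.
    static float medianFontSize(List<Token> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("no tokens");
        }
        List<Float> sizes = new ArrayList<>();
        for (Token t : tokens) {
            sizes.add(t.fontSize);
        }
        Collections.sort(sizes);
        int mid = sizes.size() / 2;
        return sizes.size() % 2 == 1
                ? sizes.get(mid)
                : (sizes.get(mid - 1) + sizes.get(mid)) / 2f;
    }

    // Second pass: keep tokens that are bold, italic, or larger than the median.
    static List<Token> emphasized(List<Token> page) {
        float median = medianFontSize(page);
        List<Token> result = new ArrayList<>();
        for (Token t : page) {
            if (isBold(t.fontName) || isItalic(t.fontName) || t.fontSize > median) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> page = Arrays.asList(
                new Token("Introduction", "Helvetica-Bold", 18f),
                new Token("This", "Helvetica", 10f),
                new Token("is", "Helvetica", 10f),
                new Token("important", "Helvetica-Oblique", 10f),
                new Token("text", "Helvetica", 10f));
        for (Token t : emphasized(page)) {
            System.out.println(t.text + " (" + t.fontName + ", " + t.fontSize + ")");
        }
    }
}
```

In a real program the Token list would be filled from the TextPositions
that PDFTextStripper collects for a page, instead of the hard-coded
sample in main.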
I hope this helps.
With kind regards,
Robert