Posted to user@tika.apache.org by Bogdan Kostic <bo...@web.de> on 2020/11/17 17:42:42 UTC

Getting font style and size out of PDFs

Hello,

I am using Tika to extract text from PDF documents. I want to write a heuristic that differentiates between headings and paragraphs, and for this I need the font style and size of the extracted text. Is there any way to get font style and size using Tika? I was not able to find an option to extract this information.

Thank you in advance!

Re: Getting font style and size out of PDFs

Posted by Tim Allison <ta...@apache.org>.
There isn't currently a way to do this in Tika, but it _should_ be possible
to add.  I think there's been some interest in this over the years, but
there hasn't been enough momentum to add this to Tika.

@Tilman this should be doable, right?
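
For reference, the heading-versus-paragraph heuristic described in the question could be sketched roughly as below, once font name and size are available per line of text. This is only an illustration under stated assumptions: the TextRun class, the 1.2 size ratio, the 80-character limit, and the "bold in the font name" check are all hypothetical choices, not anything Tika provides. Since Tika does not expose font data, the values would have to come from elsewhere, for example from PDFBox directly (the library Tika's PDF parser builds on), where TextPosition offers getFont() and getFontSizeInPt().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class HeadingHeuristic {

    /** Hypothetical container: one line of text plus its font name and size in points. */
    static final class TextRun {
        final String text;
        final String fontName;
        final float fontSizePt;

        TextRun(String text, String fontName, float fontSizePt) {
            this.text = text;
            this.fontName = fontName;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Returns the font size that covers the most characters; assumed to be the body size. */
    static float bodyFontSize(List<TextRun> runs) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (TextRun run : runs) {
            charsPerSize.merge(run.fontSizePt, run.text.length(), Integer::sum);
        }
        float bestSize = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> entry : charsPerSize.entrySet()) {
            if (entry.getValue() > bestCount) {
                bestCount = entry.getValue();
                bestSize = entry.getKey();
            }
        }
        return bestSize;
    }

    /**
     * Assumed rule: a run is a heading if it is clearly larger than the body text,
     * or if it is short, bold (judged by the font name), and at least body size.
     */
    static boolean isHeading(TextRun run, float bodySize) {
        boolean clearlyLarger = run.fontSizePt >= bodySize * 1.2f;
        boolean bold = run.fontName.toLowerCase(Locale.ROOT).contains("bold");
        boolean shortBoldLine = bold && run.text.length() < 80 && run.fontSizePt >= bodySize;
        return clearlyLarger || shortBoldLine;
    }

    public static void main(String[] args) {
        List<TextRun> runs = Arrays.asList(
            new TextRun("Introduction", "Helvetica-Bold", 16f),
            new TextRun("This is a longer paragraph of ordinary body text.", "Helvetica", 10f),
            new TextRun("A second paragraph in the same body font follows.", "Helvetica", 10f));
        float bodySize = bodyFontSize(runs);
        for (TextRun run : runs) {
            String label = isHeading(run, bodySize) ? "HEADING" : "PARAGRAPH";
            System.out.println(label + ": " + run.text);
        }
    }
}
```

The thresholds would need tuning per document set; font names in real PDFs often carry subset prefixes or omit "Bold" entirely, so the name check is a weak signal at best.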
