Posted to user@tika.apache.org by Bogdan Kostic <bo...@web.de> on 2020/11/17 17:42:42 UTC
Getting font style and size out of PDFs
Hello,
I am using Tika to extract text from PDF documents. I want to write a heuristic to differentiate between headings and paragraphs. For this, I need the font style and size of the extracted text. Is there any way to get the font style and size using Tika? I was not able to find an option to extract this information.
Thank you in advance!
Re: Getting font style and size out of PDFs
Posted by Tim Allison <ta...@apache.org>.
There isn't currently a way to do this in Tika, but it _should_ be possible
to add. I think there's been some interest in this over the years, but
there hasn't been enough momentum to add this to Tika.
@Tilman this should be doable, right?
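
For illustration only, here is a minimal sketch of the kind of heading/paragraph heuristic described in the question, assuming the font name and size of each line were already available. Tika does not expose this today; its PDF parser is built on Apache PDFBox, whose TextPosition objects carry getFont() and getFontSizeInPt(), so that is one place the data could come from. Everything in the sketch (the HeadingHeuristic and TextLine classes, the 1.15 size ratio, the 80-character limit, and the "bold" substring test on the font name) is a hypothetical assumption, not part of Tika or PDFBox.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Tika or PDFBox.
final class HeadingHeuristic {

    /** One extracted line plus the font data the heuristic needs (assumed input). */
    static final class TextLine {
        final String text;
        final String fontName;   // e.g. "Helvetica-Bold"
        final float fontSizePt;  // font size in points

        TextLine(String text, String fontName, float fontSizePt) {
            this.text = text;
            this.fontName = fontName;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Estimates the body font size as the size that covers the most characters. */
    static float bodyFontSize(List<TextLine> lines) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (TextLine line : lines) {
            // Round to the nearest half point so 9.98 and 10.0 count as one size.
            float rounded = Math.round(line.fontSizePt * 2f) / 2f;
            charsPerSize.merge(rounded, line.text.length(), Integer::sum);
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** A line counts as a heading if it is clearly larger than body text, or short and bold. */
    static boolean isHeading(TextLine line, float bodySize) {
        boolean larger = line.fontSizePt >= bodySize * 1.15f;       // assumed threshold
        boolean bold = line.fontName.toLowerCase(Locale.ROOT).contains("bold");
        boolean shortLine = line.text.length() <= 80;               // assumed limit
        return larger || (bold && shortLine);
    }

    public static void main(String[] args) {
        List<TextLine> lines = Arrays.asList(
            new TextLine("Introduction", "Helvetica-Bold", 18f),
            new TextLine("This paragraph is ordinary body text set at ten points.", "Helvetica", 10f),
            new TextLine("A second paragraph in the same regular face and size.", "Helvetica", 10f));
        float body = bodyFontSize(lines);
        for (TextLine line : lines) {
            System.out.println((isHeading(line, body) ? "HEADING   " : "PARAGRAPH ") + line.text);
        }
    }
}
```

Font names in real PDFs are often subset-prefixed or do not mention the weight at all, so the name-based bold check is only a rough signal and the thresholds would need tuning per document set.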