Posted to dev@pdfbox.apache.org by Jinder Aujla <ji...@gmail.com> on 2013/04/30 23:25:17 UTC

Extraction of Type 3 fonts

Hi

Apologies if this is the wrong address to use. I am trying to understand
whether, and how well, PDFBox supports extracting text from a PDF document
that contains Type 3 fonts. It has taken me a while to understand the reason
behind the apparent failure in parsing.

Before I go further, I thought it would be better to ask. I also found this
ticket in JIRA, but I wasn't sure whether it is still relevant.

https://issues.apache.org/jira/browse/PDFBOX-124

I can use pdftotext. It's not completely successful, but it does extract the
text to some degree. Any guidance is greatly appreciated.

Thanks
Jinder

Re: Extraction of Type 3 fonts

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 30.04.2013 23:25, Jinder Aujla wrote:
> Hi
>
> Apologies if this is the wrong email to use. I am trying to understand if
> and how well PDFBox supports extraction of text from a pdf document that
> contains type 3 fonts. It's taken a while to understand the reason behind
> the apparent failure in parsing.
It depends on the PDF, but most likely those PDFs don't provide a mapping from
character codes to Unicode, so the text set in Type 3 fonts can't be extracted.
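
To illustrate why the mapping matters, here is a minimal toy sketch in plain
Java. It is not PDFBox code and not a real PDF parser: the class name
Type3MappingSketch, the method decode and the sample codes are all made up
for illustration. The Map stands in for a font's code-to-Unicode table (in a
PDF this is typically a ToUnicode CMap). In a Type 3 font the glyphs are just
drawing procedures, so without such a table the codes in the page content
carry no recoverable text.

```java
import java.util.Map;
import java.util.Optional;

// Toy model only: hypothetical names, not the PDFBox API.
class Type3MappingSketch {

    // Decodes raw character codes with a code-to-Unicode table.
    // Returns an empty Optional when the table is absent (null)
    // or when any code has no entry in it.
    static Optional<String> decode(int[] codes, Map<Integer, String> toUnicode) {
        if (toUnicode == null) {
            return Optional.empty();
        }
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                return Optional.empty();
            }
            text.append(unicode);
        }
        return Optional.of(text.toString());
    }

    public static void main(String[] args) {
        int[] codes = {1, 2};

        // Font that ships a mapping: the text can be recovered.
        System.out.println(decode(codes, Map.of(1, "H", 2, "i"))); // Optional[Hi]

        // Font without a mapping: nothing to extract.
        System.out.println(decode(codes, null)); // Optional.empty
    }
}
```

A real extractor has more fallbacks than this (for example glyph names from
the font's encoding), but the principle is the same: if nothing ties a code
to a character, there is no text to return.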

> Before I go further I thought it would be better to ask, in addition I did
> find this ticket in JIRA but I wasn't sure if it was still relevant.
>
> https://issues.apache.org/jira/browse/PDFBOX-124
>
> I can use pdftotext it's not completely successful but it does extract to
> some degree. Any guidance is greatly appreciated.
It is quite easy to determine whether the text of a PDF can be extracted or not.
Just perform the Adobe test [1]. If Adobe can't extract the text, PDFBox won't
be able to do it either.

> Thanks
> Jinder


BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/userguide/faq.html#no_text_extraction