Posted to dev@pdfbox.apache.org by Cornelis Hoeflake <c....@postex.com> on 2016/03/10 21:20:04 UTC

Fontmapper for non render operations

Hi,

When we use, for example, PDFTextStripperByArea, is it required in that case
to have all non-embedded fonts? Could we use a default (fallback) font
instead of providing the correct fonts? Now we have a global (ThreadLocal)
font provider which is used for rendering and tasks like position-based
text extraction. But skipping the font provider for text-based text
extraction would simplify our code.

Kind regards,
Cornelis

Re: Fontmapper for non render operations

Posted by Cornelis Hoeflake <c....@postex.com>.
2016-03-10 21:52 GMT+01:00 John Hewson <jo...@jahewson.com>:

>
> > On 10 Mar 2016, at 12:20, Cornelis Hoeflake <c....@postex.com> wrote:
> >
> > Hi,
> >
> > When we use, for example, PDFTextStripperByArea, is it required in that case
> > to have all non-embedded fonts? Could we use a default (fallback) font
> > instead of providing the correct fonts? Now we have a global (ThreadLocal)
> > font provider which is used for rendering and tasks like position-based
> > text extraction. But skipping the font provider for text-based text
> > extraction would simplify our code.
>
> All fonts used in a PDF are supposed to embed their widths, whether or not the
> font file itself gets embedded. However, sometimes they don’t, and then you
> need the missing font to get an accurate text extraction. But for well-formed
> PDFs you don’t need the fonts.
>

Thanks. Just to be sure, we will continue using the font provider for text
extraction purposes.


> — John
>
> > Kind regards,
> > Cornelis
>
>

Re: Fontmapper for non render operations

Posted by John Hewson <jo...@jahewson.com>.
> On 10 Mar 2016, at 12:20, Cornelis Hoeflake <c....@postex.com> wrote:
> 
> Hi,
> 
> When we use, for example, PDFTextStripperByArea, is it required in that case
> to have all non-embedded fonts? Could we use a default (fallback) font
> instead of providing the correct fonts? Now we have a global (ThreadLocal)
> font provider which is used for rendering and tasks like position-based
> text extraction. But skipping the font provider for text-based text
> extraction would simplify our code.

All fonts used in a PDF are supposed to embed their widths, whether or not the
font file itself gets embedded. However, sometimes they don’t, and then you
need the missing font to get an accurate text extraction. But for well-formed
PDFs you don’t need the fonts.

— John

> Kind regards,
> Cornelis


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: Fontmapper for non render operations

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 10.03.2016 um 21:20 schrieb Cornelis Hoeflake:
> Hi,
>
> When we use, for example, PDFTextStripperByArea, is it required in that case
> to have all non-embedded fonts? Could we use a default (fallback) font
> instead of providing the correct fonts? Now we have a global (ThreadLocal)
> font provider which is used for rendering and tasks like position-based
> text extraction. But skipping the font provider for text-based text
> extraction would simplify our code.

That (using replacement fonts) is done a lot in the 1.8 versions, which is
why some files are not extracted correctly: the replacement glyphs differ in
size. Characters end up at the wrong positions because the replacement font
has different advance widths.
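
To illustrate why differing advance widths move characters, here is a
minimal sketch in plain Java. It is not PDFBox code: the class
AdvanceDrift, the method penPositions and both width arrays are made up
for the example. It assumes widths are given in glyph space (1/1000 of
text space, as in a PDF /Widths array) and ignores character spacing,
word spacing and horizontal scaling.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper, not part of PDFBox. */
class AdvanceDrift {

    /**
     * Returns the x position at which each glyph starts, followed by the
     * final pen position after the last glyph.
     * Widths are in glyph space units (1/1000 of text space).
     */
    static double[] penPositions(int[] widths, double fontSize) {
        double[] xs = new double[widths.length + 1];
        double x = 0.0;
        for (int i = 0; i < widths.length; i++) {
            xs[i] = x;
            // Advance the pen by this glyph's width, scaled to the font size.
            x += widths[i] / 1000.0 * fontSize;
        }
        xs[widths.length] = x;
        return xs;
    }

    public static void main(String[] args) {
        // Assumed widths for the same four characters in two different fonts.
        int[] declared = {600, 600, 600, 600};
        int[] fallback = {500, 278, 556, 500};
        double fontSize = 12.0;

        System.out.println(Arrays.toString(penPositions(declared, fontSize)));
        System.out.println(Arrays.toString(penPositions(fallback, fontSize)));
    }
}
```

The two printed arrays diverge more with every glyph, so an area-based
extraction that relies on the fallback widths can place later characters
of a line outside the region they really belong to.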

Tilman
