Posted to users@pdfbox.apache.org by Craig Ringer <cr...@postnewspapers.com.au> on 2011/12/11 14:31:45 UTC

Merging multiple different subsets of the same font, or re-embedding font

Hi folks

I'm new here and to pdfbox - I've been looking at it mainly because it's 
used by Apache FOP to embed PDFs in other PDFs during XSL-FO 
typesetting. I'm using it to produce classified advertising pages, and 
I've run into a bit of a roadblock that neither Google nor searching the 
fop and pdfbox mailing lists has helped with.

My documents contain 500-1000 small PDF files embedded as form XObjects 
into the master PDF file. Each of the original files has its fonts 
included as an embedded subset. Since many of the documents use 
different sets of glyphs from the same fonts, and the whole PDF is 
copied into the new document, I land up with hundreds of copies of 
common fonts like "Helvetica Bold (subset)" in the final document. A 
check with Acrobat Pro suggests that over 90% of the document's size is 
embedded fonts.
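
To put a number on that duplication, here is a minimal sketch that counts how many copies of each font end up in the merged file. It assumes the /BaseFont names have already been collected from every font dictionary (that extraction step is not shown), and it relies on the PDF convention that a subset font's name starts with a six-uppercase-letter tag followed by "+". The class and method names are invented for illustration; they are not PDFBox APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical helper: not part of PDFBox or FontBox.
final class SubsetFontCounter {
    // A subset tag is exactly six uppercase ASCII letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // "ABCDEF+Helvetica-Bold" -> "Helvetica-Bold"; untagged names pass through.
    static String stripSubsetTag(String baseFontName) {
        return SUBSET_TAG.matcher(baseFontName).replaceFirst("");
    }

    // Counts how many font objects share each untagged base font name.
    static Map<String, Integer> countCopies(List<String> baseFontNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : baseFontNames) {
            counts.merge(stripSubsetTag(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample names standing in for values read from a real PDF.
        List<String> names = Arrays.asList(
                "ABCDEF+Helvetica-Bold",
                "GHIJKL+Helvetica-Bold",
                "MNOPQR+Times-Roman");
        System.out.println(countCopies(names));
    }
}
```

Two subsets with the same untagged name are only candidates for merging; the count says nothing about whether their encodings match.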

What I'm looking for is a way to more intelligently merge the documents 
to reduce or eliminate this font duplication. I'd like to:

- Embed a whole, non-subset copy of a font if it's available locally, and 
then change all references in documents I'm including as XObjects so 
they refer to the new copy I've embedded (so long as the encodings 
match); or even better

- As each document is embedded as an XObject into the main document, 
build a list of which glyphs its embedded fonts define. Don't import the 
font embeds, instead leave a dangling indirect reference to a font we're 
yet to define. When all documents are embedded, produce and embed a new 
subset using a local copy of the complete font, including only the 
glyphs that're actually used.

Better again would be to extract all the embedded subsets and *combine* 
them, so I wouldn't need a local copy of the font. That's probably way 
too hard, though.

I realise that I can never de-duplicate embedded subsets with different 
encodings. If there's "Helvetica Black" embedded 3 times, once each in 
WinAnsi, MacRoman and a custom encoding, there's no possible reduction 
without re-encoding the content streams, which is WAY beyond what I want 
to tackle. All I'm interested in is improving the case of 100 copies of 
"Helvetica Black (subset)" in WinAnsi, which I want to reduce to one 
slightly bigger embedded subset covering all the same glyphs or failing 
that a complete copy of the font.
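
The bookkeeping side of this can be sketched without touching any font data. The following is only an illustration under assumptions: it presumes that the glyph names defined by each embedded subset have somehow been obtained already (the hard part, not shown), and it groups subsets by untagged font name plus encoding name so that subsets with different encodings are never merged. GlyphUnion and its methods are hypothetical names, not PDFBox or FontBox APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical accumulator: not part of PDFBox or FontBox.
final class GlyphUnion {
    // Key is "fontName|encoding"; value is the union of glyph names seen so far.
    private final Map<String, Set<String>> glyphsByFont = new TreeMap<>();

    // Drops the six-letter subset tag so all subsets of one font share a key.
    static String key(String baseFontName, String encoding) {
        String untagged = baseFontName.replaceFirst("^[A-Z]{6}\\+", "");
        return untagged + "|" + encoding;
    }

    // Records the glyphs that one embedded subset defines.
    void addSubset(String baseFontName, String encoding, Iterable<String> glyphNames) {
        Set<String> merged = glyphsByFont.computeIfAbsent(
                key(baseFontName, encoding), k -> new TreeSet<>());
        for (String glyph : glyphNames) {
            merged.add(glyph);
        }
    }

    // Glyphs a single replacement subset would have to cover for this group.
    Set<String> glyphsFor(String baseFontName, String encoding) {
        return glyphsByFont.getOrDefault(
                key(baseFontName, encoding), Collections.<String>emptySet());
    }

    // Number of fonts that would remain after merging.
    int groupCount() {
        return glyphsByFont.size();
    }

    public static void main(String[] args) {
        GlyphUnion union = new GlyphUnion();
        // Made-up glyph lists standing in for data parsed from real subsets.
        union.addSubset("ABCDEF+Helvetica-Black", "WinAnsiEncoding", Arrays.asList("A", "B"));
        union.addSubset("GHIJKL+Helvetica-Black", "WinAnsiEncoding", Arrays.asList("B", "C"));
        union.addSubset("MNOPQR+Helvetica-Black", "MacRomanEncoding", Arrays.asList("A"));
        System.out.println(union.groupCount() + " fonts after merging");
        System.out.println(union.glyphsFor("Helvetica-Black", "WinAnsiEncoding"));
    }
}
```

Keying on the encoding name alone is a simplification: a custom encoding built from a /Differences array would need the whole code-to-glyph mapping compared, not just a name.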

Ideas? Is this completely insane, or possibly practical?

The docs for PDFBox offer nearly zero information on its font APIs, so I 
presume I need to go delving directly into the PDF font data structures 
to do any of this. I know the PDF format's low level structure quite 
well, but know nearly nothing about the embedded font formats or their 
encodings, so I'm *really* hoping PDFBox offers some helpers for fonts 
that just aren't referenced in the docs. Any tips?

Is there anything built-in for creating custom font subsets given a 
glyph list? For unembedding fonts?

Anybody tried anything like this already?

Tips/suggestions?

--
Craig Ringer

POST Newspapers
276 Onslow Rd, Shenton Park
Ph: 08 9381 3088     Fax: 08 9388 2258
ABN: 50 008 917 717
http://www.postnewspapers.com.au/

Re: Merging multiple different subsets of the same font, or re-embedding font

Posted by Craig Ringer <ri...@ringerc.id.au>.
On 12/12/2011 04:05 AM, mehdi houshmand wrote:
> Hi Craig,
>
> We're looking into this exact same problem, I'll let you know if
> anything comes of it.
>
That's a handy coincidence.

When you say "exactly" the same problem - is your work connected with 
Apache FOP output of embedded PDFs via Jeremias's fop-pdf-images 
extension (http://www.jeremias-maerki.ch/download/fop/pdf-images/) too?

Or do you just mean that you're interested in de-duplicating and merging 
embedded subset fonts in general?

Have you made any progress since you started looking? Any avenues you've 
already ruled out?

What's the context you're interested in this for? Mine is a classified 
pagination application I'm developing in-house for my newspaper employer 
and will be releasing under an open source license (exact license yet to 
be determined) once the kinks are worked out or sooner if anyone's 
interested in it. It can use EPS image resources for PostScript output 
(to PDF via Distiller), but I'd prefer to produce native PDF if I can 
fix this font issue.

I'm going to be looking through the pdfbox and fontbox sources to see 
what sort of font handling code there already is. I'm particularly going 
to be searching for anything that parses and understands embedded fonts, 
as being able to easily determine which glyphs are already defined in an 
embedded subset would be a big help if I want to re-embed with a new 
subset rather than a whole font.

Any info on what you've already done to avoid duplicating work would be 
very handy.

--
Craig Ringer

Re: Merging multiple different subsets of the same font, or re-embedding font

Posted by mehdi houshmand <me...@gmail.com>.
Hi Craig,

We're looking into this exact same problem, I'll let you know if
anything comes of it.

Mehdi
