Posted to users@pdfbox.apache.org by Craig Ringer <cr...@postnewspapers.com.au> on 2011/12/11 14:31:45 UTC
Merging multiple different subsets of the same font, or re-embedding
font
Hi folks
I'm new here and to pdfbox - I've been looking at it mainly because it's
used by Apache FOP to embed PDFs in other PDFs during XSL-FO
typesetting. I'm using it to produce classified advertising pages, and
I've run into a bit of a roadblock that neither Google nor searching the
FOP and PDFBox mailing lists has helped with.
My documents contain 500-1000 small PDF files embedded as form XObjects
into the master PDF file. Each of the original files has its fonts
included as an embedded subset. Since many of the documents use
different sets of glyphs from the same fonts, and the whole PDF is
copied into the new document, I land up with hundreds of copies of
common fonts like "Helvetica Bold (subset)" in the final document. A
check with Acrobat Pro suggests that over 90% of the document's size is
embedded fonts.
What I'm looking for is a way to more intelligently merge the documents
to reduce or eliminate this font duplication. I'd like to:
- Embed a whole, non-subset copy of a font if it's available locally, and
then change all references in documents I'm including as XObjects so
they refer to the new copy I've embedded (so long as the encodings
match); or even better
- As each document is embedded as an XObject into the main document,
build a list of which glyphs its embedded fonts define. Don't import the
font embeds; instead, leave a dangling indirect reference to a font we're
yet to define. When all documents are embedded, produce and embed a new
subset using a local copy of the complete font, including only the
glyphs that're actually used.
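
The bookkeeping half of the second option can be sketched in plain Java. This is only a minimal sketch: the class and method names (DeferredSubsetPlan, recordUsage, requiredGlyphs) are hypothetical and not part of PDFBox or FOP, and it assumes the glyph names each document uses have already been extracted somehow. Producing the actual subset from the resulting list is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: accumulates glyph usage per font while documents are imported. */
final class DeferredSubsetPlan {
    // font name -> union of all glyph names seen so far (sorted for stable output)
    private final Map<String, Set<String>> glyphsByFont = new LinkedHashMap<>();

    /** Call once per imported document and font, with the glyph names that document uses. */
    void recordUsage(String fontName, Set<String> glyphNames) {
        glyphsByFont.computeIfAbsent(fontName, k -> new TreeSet<>()).addAll(glyphNames);
    }

    /** After the last import: the glyph list that each single merged subset must cover. */
    Map<String, Set<String>> requiredGlyphs() {
        return glyphsByFont;
    }
}
```

Once every document has been recorded, each entry of requiredGlyphs() describes one subset to build and embed, and the placeholder font references can then be pointed at it.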
Better again would be to extract all the embedded subsets and *combine*
them, so I wouldn't need a local copy of the font. That's probably way
too hard, though.
I realise that I can never de-duplicate embedded subsets with different
encodings. If there's "Helvetica Black" embedded 3 times, once each in
WinAnsi, MacRoman and a custom encoding, there's no possible reduction
without re-encoding the content streams, which is WAY beyond what I want
to tackle. All I'm interested in is improving the case of 100 copies of
"Helvetica Black (subset)" in WinAnsi, which I want to reduce to one
slightly bigger embedded subset covering all the same glyphs or, failing
that, a complete copy of the font.
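
To find which embedded copies fall into that mergeable case, one rough approach is to group fonts by base name and encoding. The sketch below relies on the PDF convention that a subset font's name starts with a tag of six uppercase letters and a plus sign (for example "ABCDEF+Helvetica-Black"). The class SubsetGrouping and the {fontName, encodingName} input rows are assumptions made for illustration; they are not a PDFBox API.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical helper: groups embedded subset fonts that could share one merged subset. */
final class SubsetGrouping {
    // Subset tag: exactly six uppercase letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    /** Strips the subset tag, if present, leaving the underlying font name. */
    static String baseFontName(String pdfFontName) {
        return SUBSET_TAG.matcher(pdfFontName).replaceFirst("");
    }

    /** Counts embedded copies per "baseName / encoding" key; each row is {fontName, encodingName}. */
    static Map<String, Integer> countMergeable(String[][] fonts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] font : fonts) {
            counts.merge(baseFontName(font[0]) + " / " + font[1], 1, Integer::sum);
        }
        return counts;
    }
}
```

Any key with a count above one is a candidate for replacement by a single subset; copies of the same font under different encodings land under different keys and stay separate.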
Ideas? Is this completely insane, or possibly practical?
The docs for PDFBox offer nearly zero information on its font APIs, so I
presume I need to go delving directly into the PDF font data structures
to do any of this. I know the PDF format's low level structure quite
well, but know nearly nothing about the embedded font formats or their
encodings, so I'm *really* hoping PDFBox offers some helpers for fonts
that just aren't referenced in the docs. Any tips?
Is there anything built-in for creating custom font subsets given a
glyph list? For unembedding fonts?
Anybody tried anything like this already?
Tips/suggestions?
--
Craig Ringer
POST Newspapers
276 Onslow Rd, Shenton Park
Ph: 08 9381 3088 Fax: 08 9388 2258
ABN: 50 008 917 717
http://www.postnewspapers.com.au/
Re: Merging multiple different subsets of the same font, or re-embedding
font
Posted by Craig Ringer <ri...@ringerc.id.au>.
On 12/12/2011 04:05 AM, mehdi houshmand wrote:
> Hi Craig,
>
> We're looking into this exact same problem, I'll let you know if
> anything comes of it.
>
That's a handy coincidence.
When you say "exactly" the same problem - is your work connected with
Apache FOP output of embedded PDFs via Jeremias's fop-pdf-images
extension (http://www.jeremias-maerki.ch/download/fop/pdf-images/) too?
Or do you just mean that you're interested in de-duplicating and merging
embedded subset fonts in general?
Have you made any progress since you started looking? Any avenues you've
already ruled out?
What's the context you're interested in this for? Mine is a classified
pagination application I'm developing in-house for my newspaper employer
and will be releasing under an open source license (exact license yet to
be determined) once the kinks are worked out or sooner if anyone's
interested in it. It can use EPS image resources for PostScript output
(to PDF via Distiller), but I'd prefer to produce native PDF if I can
fix this font issue.
I'm going to be looking through the pdfbox and fontbox sources to see
what sort of font handling code there already is. I'm particularly going
to be searching for anything that parses and understands embedded fonts,
as being able to easily determine which glyphs are already defined in an
embedded subset would be a big help if I want to re-embed with a new
subset rather than a whole font.
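
If parsing the embedded font programs turns out to be impractical, a different and cruder technique may be enough for simple single-byte fonts: record which character codes the content streams actually show. The sketch below assumes the shown strings for one font have already been pulled out of the content stream as byte arrays; the class UsedCodes is hypothetical, and mapping codes to glyphs through the font's encoding is left out.

```java
import java.util.BitSet;

/** Hypothetical helper: tracks which single-byte character codes a font is used with. */
final class UsedCodes {
    /** Marks every character code (0..255) that appears in the given shown strings. */
    static BitSet collect(byte[]... shownStrings) {
        BitSet used = new BitSet(256);
        for (byte[] shown : shownStrings) {
            for (byte b : shown) {
                used.set(b & 0xFF); // Java bytes are signed; mask to get 0..255
            }
        }
        return used;
    }
}
```

For fonts that share an encoding, OR-ing the BitSets from every document gives the set of codes a merged subset would need to cover.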
Any info on what you've already done to avoid duplicating work would be
very handy.
--
Craig Ringer
Re: Merging multiple different subsets of the same font, or
re-embedding font
Posted by mehdi houshmand <me...@gmail.com>.
Hi Craig,
We're looking into this exact same problem, I'll let you know if
anything comes of it.
Mehdi