Posted to fop-dev@xmlgraphics.apache.org by mehdi houshmand <me...@gmail.com> on 2012/03/06 11:49:49 UTC

Fwd: fop-pdf-image and fonts; as requested

Seems I didn't forward this one to fop-dev either... My apologies.


---------- Forwarded message ----------
From: mehdi houshmand <me...@gmail.com>
Date: 29 February 2012 09:41
Subject: Re: fop-pdf-image and fonts; as requested
To: Craig Ringer <ri...@ringerc.id.au>


Hi Craig,

We had this exact same problem the last time you brought this issue to
light and our approach was slightly different. Let me first ask you
the question: are you 100% sure that fonts are the issue here?

When the pdf-image-plugin is used, ALL pdf-images are imported
wholesale, creating a new XObject Form for each page. Now, this works
perfectly fine for smaller documents; however, it can blow the memory
stack on RIPs for larger docs. The reason is that XObjects are treated
as global resources of the PDF, so it is possible to create the
XObject once and use it multiple times. However, this means that each
XObject and its resources are being stored in memory on the RIP.

This is different to how a RIP can handle a /Page object. When
printing/rendering a /Page object, the RIP only needs the page's
content stream and any resources it references in memory. Once the
page is rendered, the memory can be cleared. When PDFBox merges docs,
it doesn't use the XObject Form, it does so by appending /Page
objects. This is the solution we came to, just adding a PDFBox merger
to the pipeline.

So with that in mind, what exactly are you trying to do? Why are you
using FOP to merge PDFs? Do you need FOP to do this work? Have you
tried merging PDFs with PDFBox and seeing how that affects the RIP?

I've probably got more questions than answers there, but hopefully we
can get to a solution.

Mehdi

On 29 February 2012 04:09, Craig Ringer <ri...@ringerc.id.au> wrote:
> Hi
>
> As requested by Mehdi Houshmand I'm elaborating on the issue we've been
> running into with fop-pdf-image. I've asked about aspects of it on the
> list before, but now have a better understanding of what's going on.
>
> Where input pdfs being used as form XObjects contain embedded subset
> fonts, I'm seeing many copies of those fonts being embedded in the
> output document. This creates huge output files with lots of duplicate
> font data, and in a few cases has even crashed the RIP used by my work's
> offset press printer. I think they use a Fiery, but struggle to get any
> more info than that out of them.
>
> The issue is that fop-pdf-image copies PDFs into fop output PDFs by
> copying the content stream and resources dictionary verbatim from the
> page being extracted from the input PDF, translating it from PDFBox into
> fop PDF structures in the process. This is extremely reliable, ensuring
> that fop-pdf-image form XObjects don't conflict with / interfere with
> the embedding page or vice versa. Unfortunately it also leads to massive
> duplication of data, including:
>
> - Fonts, both subsets and fully embedded fonts
> - Embedded ICC profiles, if present
> - Images re-used across multiple pages or documents
>
> In the case of images, ICC profiles, and fully embedded fonts it'd
> potentially be relatively easy to coalesce these so that all resources
> dictionaries refer to the same object. It's a little hacky because fop
> doesn't give image plugins any "official" way to store data about a
> rendering run for later reference, but it's easy enough to do by storing
> a WeakHashMap<FOUserAgent,...> associating object type and checksum data
> with a particular rendering run. I haven't implemented coalescing of
> images and profiles because it's not part of my problem space, but it
> shouldn't be too hard.
>
> Unfortunately, the above approach doesn't work for our problem, which is
> duplicated *subset* fonts. There are 20 or 30 copies of Helvetica
> Regular alone in one of our typical runs, with a mixture of MacRoman,
> Custom and WinAnsi encodings. They're drawn from the same two or three
> copies of Helvetica from different sources, but each subset has a
> different (though largely overlapping) glyph set. Fop-pdf-image
> correctly but rather sub-optimally copies each subset and references it
> from the associated Form XObject, creating working output but lots of
> wasted space and duplication. We can't just write the font out the first
> time we see it and adjust all future references to the copy we've
> already written, because unlike with ICC profiles and repeatedly used
> images each copy is different.
>
> I see two possible solutions to this problem. Both have the same
> pre-requisites:
>
> (1) A mechanism for image plugins to keep plugin-specific data
> associated with a specific rendering run. A WeakHashMap<FOUserAgent,...>
> works for this, though it isn't pretty.
>
> (2) Code in the image plugin to record each use of each font and group
> usages up into compatible groups so all font references in the group can
> point to the same font in the output. This code can also collect up
> glyph usage information, producing a map of which glyphs are required by
> one or more content streams.
>
> (3) A way to create a new embedded font in the output, either by
> combining input subsets into a single new subset font object or by
> loading a whole font off the HDD and making a new subset with just the
> required glyphs from it.
>
> (4) Some way to be notified, at minimum, just before the xref table is
> going to be written out, so the new font can be written to the output
> stream. The new font can't be written until we know the last embedded
> PDF has been written out, because a later PDF might use additional
> glyphs that must be added to the subset.
>
> (5) [Optional but useful] Smarter font loading where more than just
> (family, weight, slant) 3-tuples are used to match fonts, so I can use
> fop's font loading and cache code to see whether there's a whole font
> available to fop that can be substituted for an embedded subset. For
> example, I might need to match Myriad Pro Ultrabold Italic SemiCond, a
> small caps variant face, or similar with no confusion between different
> condensed/expanded versions of the same face, different specialist
> variants, etc. Right now fop's font matching code simply cannot do that,
> so I can't really create new font subsets as an alternative for (3) and
> have to try to combine subsets from the input instead.
>
>
> I have (1) working and I have a prototype of (2) that dumps font usage
> data for a run including a glyph usage map. I was trying to avoid (3)
> for Base14 fonts by just replacing the Resources reference to the font
> with a base14 font ref, but PDF readers seem to choke on this for
> reasons I haven't yet determined.
>
> (4) is the big problem. I can't do a proper implementation of (3)
> without some way to write the produced font out at the end.
>
> For (4) I'd really appreciate advice from the fop community. I need a
> way for a plugin to hook into output just before the xref table is
> written, so it can write new objects to the pdf stream. The object
> numbers for the fonts to be written out will have been reserved the
> first time that font was seen, I just need to write the data out and
> record the offset for the xref entry. As the data to be written is not
> known until the last embedded form XObject is known to have been
> written, the hook must be before xref write-out.
>
> To resolve (5), the whole FontTriplet assumption must be ripped out of
> the code and replaced with a more flexible representation of font info
> that is at minimum (Family, Weight, Slant, Expand/Condense amount,
> Variant) and probably needs an extensible map of additional matching
> characteristics for future-proofing too. This doesn't look like a fun
> thing to do!
>
>
> Right now, an answer to (4) would give me a chance to progress on
> de-duplicating fonts by attempting to combine subsets. I don't know if
> I'll be able to successfully combine subsets together, but I have more
> of a chance of that than I do making new subsets when I can't match
> fonts reliably enough.
>
> Ideas?
> --
> Craig Ringer
>
>
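
Prerequisite (2) in the quoted message (grouping font uses and collecting
a glyph usage map) might look roughly like the sketch below. It is only an
illustration: FontUsageTracker, its grouping key (base font name plus
encoding name) and the use of plain integer glyph codes are assumptions
made for the example, not part of fop-pdf-image or PDFBox. The only PDF
fact it relies on is that subset font names carry a tag of six uppercase
letters followed by a plus sign. Matching on name and encoding alone is
probably too weak for real input, where same-named fonts can come from
different sources.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: records which glyph codes each group of same-named fonts needs. */
class FontUsageTracker {
    // Key: base font name + "/" + encoding; value: union of glyph codes seen so far.
    private final Map<String, SortedSet<Integer>> glyphsByGroup = new TreeMap<>();

    /** Strips a subset tag such as "ABCDEF+" from a PDF font name, if present. */
    static String baseName(String pdfFontName) {
        if (pdfFontName.matches("[A-Z]{6}\\+.+")) {
            return pdfFontName.substring(7);
        }
        return pdfFontName;
    }

    /** Records one use of a font by an embedded page and merges its glyph codes into the group. */
    void recordUse(String pdfFontName, String encoding, int... glyphCodes) {
        String key = baseName(pdfFontName) + "/" + encoding;
        SortedSet<Integer> glyphs = glyphsByGroup.computeIfAbsent(key, k -> new TreeSet<>());
        for (int code : glyphCodes) {
            glyphs.add(code);
        }
    }

    /** Returns the glyph usage map collected so far, keyed by group. */
    Map<String, SortedSet<Integer>> usage() {
        return glyphsByGroup;
    }

    public static void main(String[] args) {
        FontUsageTracker tracker = new FontUsageTracker();
        // Two overlapping WinAnsi subsets end up in one group; the MacRoman one stays separate.
        tracker.recordUse("ABCDEF+Helvetica", "WinAnsiEncoding", 65, 66);
        tracker.recordUse("GHIJKL+Helvetica", "WinAnsiEncoding", 66, 67);
        tracker.recordUse("MNOPQR+Helvetica", "MacRomanEncoding", 65);
        System.out.println(tracker.usage());
    }
}
```

The per-group glyph sets are what a later step would need in order to
build one merged subset per group, which is why the font cannot be written
until the last embedded PDF has been seen.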

Re: Fwd: fop-pdf-image and fonts; as requested

Posted by Craig Ringer <ri...@ringerc.id.au>.

On 03/08/2012 01:25 PM, mehdi houshmand wrote:
> Haha, well the shortest answer I can give is "kinda".
>
> SVG uses Batik, which in turn uses the AWT font classes. Long story
> short, you have to install the font on the system as well as having it
> in the fop.xconf. There are plenty of discussions on this on the
> mailing lists for you to peruse at your leisure.

Thanks. For my purposes, that's a no, since I need to support any 
embedded font whether or not I have access to a complete copy.

BTW, I've modified PDFBox's Overlay.java application to support 
translation, scaling and rotation of the overlay. Once I've cleaned up 
the resource renaming code I'll be able to plug the approach into 
fop-pdf-image - hopefully with few hassles. fop-pdf-image already 
contains most of the pdfbox-to-fop adapter code required.

This won't help with the duplicate fonts (and thus won't help with file 
size) but it might help with the RIP crashes. Here's hoping.

See the PDFBox-dev mailing list for the Overlay.java patch.

--
Craig Ringer

Re: Fwd: fop-pdf-image and fonts; as requested

Posted by mehdi houshmand <me...@gmail.com>.

Haha, well the shortest answer I can give is "kinda".

SVG uses Batik, which in turn uses the AWT font classes. Long story
short, you have to install the font on the system as well as having it
in the fop.xconf. There are plenty of discussions on this on the
mailing lists for you to peruse at your leisure.

Mehdi


On 8 March 2012 02:17, Craig Ringer <ri...@ringerc.id.au> wrote:
> On 08/03/12 04:12, Vincent Hennebert wrote:
>> Just my 2 cents on a particular detail...
>>
>> On 07/03/12 07:51, Craig Ringer wrote:
>>> On 06/03/12 18:49, mehdi houshmand wrote:
>> <snip/>
>>>> So with that in mind, what exactly are you trying to do? Why are you
>>>> using FOP to merge PDFs?
>>> I'm using FOP to produce documents containing a mixture of automatically
>>> typeset formatted text and graphics. Many of the graphics are PDF
>>> documents, and need to be PDF documents because they contain vector
>>> artwork and text that would lose quality and grow massively in size if
>>> embedded in rasterised form.
>>
>> Is SVG an option for you? That might save you a lot of trouble. Or if
>> not readily available, that might still be less work.
>
> Alas, SVG isn't an option. We have a large body of work already in PDF
> (and EPS) format that we can't easily convert to SVG.
>
> Until I checked just now I didn't know that SVG even supported embedded
> fonts. Does fop actually support that and include embedded SVG fonts in
> output PDF?
>
> --
> Craig Ringer
>

Re: Fwd: fop-pdf-image and fonts; as requested

Posted by Craig Ringer <ri...@ringerc.id.au>.

On 08/03/12 04:12, Vincent Hennebert wrote:
> Just my 2 cents on a particular detail...
> 
> On 07/03/12 07:51, Craig Ringer wrote:
>> On 06/03/12 18:49, mehdi houshmand wrote:
> <snip/>
>>> So with that in mind, what exactly are you trying to do? Why are you
>>> using FOP to merge PDFs?
>> I'm using FOP to produce documents containing a mixture of automatically
>> typeset formatted text and graphics. Many of the graphics are PDF
>> documents, and need to be PDF documents because they contain vector
>> artwork and text that would lose quality and grow massively in size if
>> embedded in rasterised form.
> 
> Is SVG an option for you? That might save you a lot of trouble. Or if
> not readily available, that might still be less work.

Alas, SVG isn't an option. We have a large body of work already in PDF
(and EPS) format that we can't easily convert to SVG.

Until I checked just now I didn't know that SVG even supported embedded
fonts. Does fop actually support that and include embedded SVG fonts in
output PDF?

--
Craig Ringer

Re: Fwd: fop-pdf-image and fonts; as requested

Posted by Vincent Hennebert <vh...@gmail.com>.

Just my 2 cents on a particular detail...

On 07/03/12 07:51, Craig Ringer wrote:
> On 06/03/12 18:49, mehdi houshmand wrote:
<snip/>
>> So with that in mind, what exactly are you trying to do? Why are you
>> using FOP to merge PDFs?
> I'm using FOP to produce documents containing a mixture of automatically
> typeset formatted text and graphics. Many of the graphics are PDF
> documents, and need to be PDF documents because they contain vector
> artwork and text that would lose quality and grow massively in size if
> embedded in rasterised form.

Is SVG an option for you? That might save you a lot of trouble. Or if
not readily available, that might still be less work.

<snip/>

Vincent

Re: Fwd: fop-pdf-image and fonts; as requested

Posted by Craig Ringer <ri...@ringerc.id.au>.

On 07/03/12 16:35, mehdi houshmand wrote:
>> * Insert the concatenated content streams from the source PDF into the
>> output content stream. They must be surrounded by appropriate graphics
>> state save and restore operators and any necessary scale/position
>> operations to place the content where you want it.
> 
> HA HA!! Incorrect! If you look into the nooks and crannies of the PDF
> spec, you'll see that it's possible to use content stream arrays for
> the /Page content stream.

Sure - that's why I said the content stream(s) had to be concatenated
before insertion, because the input might be an array of content streams.

I was thinking that to get reliable results when overlaying you'd have
to wrap the whole series of drawing operations from the input in state
saving/restoring operations, etc, thus having to concatenate the streams
before wrapping. In retrospect, that's not true; one can just as well
wrap each copied content stream in state save/restore and scale/position
operations.

It might even be possible to get away without a graphics state
save/restore, but I don't think so. IIRC multiple content streams are
treated by the reader as if they were one concatenated stream, so you
still have to save/restore gstate to ensure the inserted stream doesn't
mess up anything after it. I'll have to check this in the PDF ref, though.
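
As a rough illustration of the wrapping described above, the sketch below
surrounds one copied content stream with the PDF operators q (save
graphics state), cm (concatenate a matrix onto the CTM) and Q (restore
graphics state). ContentStreamWrapper and its parameters are hypothetical
names for this example, and it assumes the stream has already been
decoded to plain operator text.

```java
import java.util.Locale;

/** Hypothetical helper: wraps copied content-stream text in q ... cm ... Q. */
class ContentStreamWrapper {
    /**
     * @param content decoded operators copied from the source page
     * @param sx horizontal scale factor
     * @param sy vertical scale factor
     * @param tx horizontal offset in user-space units
     * @param ty vertical offset in user-space units
     */
    static String wrap(String content, double sx, double sy, double tx, double ty) {
        // "a b c d e f cm" concatenates the matrix [a b c d e f] onto the CTM;
        // b = c = 0 means no rotation or skew. Locale.ROOT keeps "." as the decimal mark.
        String matrix = String.format(Locale.ROOT, "%.4f 0 0 %.4f %.4f %.4f cm", sx, sy, tx, ty);
        return "q\n" + matrix + "\n" + content.trim() + "\nQ\n";
    }

    public static void main(String[] args) {
        // Place a filled 100x50 rectangle at half size, offset by (72, 144).
        System.out.print(wrap("0 0 100 50 re f", 0.5, 0.5, 72, 144));
    }
}
```

Wrapping each stream separately only works if every source stream is
balanced on its own; a page whose q and Q operators are split across
several streams would still need concatenating first.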

> I'll leave exploring that to you, but
> basically it makes overlaying pages much much simpler. In related
> news, PDFBox does just that!! What we did (and it's super hack, but it
> worked) is if there were pages with both PDF-image content and FOP
> generated content, we'd get FOP to generate the content without the
> PDF-image and just overlay the pages. Best of both worlds!! (Though
> the purist in me is very much aggrieved)

Urk, that's horrible! Effective, though, I expect. Presumably you still
have to translate, scale and rotate, then clip, the content stream you're
overlaying, though.

> [snip]
>
> The more you describe your problem, the more it sounds like you need
> to do exactly what we did, but just to be sure, I thought I'd explain
> how we got there. Assumptions are a dangerous thing and I've probably
> made some about your issue too.

Given what you've described I'm inclined to agree that the cause of the
issues is the same. I suspect we're facing the same problem or very
similar problems, in which case my RIP crash issues may not be font
related after all.

I still want to fix the font issues because, rip crash causing or not,
the font subset duplication produces massively bloated PDFs that are
totally unsuitable for online distribution. It's kind of disheartening
to learn that the RIP crash issues are probably something else entirely,
since I thought I at least had to solve only one problem.

As for doing exactly what you did: I'd certainly be very interested in
seeing your PDFBox code for loading the fop-generated PDF, finding the
placeholders, and overlaying the PDF graphics over them. In particular
I'd like to see how you handled scaling/translation/rotation/clipping
when drawing the copied streams, and how you handled state saving and
restoration.

I can see overlaying over placeholders in post-processing as a really
useful interim solution, though eventually I'd like to enhance
fop-pdf-image to do that overlaying directly.

The really frustrating thing is that sometimes using an XObject will be
exactly the right thing to do, because the PDF being embedded actually
appears multiple times in the document. The solution to this links
neatly into the font de-duplication issue: fop image plugins need a way
to store per-render-run information, in this case so they can determine
how often an image occurs in a document during the preload run and make
an appropriate decision about how to embed it. I'm not sure it's even
necessary to have an image plugin api change for this; plugins should be
able to store enough information in a WeakHashMap<FOUserAgent,...> to
figure it out, so I should be able to make fop-pdf-image use form
XObjects only for pdf images referenced multiple times.
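
A minimal sketch of that idea, under stated assumptions: PerRunUsage and
its method names are made up for this example, and the type parameter R
stands in for FOUserAgent so the code compiles without FOP on the
classpath. How and when a plugin would actually see every reference
during preloading is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical per-run usage counter; R stands in for FOP's FOUserAgent. */
class PerRunUsage<R> {
    // Weak keys: an entry becomes collectable once the run object is no longer referenced.
    private final Map<R, Map<String, Integer>> countsByRun = new WeakHashMap<>();

    /** Called once per reference found while preloading; returns the new count. */
    synchronized int recordReference(R run, String imageUri) {
        return countsByRun
                .computeIfAbsent(run, r -> new HashMap<>())
                .merge(imageUri, 1, Integer::sum);
    }

    /** A form XObject only pays off when the same PDF is placed more than once. */
    synchronized boolean shouldUseFormXObject(R run, String imageUri) {
        Map<String, Integer> counts = countsByRun.get(run);
        return counts != null && counts.getOrDefault(imageUri, 0) > 1;
    }

    public static void main(String[] args) {
        PerRunUsage<Object> usage = new PerRunUsage<>();
        Object run = new Object(); // stand-in for the FOUserAgent of one rendering run
        usage.recordReference(run, "logo.pdf");
        usage.recordReference(run, "logo.pdf");
        usage.recordReference(run, "chart.pdf");
        System.out.println(usage.shouldUseFormXObject(run, "logo.pdf"));
        System.out.println(usage.shouldUseFormXObject(run, "chart.pdf"));
    }
}
```

Because the outer map holds its keys weakly, the per-run data goes away
with the run object and needs no explicit cleanup hook, which is the
property that makes the WeakHashMap workaround attractive here.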

--
Craig Ringer

Re: Fwd: fop-pdf-image and fonts; as requested

Posted by mehdi houshmand <me...@gmail.com>.

Hi Craig,

Excellent!!! I think we're making some progress here!

<snip>
> Ugh. A well-designed RIP should be able to load XObject forms on demand
> and free them under memory pressure. After all, an image is also a
> global resource that can be referenced multiple times across different
> pages (an indirect object with a stream), but PDFs with large numbers of
> images don't typically crash RIPs. There's no excuse for lots of small
> indirect objects crashing a RIP, be they images or form xobjects.

The operative word there is "well-designed", but also, I think you're
making a lot of assumptions about how the RIP handles these objects. I
don't disagree with your assumptions, but I'm just saying, you don't
know how the RIP handles these objects so you have to be careful.

<snip/>
> The same is technically true of rendering a form XObject. Once you've
> drawn it, you can discard its content stream from memory and discard any
> resources you loaded from its resources dictionary. The trouble is that
> you don't know if you'll just be loading it again for the next page.
> It'd be fairly simple to keep a LRU list of form XObjects so they get
> unloaded if they're not referenced after a few pages are processed and
> there's memory pressure. I won't be too surprised if most RIPs don't do
> this, though.

Yeah, again, assuming the people who designed the code designed it to
be robust and flexible is a dangerous assumption I think.

 <snip>
> If you want to use PDFs as image-like resources within a page (as I do)
> then you can't just append the /Page object from the source PDF. As I
> understand it (I haven't implemented this) it's necessary to:
>
> * Extract the /Page's content stream(s) plus all resources referenced
> * Append the referenced resource(s) to the target page's resource
> dictionary, allocating new object numbers as you copy a resource and
> changing the target of any indirect references to match the new object
> number
> * Insert the concatenated content streams from the source PDF into the
> output content stream. They must be surrounded by appropriate graphics
> state save and restore operators and any necessary scale/position
> operations to place the content where you want it.

HA HA!! Incorrect! If you look into the nooks and crannies of the PDF
spec, you'll see that it's possible to use content stream arrays for
the /Page content stream. I'll leave exploring that to you, but
basically it makes overlaying pages much much simpler. In related
news, PDFBox does just that!! What we did (and it's super hack, but it
worked) is if there were pages with both PDF-image content and FOP
generated content, we'd get FOP to generate the content without the
PDF-image and just overlay the pages. Best of both worlds!! (Though
the purist in me is very much aggrieved)

Ok, so maybe I'll add some transparency as to how we came to some of
these decisions. The client told us that PDFs of ~16k pages with
6-8k XObjects (I *heart* grep) were disproportionally slow and that
fonts were to blame, so obviously that's where we started. I managed
to do some font de-duping of Type1 fonts (seeing as FOP doesn't subset
these); it was horrendous, the fidelity was terrible, but I was just
experimenting. This made some impact, but not enough. So after some
more experimentation, proving fonts weren't to blame, we had to step
back and look at the problem again. We also found out that the RIP
times didn't scale linearly with the size of the document, i.e. if x
pages took y time, 2x pages was taking 10y time (if that makes sense).
This made us think it was a memory issue: somehow the RIP's memory was
filling up.
A lot of faffing about later, and we got to the conclusions I've
described.

The more you describe your problem, the more it sounds like you need
to do exactly what we did, but just to be sure, I thought I'd explain
how we got there. Assumptions are a dangerous thing and I've probably
made some about your issue too.

Hopefully we can get to some resolution about this soon,

Mehdi

Re: Fwd: fop-pdf-image and fonts; as requested

Posted by Craig Ringer <ri...@ringerc.id.au>.

On 06/03/12 18:49, mehdi houshmand wrote:
> We had this exact same problem the last time you brought this issue to
> light and our approach was slightly different. Let me first ask you
> the question: are you 100% sure that fonts are the issue here?

I'm never 100% certain of anything. I suspect fonts are the issue, but
it's hard to prove. You're quite right that RIP behavior re form
XObjects could well be the problem; I hadn't realised the extent to
which RIPs might simply assume they're re-used across multiple pages and
never free them from RAM until you pointed it out.

> When the pdf-image-plugin is used, ALL pdf-images are imported
> wholesale, creating a new XObject Form for each page. Now, this works
> perfectly fine for smaller documents, however, it can blow the memory
> stack on RIPs for larger docs. The reason being XObjects are treated
> as global resources of the PDF, as such, it is possible to create the
> XObject and use it multiple times. However, this means that each
> XObject and its resources, are being stored in memory on the RIP.

Ugh. A well-designed RIP should be able to load XObject forms on demand
and free them under memory pressure. After all, an image is also a
global resource that can be referenced multiple times across different
pages (an indirect object with a stream), but PDFs with large numbers of
images don't typically crash RIPs. There's no excuse for lots of small
indirect objects crashing a RIP, be they images or form xobjects.

The actual XObject dictionary may have to stay loaded since form
XObjects are named in a global namespace, but the XObject's resources
dictionary, content stream(s), etc certainly don't have to.
Unfortunately, that doesn't mean real-world RIPs will actually release
those resources under memory pressure just because they can. Since it's
hard to guess whether a form XObject will be referenced over and over or
used only once, this isn't that surprising.

> This is different to how a RIP can handle a /Page object. When
> printing/rendering a /Page object, the RIP only needs the page's
> content stream and any resources it references in memory. Once the
> page is rendered, the memory can be cleared.
The same is technically true of rendering a form XObject. Once you've
drawn it, you can discard its content stream from memory and discard any
resources you loaded from its resources dictionary. The trouble is that
you don't know if you'll just be loading it again for the next page.
It'd be fairly simple to keep a LRU list of form XObjects so they get
unloaded if they're not referenced after a few pages are processed and
there's memory pressure. I won't be too surprised if most RIPs don't do
this, though.
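
For what it's worth, the LRU idea fits in a few lines of Java using
LinkedHashMap in access order. This is purely illustrative:
FormXObjectCache is a hypothetical name, nothing here reflects how any
real RIP is implemented, and a real one would presumably evict on memory
pressure rather than on a fixed entry count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache of loaded form XObject data, keyed by object number. */
class FormXObjectCache<V> extends LinkedHashMap<Integer, V> {
    private final int maxEntries;

    FormXObjectCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: get() moves an entry to the "recent" end
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
        // Evict the least recently used entry once the cap is exceeded.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        FormXObjectCache<String> cache = new FormXObjectCache<>(2);
        cache.put(10, "form 10");
        cache.put(11, "form 11");
        cache.get(10);            // a later page draws form 10 again
        cache.put(12, "form 12"); // evicts form 11, the least recently used
        System.out.println(cache.keySet());
    }
}
```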

> When PDFBox merges docs,
> it doesn't use the XObject Form, it does so by appending /Page
> objects. This is the solution we came to, just adding a PDFBox merger
> to the pipeline.
If you're merging documents as whole pages, where you're plucking pages
from a source document and putting them unmodified into an output
document, that's entirely practical.

If you want to use PDFs as image-like resources within a page (as I do)
then you can't just append the /Page object from the source PDF. As I
understand it (I haven't implemented this) it's necessary to:

* Extract the /Page's content stream(s) plus all resources referenced
* Append the referenced resource(s) to the target page's resource
dictionary, allocating new object numbers as you copy a resource and
changing the target of any indirect references to match the new object
number
* Insert the concatenated content streams from the source PDF into the
output content stream. They must be surrounded by appropriate graphics
state save and restore operators and any necessary scale/position
operations to place the content where you want it.

It's a *LOT* more complicated to get right than embedding an XObject,
not least because two different source PDFs may have resource dictionary
entries with the same name, forcing you to actually parse and rewrite
the content streams to prevent resource name clashes!
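
To make the name-clash part concrete, here is a small sketch that only
plans the renames for one resource category (fonts, say). ResourceRenamer
and the "_1" suffix scheme are assumptions for the example. The hard part,
rewriting every operand in the copied content streams to use the new
names, is not shown.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: picks clash-free names for resources copied into a target page. */
class ResourceRenamer {
    /**
     * @param targetNames names already used in the target resources dictionary (updated in place)
     * @param sourceNames names used by the source page, in the same resource category
     * @return map from each source name to the name to use in the target
     */
    static Map<String, String> planRenames(Set<String> targetNames, Iterable<String> sourceNames) {
        Map<String, String> renames = new LinkedHashMap<>();
        for (String name : sourceNames) {
            String candidate = name;
            int suffix = 1;
            // Keep trying name_1, name_2, ... until the name is free in the target.
            while (targetNames.contains(candidate)) {
                candidate = name + "_" + suffix++;
            }
            targetNames.add(candidate);
            renames.put(name, candidate);
        }
        return renames;
    }

    public static void main(String[] args) {
        Set<String> target = new TreeSet<>(Arrays.asList("F1", "F2"));
        // F1 clashes with the target page and gets a new name; F3 is kept as-is.
        System.out.println(planRenames(target, Arrays.asList("F1", "F3")));
    }
}
```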

I looked at this approach a while ago in another project and ran
screaming. Form XObjects make sure the placed PDF is self-contained,
getting rid of naming clashes in the resources dictionary and ensuring
it's fairly sane to embed in another page.

Doing the above in fop would be even worse, because FOP has its own PDF
library so everything fop-pdf-image reads from pdfbox must be translated
into FOP pdf structures. Still, most of that is in place in
fop-pdf-image, so it *might* be worth tackling. I'm really hoping it's
not necessary, though, because merging and appending resources dicts and
content streams is *ugly* work. It could be done with a PDFBox
PDFStreamEngine, but it wouldn't be fun.

> So with that in mind, what exactly are you trying to do? Why are you
> using FOP to merge PDFs?
I'm using FOP to produce documents containing a mixture of automatically
typeset formatted text and graphics. Many of the graphics are PDF
documents, and need to be PDF documents because they contain vector
artwork and text that would lose quality and grow massively in size if
embedded in rasterised form.

I'm *NOT* trying to use fop to concatenate PDF pages, to impose PDFs, or
any of that. It'd make very little sense to do that.

> Do you need FOP to do this work?
I need either fop or TeX, or I need to write my own document layout
system. The latter would be insane - why implement text justification
and flow algorithms, etc, when they're already well established in fop?

> Have you
> tried merging PDFs with PDFBox and seeing how that affects the RIP?
I haven't, and it's worth a try. It'd produce a document containing many
hundreds of small, irregularly shaped pages, as each input PDF is quite
small. It'd certainly help confirm whether the issue was XObject form
use, or whether it was font duplication.

--
Craig Ringer