Posted to fop-dev@xmlgraphics.apache.org by Stephan Thesing <th...@gmx.de> on 2010/01/13 21:27:18 UTC

FOP and large documents: out of memory

Hello,

As is well known, FOP can run out of heap memory when large documents
are processed (http://xmlgraphics.apache.org/fop/0.95/running.html#memory).

The documents I have to process require a footer on each page containing a "page X of Y" element, plus a TOC at the
beginning of the document. FOP therefore cannot finish laying out the pages until all
referenced page citations are known, which is only after the last page of the document.

When the page content is complex (e.g. 2000 pages mostly filled with tables), the heap cannot hold all pages until all references are resolved, so FOP aborts with an out-of-memory error.

Since increasing the heap space does not always work (3 GB heap space was required in one example), I need a better solution for this.

1. "-conserve" option
One alternative would be the "-conserve" option, which serializes the pages to disk and reloads them as needed.
Although slow, this would definitely be a solution if it worked, which it doesn't:
Our documents include graphics (SVG, PNG), and serialization with "-conserve" throws an exception because some class in Batik is not serializable ("SVGOMAnimatedString", IIRC). The page is then missing, which causes FOP to abort later.
Thus, Batik would have to be fixed for this.

2. Two passes
Since the pages are kept because of unresolved references, one could do
what LaTeX has always done: process the document twice.
In the first run, pages are discarded after layout; only the references for page citations are kept, and they are reused in the second pass
(once the pages for all citations are finally known).
For the second run, these id-refs are loaded up front, so no pages have
to be kept.
This would require more changes in FOP (and should obviously be optional).
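
To make the two-pass idea concrete, here is a minimal toy sketch in Java. It is not FOP code: the classes `Block` and `References`, the line-counting "layout", and both method names are hypothetical, and it assumes that both passes produce the same page breaks (i.e. that the resolved page numbers do not change the layout).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy two-pass pagination; every name here is hypothetical, not FOP API. */
final class TwoPassSketch {

    /** A content block: an id plus the number of lines it occupies. */
    static final class Block {
        final String id;
        final int lines;
        Block(String id, int lines) { this.id = id; this.lines = lines; }
    }

    /** All that survives pass 1: id -> page number, and the page count. */
    static final class References {
        final Map<String, Integer> pageOfId = new LinkedHashMap<>();
        int totalPages;
    }

    /** Pass 1: lay out, record where each id lands, keep no page content. */
    static References firstPass(List<Block> blocks, int linesPerPage) {
        References refs = new References();
        int page = 1, used = 0;
        for (Block b : blocks) {
            // Start a new page when the block no longer fits on this one.
            if (used > 0 && used + b.lines > linesPerPage) { page++; used = 0; }
            refs.pageOfId.put(b.id, page);
            used += b.lines;
        }
        refs.totalPages = blocks.isEmpty() ? 0 : page;
        return refs;
    }

    /** Pass 2: same layout, but each page is emitted as soon as it is full. */
    static void secondPass(List<Block> blocks, int linesPerPage,
                           References refs, Consumer<String> sink) {
        int page = 1, used = 0;
        StringBuilder current = new StringBuilder();
        for (Block b : blocks) {
            if (used > 0 && used + b.lines > linesPerPage) {
                sink.accept(finish(current, page, refs));
                current.setLength(0);
                page++;
                used = 0;
            }
            current.append(b.id).append(' ');
            used += b.lines;
        }
        if (current.length() > 0) sink.accept(finish(current, page, refs));
    }

    /** The "page X of Y" footer is resolvable immediately in pass 2. */
    private static String finish(StringBuilder content, int page, References refs) {
        return content.toString().trim() + " | Page " + page + " of " + refs.totalPages;
    }

    public static void main(String[] args) {
        List<Block> doc = List.of(new Block("a", 3), new Block("b", 3),
                                  new Block("c", 2), new Block("d", 4));
        References refs = firstPass(doc, 5);
        secondPass(doc, 5, refs, System.out::println);
    }
}
```

The point of the sketch is only the memory profile: pass 1 retains a small map, and pass 2 holds at most one page at a time.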

I would appreciate any comments or other suggestions!

Best regards
Stephan
--
Dr.-Ing. Stephan Thesing
Elektrastr. 50
81925 München
GERMANY


Re: FOP and large documents: out of memory

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Stephan,

I’m not sure I would invest any energy into improving the
CachedRenderPagesModel (-conserve option). It doesn’t look like the
right approach to me, and like you noticed it doesn’t even work out of
the box currently.

Why store the Area Tree on disk? Why not directly render it into the
final output format? If the latter supports out-of-order pages, then
that’s great; otherwise we may as well store the final pages and order
them later on when the document is complete, instead of storing them in
a half-finished area tree format.

As for pages that hold unresolved references and so obviously can’t be
rendered yet: there usually aren’t so many of them that the area tree
solution would be vastly superior to a final-format one in terms of
memory consumption. Those pages could be kept in memory until all the
references they hold are resolved.

Also, the handling of forward references is currently less than optimal.
The resolution is made in the area tree instead of looping back to the
layout engine. ATM, a page-reference is rendered using a placeholder
string (‘MMM’), and that placeholder is later replaced with the actual
value (e.g., ‘5’). This is fine for constructs like tables of contents,
but may produce ugly results if the page-number-citation is inside
a paragraph, ruining the even spacing. What’s the point of implementing
a high-quality line-breaking algorithm if its output is spoiled by
a poor handling of page citations?

I think the two-pass approach is the best long-term solution, although
obviously less trivial. One challenge is to detect a possible infinite
loop. For example: the referenced item is at the beginning of page IX,
the reference is updated to IX, which takes less room than MMM, so the
document is re-laid out and the referenced item moves to page VIII;
the reference must be updated again, the document is laid out again and
the referenced item ends up on page IX again. And again, and again...
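
One way such an oscillation could be detected is to remember the values already tried and stop when one recurs. The Java sketch below is a simplified illustration, not FOP code: the class `ReferenceFixpoint` and its `resolve` method are hypothetical, and the whole layout state is reduced to a single cited page number, whereas a real implementation would have to compare the full set of resolved references.

```java
import java.util.HashSet;
import java.util.OptionalInt;
import java.util.Set;
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch: re-run layout until the cited page number is stable. */
final class ReferenceFixpoint {

    /**
     * @param initialGuess page number found with the placeholder in place
     * @param layout       maps "page number printed in the citation" to
     *                     "page the referenced item lands on after re-layout"
     * @param maxPasses    hard upper bound on the number of layout passes
     * @return the stable page number, or empty if the value oscillates
     *         or the pass limit is reached
     */
    static OptionalInt resolve(int initialGuess, IntUnaryOperator layout, int maxPasses) {
        Set<Integer> seen = new HashSet<>();
        int current = initialGuess;
        for (int pass = 0; pass < maxPasses; pass++) {
            int next = layout.applyAsInt(current);
            if (next == current) {
                return OptionalInt.of(current);   // fixed point: citation is correct
            }
            if (!seen.add(current)) {
                return OptionalInt.empty();       // value seen before: we are looping
            }
            current = next;
        }
        return OptionalInt.empty();               // gave up after maxPasses
    }

    public static void main(String[] args) {
        // The IX / VIII example: citing 9 moves the item to 8, citing 8 moves it to 9.
        IntUnaryOperator oscillating = cited -> cited == 9 ? 8 : 9;
        System.out.println(resolve(9, oscillating, 20).isPresent());
    }
}
```

What to do once a loop is detected (for instance, keeping the wider of the two candidates) is a separate policy decision that this sketch leaves open.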

One possible workaround for your use case is to generate your document
once into the intermediate format, with a dummy TOC and just “Page X” in
the footer; parse that output to get the total number of pages and the
page number of each TOC entry; then re-generate the document with
hardcoded values for the page references.
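
The parsing step of this workaround could look roughly like the Java sketch below. Note the assumptions: the XML shape used here (a `page` element per page, with an `id` attribute on the elements of interest) is a made-up simplification and NOT the actual schema of FOP's intermediate format, and the class `PageIndex` is hypothetical. Only the JDK's SAX parser is used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical scanner: counts pages and notes the page on which each id first appears. */
final class PageIndex extends DefaultHandler {
    int totalPages = 0;
    final Map<String, Integer> pageOfId = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // ASSUMPTION: each page is a <page> element; the real format may differ.
        if ("page".equals(qName)) {
            totalPages++;
        }
        // ASSUMPTION: referenced elements carry an "id" attribute.
        String id = atts.getValue("id");
        if (id != null && totalPages > 0) {
            pageOfId.putIfAbsent(id, totalPages);
        }
    }

    /** Streams through the XML once; memory use does not grow with page content. */
    static PageIndex scan(String xml) {
        try {
            PageIndex index = new PageIndex();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), index);
            return index;
        } catch (Exception e) {
            throw new IllegalStateException("could not scan intermediate output", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><page><block id=\"ch1\"/></page>"
                   + "<page><block id=\"ch2\"/><block id=\"ch3\"/></page></doc>";
        PageIndex index = scan(xml);
        System.out.println(index.totalPages + " pages, ids: " + index.pageOfId);
    }
}
```

The resulting map and page count would then be fed into the stylesheet that produces the second, final run.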

HTH,
Vincent


Re: FOP and large documents: out of memory

Posted by Andreas Delmelle <an...@telenet.be>.

On 13 Jan 2010, at 22:37, Stephan Thesing wrote:

>> 
>> On 13 Jan 2010, at 21:27, Stephan Thesing wrote:
> ...
>>> Our documents include graphics (SVG, PNG), and the serialization with
>>> "-conserve" throws an exception, because some class in Batik is not
>>> serializable (e.g. "SVGOMAnimatedString" IIRR), thus the page is missing, causing
>>> FOP to abort later.
>>> Thus, Batik would have to be fixed for this.
>> 
>> I think FOP can be 'fixed' for this too. If that is really the only class
>> that is causing trouble, then FOP could make a serializable subclass for
>> it, and use that in the area tree, instead of Batik's default
>> non-serializable implementation. Unless Batik really needs it, why fix it there?
> 
> I don't think that can work, as that class is used in elements nested in classes of Batik that represent the SVG.
> 
> I.e., FOP never instantiates it, but the Batik code does somewhere along

OK, I see...

Just noticed that 'subclassing' is probably not quite what I meant...
Suppose, for the sake of argument, that String were not serializable, but we needed it for some reason and the Java vendor did not want to alter their implementation. What could be done is to store only the info needed to create a new String upon deserialization: serialize the char array, and re-instantiate the String from it.

I was thinking something similar should be possible here, but if it is really that far out of FOP's control, then never mind.
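
The idea described above can be sketched with Java's standard custom-serialization hooks (`writeObject`/`readObject` plus a `transient` field). This is a generic illustration under stated assumptions: `ForeignValue` is a hypothetical stand-in for a non-serializable third-party class, and `ForeignValueHolder` and `HolderDemo` are invented names, not FOP or Batik classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a foreign class that is not Serializable. */
final class ForeignValue {
    private final char[] chars;
    ForeignValue(char[] chars) { this.chars = chars.clone(); }
    char[] toChars() { return chars.clone(); }
}

/** Serializable holder that writes only the data needed to rebuild the value. */
final class ForeignValueHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: default serialization skips this field entirely.
    private transient ForeignValue value;

    ForeignValueHolder(ForeignValue value) { this.value = value; }
    ForeignValue get() { return value; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeObject(value.toChars());   // a char[] is serializable
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        value = new ForeignValue((char[]) in.readObject());  // re-instantiate
    }
}

final class HolderDemo {
    /** Serializes the holder to memory and reads it back. */
    static ForeignValueHolder roundTrip(ForeignValueHolder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ForeignValueHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ForeignValueHolder copy =
                roundTrip(new ForeignValueHolder(new ForeignValue("SVG".toCharArray())));
        System.out.println(new String(copy.get().toChars()));
    }
}
```

This only works where the holder owns the field that refers to the foreign object; it does not help when the object is created and referenced deep inside the third-party library's own object graph.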

Regards

Andreas

Andreas Delmelle
mailto:andreas.delmelle.AT.telenet.be
---

Re: FOP and large documents: out of memory

Posted by Stephan Thesing <th...@gmx.de>.

Hi Andreas,

-------- Original Message --------
> Date: Wed, 13 Jan 2010 21:42:51 +0100
> From: Andreas Delmelle <an...@telenet.be>
> To: fop-dev@xmlgraphics.apache.org
> Subject: Re: FOP and large documents: out of memory

> 
> On 13 Jan 2010, at 21:27, Stephan Thesing wrote:
...
> > Our documents include graphics (SVG, PNG), and the serialization with
> > "-conserve" throws an exception, because some class in Batik is not
> > serializable (e.g. "SVGOMAnimatedString" IIRR), thus the page is missing, causing
> > FOP to abort later.
> > Thus, Batik would have to be fixed for this.
> 
> I think FOP can be 'fixed' for this too. If that is really the only class
> that is causing trouble, then FOP could make a serializable subclass for
> it, and use that in the area tree, instead of Batik's default
> non-serializable implementation. Unless Batik really needs it, why fix it there?

I don't think that can work, as that class is used in elements nested in classes of Batik that represent the SVG.

I.e., FOP never instantiates it, but the Batik code does somewhere along
the way of creating the SVG element that is actually used in the Area tree....
(I am not sure whether it is the only class that cannot be serialized, since serialization is aborted as soon as the first non-serializable class is encountered.)
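
The behaviour described here is standard Java serialization and can be reproduced in isolation. The sketch below uses hypothetical stand-in classes (`ThirdPartyValue`, `PageStub`, `SerializationCheck`), not FOP or Batik code; it shows a serializable object failing because of a nested non-serializable one, and that only the first offending class is reported.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a library class that is not Serializable. */
final class ThirdPartyValue {
    final String text;
    ThirdPartyValue(String text) { this.text = text; }
}

/** Hypothetical page: Serializable itself, but it nests a foreign object. */
final class PageStub implements Serializable {
    private static final long serialVersionUID = 1L;
    final int number;
    final ThirdPartyValue embedded;   // not Serializable, created elsewhere

    PageStub(int number, ThirdPartyValue embedded) {
        this.number = number;
        this.embedded = embedded;
    }
}

final class SerializationCheck {
    /**
     * Tries to serialize the object graph to memory.
     * Returns the exception message for the first non-serializable class
     * encountered (by default, its class name), or null if the write succeeds.
     */
    static String firstNonSerializableClass(Object root) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(root);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageStub page = new PageStub(1, new ThirdPartyValue("x"));
        System.out.println(firstNonSerializableClass(page));
    }
}
```

Because the write stops at the first failure, finding every problematic class this way means fixing one, re-running, and repeating.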

Best regards
   Stephan

-- 
Dr.-Ing. Stephan Thesing
Elektrastr. 50
81925 München
GERMANY


Re: FOP and large documents: out of memory

Posted by Andreas Delmelle <an...@telenet.be>.

On 13 Jan 2010, at 21:27, Stephan Thesing wrote:

Hi Stephan,

<snip />
> Since increasing the heap space does not always work (3 GB heap space was required in one example), I need a better solution for this.
> 
> 1. "-conserve" option
> One alternative would be the "-conserve" option, which serializes the pages to disk and reloads them as needed.
> Although slow, this definitely would be a solution, if it worked, which it doesn't:
> Our documents include graphics (SVG, PNG), and the serialization with "-conserve" throws an exception, because some class in Batik is not serializable (e.g. "SVGOMAnimatedString" IIRR), thus the page is missing, causing FOP to abort later.
> Thus, Batik would have to be fixed for this.

I think FOP can be 'fixed' for this too. If that is really the only class that is causing trouble, then FOP could make a serializable subclass for it, and use that in the area tree, instead of Batik's default non-serializable implementation. Unless Batik really needs it, why fix it there?

It would require some thought on a (de)serialization routine, though... But it seems much easier/faster to implement than the two-pass approach, if time/effort is of the essence.

Regards,

Andreas
mailto:andreas.delmelle.AT.telenet.be

---