Posted to fop-dev@xmlgraphics.apache.org by Alexios Giotis <al...@gmail.com> on 2010/12/22 17:59:08 UTC

Patch for large memory usage inside o.a.f.r.p.PDFDocumentHandler

Hi fop-dev,

In one of my use cases, I create a PDF file of about 20,000 pages from the FOP intermediate format. I had imagined this as a streaming process (e.g. read a page in FOP_IF, write it to PDF, release the memory), with the exception of image caching. In reality, by analyzing a heap dump taken with the -XX:+HeapDumpOnOutOfMemoryError parameter on a production server, I found that o.a.f.r.p.PDFDocumentHandler keeps a reference for every page, to be used for bookmarks & outlines. In my case, the retained heap size per page is about 150 KB. Multiplied by the number of pages, the memory usage is large. Even worse, my production server has 10 threads creating 20k-page documents in parallel.

Attached is a patch against the latest trunk revision, 1051938, that considerably reduces the memory usage by keeping only a String pdfPageRef instead of the full org.apache.fop.pdf.PDFReference object. This is possible because that string is all we need from the object. Ideally, I would like to keep no page references at all when bookmarks & outlines are not used, or at least to keep them only for the pages that are actually referenced. Is this possible? If so, do you have any hints?
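
To illustrate the idea behind the patch (this is not the patch itself), here is a minimal Java sketch. The names PageRefSketch, HeavyPageReference, pageFinished and getPageReference are hypothetical stand-ins and not FOP classes or methods. The sketch also assumes that the string in question is the PDF indirect-reference form ("<object> <generation> R").

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: these are stand-in types, not FOP classes. */
class PageRefSketch {

    /** Hypothetical heavyweight reference that keeps a whole page's data reachable. */
    static final class HeavyPageReference {
        private final int objectNumber;
        private final byte[] retainedPageData; // simulates the retained heap of one page

        HeavyPageReference(int objectNumber, int retainedBytes) {
            this.objectNumber = objectNumber;
            this.retainedPageData = new byte[retainedBytes];
        }

        /** Indirect-reference form used in PDF files: "<object> <generation> R". */
        @Override
        public String toString() {
            return objectNumber + " 0 R";
        }
    }

    /** Page index -> reference string; no heavyweight object is stored. */
    private final Map<Integer, String> pageReferences = new HashMap<>();

    /** Called once per finished page; only the short string outlives the call. */
    void pageFinished(int pageIndex, HeavyPageReference reference) {
        pageReferences.put(Integer.valueOf(pageIndex), reference.toString());
    }

    /** Returns the stored reference string, or null if the page is unknown. */
    String getPageReference(int pageIndex) {
        return pageReferences.get(Integer.valueOf(pageIndex));
    }

    public static void main(String[] args) {
        PageRefSketch handler = new PageRefSketch();
        for (int page = 0; page < 20_000; page++) {
            // Each 150 KB object becomes eligible for garbage collection after its iteration.
            handler.pageFinished(page, new HeavyPageReference(page + 1, 150 * 1024));
        }
        System.out.println(handler.getPageReference(19_999));
    }
}
```

The map still grows by one entry per page, but each entry now holds a short string rather than keeping a large object graph reachable.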

If further optimizations are not possible or are too complex, I will just open an issue and attach this patch. I hope you agree with the addition of generics to the Map declaration and with the change from "new Integer()" to "Integer.valueOf()" (a findbugs performance warning).
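
As background for the two small changes mentioned above, here is a short Java sketch; the class and method names are made up for illustration. Integer.valueOf() is required to return cached instances for values from -128 to 127, whereas "new Integer()" always allocates (and has been deprecated since Java 9). A Map declared with generics needs no casts on lookup.

```java
import java.util.HashMap;
import java.util.Map;

class BoxingSketch {

    /** True when two valueOf calls for the same int return the very same object. */
    static boolean sharesInstance(int value) {
        Integer first = Integer.valueOf(value);
        Integer second = Integer.valueOf(value);
        return first == second; // identity comparison on purpose
    }

    /** With a typed map, lookups need no cast and int keys are boxed automatically. */
    static String lookup(Map<Integer, String> refs, int pageIndex) {
        return refs.get(pageIndex);
    }

    public static void main(String[] args) {
        Map<Integer, String> refs = new HashMap<>();
        refs.put(Integer.valueOf(3), "4 0 R");
        System.out.println(lookup(refs, 3));
        System.out.println(sharesInstance(100));
    }
}
```

For values outside the guaranteed cache range, valueOf() may or may not return a shared instance, so code should not rely on identity there.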


Greetings,
Alexis Giotis


Re: Patch for large memory usage inside o.a.f.r.p.PDFDocumentHandler

Posted by Alexios Giotis <al...@gmail.com>.
Merry Christmas to all!

Simon, thanks for applying my patch and for your hints. The memory usage of that map is now a few MB. Initially it was around 2 GB, so it is not worth making further, more complex changes.

Alexis


On Dec 23, 2010, at 12:34 PM, Simon Pepping wrote:

> I implemented your suggestion in revision 1052214. Thanks.
> 
> To avoid keeping any references at all, you might implement a
> command-line option. Alternatively, you might scan the FO tree to
> check whether any page references are used. However, this would only
> work if there is only one page sequence in the FO file. There is no
> way to know that during processing, since FOP processes one page
> sequence at a time, without looking ahead in the FO file. Unlike
> LaTeX, FOP implements a single-pass process, page references included.
> The price seems to be the memory used to keep the necessary data in
> case any reference occurs.
> 
> Simon


Re: Patch for large memory usage inside o.a.f.r.p.PDFDocumentHandler

Posted by Simon Pepping <sp...@leverkruid.eu>.
I implemented your suggestion in revision 1052214. Thanks.

To avoid keeping any references at all, you might implement a
command-line option. Alternatively, you might scan the FO tree to
check whether any page references are used. However, this would only
work if there is only one page sequence in the FO file. There is no
way to know that during processing, since FOP processes one page
sequence at a time, without looking ahead in the FO file. Unlike
LaTeX, FOP implements a single-pass process, page references included.
The price seems to be the memory used to keep the necessary data in
case any reference occurs.

Simon
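
The opt-out suggested above could look roughly like the following Java sketch. It is only an illustration under stated assumptions: the class OptionalPageRefs, the keepPageReferences flag and the method names are hypothetical, and no such option is known to exist in FOP.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a handler that records page references only on request. */
class OptionalPageRefs {

    private final boolean keepPageReferences; // hypothetical switch, e.g. set from a command-line option
    private final Map<Integer, String> pageReferences = new HashMap<>();

    OptionalPageRefs(boolean keepPageReferences) {
        this.keepPageReferences = keepPageReferences;
    }

    /** Records the reference string of a finished page unless recording is switched off. */
    void pageFinished(int pageIndex, String pdfPageRef) {
        if (keepPageReferences) {
            pageReferences.put(Integer.valueOf(pageIndex), pdfPageRef);
        }
    }

    /** Resolves a page index; fails loudly if references were never recorded. */
    String resolve(int pageIndex) {
        if (!keepPageReferences) {
            throw new IllegalStateException("page references were disabled for this run");
        }
        return pageReferences.get(Integer.valueOf(pageIndex));
    }

    /** Number of entries currently held in memory. */
    int retainedEntries() {
        return pageReferences.size();
    }

    public static void main(String[] args) {
        OptionalPageRefs streaming = new OptionalPageRefs(false);
        OptionalPageRefs linked = new OptionalPageRefs(true);
        for (int page = 0; page < 3; page++) {
            String ref = (page + 1) + " 0 R";
            streaming.pageFinished(page, ref);
            linked.pageFinished(page, ref);
        }
        System.out.println(streaming.retainedEntries());
        System.out.println(linked.retainedEntries());
    }
}
```

The decision has to be made up front, by the user, for the reason given above: in a single pass, the handler cannot know whether a later page sequence will refer back to an earlier page.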
