Posted to users@pdfbox.apache.org by Ralf Baumert <Ra...@xclinical.com> on 2019/11/06 07:25:15 UTC

Memory issues

Hello list,


I'm trying to generate rather large PDF files with PDFBox (current) and
I'm running into memory issues.

I created the PDDocument with MemoryUsageSetting.setupTempFileOnly() and I can
see that it creates a scratch file.
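
For reference, a minimal sketch of that setup (assuming PDFBox 2.x; the output file name is just an example):

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

// buffer all document data in a temporary scratch file instead of main memory
try (PDDocument doc = new PDDocument(MemoryUsageSetting.setupTempFileOnly()))
{
    doc.addPage(new PDPage());
    doc.save(new File("large.pdf"));
}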

However, it still consumes a lot of memory and eventually crashes with an
OutOfMemoryError.

The heap dump shows loads of COSStream objects.


My question: is this a known bug or limitation? Is there a workaround?


Some details:

- Of course I increased -Xmx, but sooner or later it still runs out of memory.

- I'm opening a new PDPageContentStream for each element (like a table or a
paragraph). Is this the correct way to do things, or am I supposed to have only
one stream? (Note: I'm using Boxable, which creates a stream for each table.)

- I noticed the saveIncremental() method, but it states that it can only be
used when the PDF has been read from a file. I could try to create the first
page, save the file, load it again to add more pages, and then call this
method. Is this feasible?

- The resulting PDF will be about 5 GB in size; this is a hard requirement.


Regards,

Ralf


Re: Memory issues

Posted by Ralf Baumert <Ra...@xclinical.com>.
Thanks for the reply, Tilman! Unfortunately I couldn't try it out yet, but what I did last week was quite promising: open only one content stream per page and reuse it (after all, it's called "PageContentStream"), and set "resetContext" to false. It ran much better after these changes, but I haven't had the time yet to really test it with a full-blown PDF.
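
In PDFBox 2.x that setup would look roughly like this (a sketch; doc and page stand for the document and the current page):

import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;

// one content stream per page, appended to for every element on that page;
// compress = true, resetContext = false
try (PDPageContentStream cs = new PDPageContentStream(doc, page, AppendMode.APPEND, true, false))
{
    // draw all tables and paragraphs of this page with the same cs
}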

ciao,
Ralf
________________________________
From: Tilman Hausherr <TH...@t-online.de>
Sent: Wednesday, November 6, 2019, 18:38
To: users@pdfbox.apache.org <us...@pdfbox.apache.org>
Subject: Re: Memory issues


Re: Memory issues

Posted by Tilman Hausherr <TH...@t-online.de>.

saveIncremental() is best suited for signing, and it still loads data into
memory. You could still try saving your file, then loading it again, adding
pages, and saving normally, to see whether that makes things better - I don't
think it will.
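
For completeness, that save-and-reload experiment might look roughly like this (a sketch; the file names are placeholders, and the final save deliberately goes to a different file than the one being read):

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

File firstPart = new File("report-part1.pdf");   // placeholder names
File finalFile = new File("report.pdf");

// create the first pages and save them
PDDocument doc = new PDDocument(MemoryUsageSetting.setupTempFileOnly());
// ... add the first pages ...
doc.save(firstPart);
doc.close();

// load the intermediate file with a scratch file, add the remaining pages,
// and save normally to a different file
doc = PDDocument.load(firstPart, MemoryUsageSetting.setupTempFileOnly());
// ... add more pages ...
doc.save(finalFile);
doc.close();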

I suspect that memory usage gets worse if you have many small page content
streams instead of one large one, because of the memory management of the page
buffers. So what you could try is, after you are finished with a page, copying
the streams back into a single one:


import java.io.InputStream;
import java.io.OutputStream;
import java.util.Iterator;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.pdmodel.common.PDStream;

// read the page's current content stream(s) into memory
byte[] ba;
try (InputStream is = page.getContents())
{
    ba = IOUtils.toByteArray(is);
}

// write it back as one Flate-compressed stream
PDStream newPDStream = new PDStream(doc);
try (OutputStream os = newPDStream.createOutputStream(COSName.FLATE_DECODE))
{
    os.write(ba);
}

// close the old content streams so their buffers can be released
Iterator<PDStream> it = page.getContentStreams();
while (it.hasNext())
{
    PDStream pds = it.next();
    pds.getCOSObject().close();
}

// replace the many small streams with the single new one
page.setContents(newPDStream);


Please tell us whether this improves things, i.e. whether you can create more
pages before the OOM. (That code can be optimized even further by copying
directly, without the intermediate byte array.)
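
A direct copy could look roughly like this (a sketch, reusing the same page and doc as above; closing the old streams and calling page.setContents() stay the same):

// stream the old content directly into the new Flate-compressed stream,
// without buffering it in a byte array
PDStream newPDStream = new PDStream(doc);
try (InputStream is = page.getContents();
     OutputStream os = newPDStream.createOutputStream(COSName.FLATE_DECODE))
{
    IOUtils.copy(is, os);
}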


Other ways to optimize big PDF files: use each font only once, i.e. don't
create a new font object for each page. The same goes for images, e.g. a
company logo: create the image object only once.
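
A rough sketch of that reuse (assuming PDFBox 2.x; the font and image paths are placeholders, and doc and pageCount stand for your document and page count):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// load shared resources once, before the page loop
PDType0Font font = PDType0Font.load(doc, new File("fonts/MyFont.ttf"));
PDImageXObject logo = PDImageXObject.createFromFile("images/logo.png", doc);

for (int i = 0; i < pageCount; i++)
{
    PDPage page = new PDPage();
    doc.addPage(page);
    try (PDPageContentStream cs = new PDPageContentStream(doc, page))
    {
        // reuse the same font and image objects on every page
        cs.drawImage(logo, 50, 700);
        cs.beginText();
        cs.setFont(font, 12);
        cs.newLineAtOffset(50, 650);
        cs.showText("Page " + (i + 1));
        cs.endText();
    }
}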

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org