Posted to users@pdfbox.apache.org by John Lussmyer <Co...@CasaDelGato.com> on 2022/01/06 17:26:34 UTC

memory requirements when merging PDF files?

I need to merge a couple thousand PDFs into one humongous PDF.
The old tool we use for PDF manipulation runs out of memory because it builds the result PDF in memory and only writes it out when done.

Can PDFBox do something more like streaming the output as it's built? Or even not load all the source PDF content streams until they are needed for output?


Re: memory requirements when merging PDF files?

Posted by Gilad Denneboom <gi...@gmail.com>.
If you're using the PDFMergerUtility class, you can pass a
MemoryUsageSetting parameter when calling mergeDocuments. Use its
setupTempFileOnly method so that PDFBox buffers data in a temporary file
instead of holding it in memory.
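
A minimal sketch of that suggestion, assuming the PDFBox 2.0.x API (the 3.x line changed the merge signature, so check the Javadoc for your version). The class name TempFileMerge and its merge helper are made up for illustration; only PDFMergerUtility and MemoryUsageSetting come from PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

// Hypothetical helper class; not part of PDFBox.
final class TempFileMerge {

    /** Merges the source PDFs, in list order, into the destination file. */
    static void merge(List<File> sources, File destination) throws IOException {
        PDFMergerUtility merger = new PDFMergerUtility();
        for (File source : sources) {
            // Only registers the file; it is read during mergeDocuments.
            merger.addSource(source);
        }
        merger.setDestinationFileName(destination.getPath());
        // Buffer stream data in a scratch file instead of on the Java heap.
        merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }
}
```

Calling TempFileMerge.merge(files, new File("merged.pdf")) writes the combined document to merged.pdf. Whether this is enough for a couple thousand inputs depends on the files and has to be measured.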

On Thu, Jan 6, 2022 at 6:27 PM John Lussmyer <Co...@casadelgato.com> wrote:

> I have a need to merge a couple thousand PDF's into one humongous PDF.
> The old tool we use for PDF manipulation runs out of memory as it builds
> the result PDF in memory, and only writes it out when done.
>
> Can PDFBox do something more like streaming the output as it's built?  or
> even not load all the source pdf content streams until needed for output?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org

Re: memory requirements when merging PDF files?

Posted by Tilman Hausherr <TH...@t-online.de>.
Yes, that's a good idea. However, it will be much slower, and the objects
will still be in memory; only the stream contents (e.g. images, fonts,
content streams) will be on disk.

Tilman

On 07.01.2022 at 17:55, Kevin Day wrote:
> If you use the temporary file memory storage, it should be possible to work
> with very large files.
>
> https://stackoverflow.com/questions/11301818/pdfbox-working-with-very-large-pdfs/38859566
>
> This isn't streaming (pdf is not really amenable to streaming like you are
> asking), but the disk based scratch memory should get you what you need.




Re: memory requirements when merging PDF files?

Posted by John Lussmyer <Co...@CasaDelGato.com>.
On Fri Jan 07 08:55:38 PST 2022 kevin@trumpetinc.com said:
>If you use the temporary file memory storage, it should be possible to work
>with very large files.

Thanks, I was hoping there was some way to deal with this case.

I just ran a quick test, generating a 2000-page PDF by placing a 1-page PDF on each output page.
I used LayerUtility and PDFormXObject, as the real usage will involve placing multiple small PDFs on a large page, for many large pages.
The 1-page PDF was 291K; the resulting 2000-page PDF was 168MB.
(I was calling gc() just before reporting the usage.)
Doing it all in memory:
	7m 38s, and peaked at 424MB in use.
With setupTempFileOnly on the output document:
	7m 1s, 292MB.
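
For anyone who wants to repeat this kind of measurement, here is a rough sketch of sampling heap usage after a gc() call and tracking the highest value seen. It is not the code used for the numbers above. HeapProbe and its methods are hypothetical names, and Runtime.gc() is only a request to the JVM, so the figures are approximate.

```java
import java.util.Locale;

// Hypothetical helper; uses only the standard library.
final class HeapProbe {
    private long peakBytes;

    /** Requests a collection, then returns the heap bytes currently in use. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // a hint only; the JVM decides how much to collect
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Takes one sample and remembers the largest value seen so far. */
    long sample() {
        long used = usedHeapBytes();
        peakBytes = Math.max(peakBytes, used);
        return used;
    }

    long peakBytes() {
        return peakBytes;
    }

    static String formatMb(long bytes) {
        return String.format(Locale.ROOT, "%.1f MB", bytes / (1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        HeapProbe probe = new HeapProbe();
        probe.sample();
        byte[] ballast = new byte[16 * 1024 * 1024]; // stand-in for per-page work
        probe.sample();
        System.out.println("ballast bytes: " + ballast.length);
        System.out.println("peak heap in use: " + formatMb(probe.peakBytes()));
    }
}
```

In a real run, sample() would be called once per output page or per batch of pages. Sampling this way can miss short-lived spikes between samples.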




Re: memory requirements when merging PDF files?

Posted by Kevin Day <ke...@trumpetinc.com>.
If you use the temporary file memory storage, it should be possible to work
with very large files.

https://stackoverflow.com/questions/11301818/pdfbox-working-with-very-large-pdfs/38859566

This isn't streaming (PDF is not really amenable to streaming in the way you
are asking), but the disk-based scratch memory should get you what you need.

On Fri, Jan 7, 2022, 12:18 AM Tilman Hausherr <TH...@t-online.de> wrote:

> On 06.01.2022 at 18:26, John Lussmyer wrote:
> > I have a need to merge a couple thousand PDF's into one humongous PDF.
> > The old tool we use for PDF manipulation runs out of memory as it builds
> > the result PDF in memory, and only writes it out when done.
> >
> > Can PDFBox do something more like streaming the output as it's built?
> > or even not load all the source pdf content streams until needed for output?
>
>
> No + Yes, so you'll also run out of memory at some time.
>
> If the huge job is for printing, then remove the structure tree from
> each file, which is obviously not needed (it is for screen readers). You
> should save somewhere and reload so that these are no longer in memory.
>
> Tilman

Re: memory requirements when merging PDF files?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 06.01.2022 at 18:26, John Lussmyer wrote:
> I have a need to merge a couple thousand PDF's into one humongous PDF.
> The old tool we use for PDF manipulation runs out of memory as it builds the result PDF in memory, and only writes it out when done.
>
> Can PDFBox do something more like streaming the output as it's built?  or even not load all the source pdf content streams until needed for output?


No + Yes, so you'll also run out of memory at some time.

If the huge job is for printing, then remove the structure tree from
each file, which is obviously not needed (it is for screen readers). You
should save each stripped file somewhere and reload it so that the removed
structures are no longer in memory.
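
One way this could look, as a sketch assuming the PDFBox 2.0.x API (PDDocument.load and PDDocumentCatalog.setStructureTreeRoot). StripStructureTree and strip are made-up names, and passing null to clear the entry is my reading of the API, not something confirmed in this thread.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

// Hypothetical helper class; not part of PDFBox.
final class StripStructureTree {

    /** Writes a copy of 'in' to 'out' without the structure tree root. */
    static void strip(File in, File out) throws IOException {
        // Scratch-file buffering keeps stream data off the heap while loading.
        try (PDDocument doc = PDDocument.load(in, MemoryUsageSetting.setupTempFileOnly())) {
            // Assumption: a null root removes the /StructTreeRoot entry.
            doc.getDocumentCatalog().setStructureTreeRoot(null);
            // Save to a different file; the stripped copy is what gets merged.
            doc.save(out);
        }
    }
}
```

The stripped copies would then be used as the merge sources in place of the originals. This sketch leaves marked-content operators and StructParents entries in the pages untouched.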

Tilman


