Posted to users@pdfbox.apache.org by David Fertig <Da...@navihealth.com> on 2017/12/07 18:24:43 UTC

Questions about PDFMergerUtility

I'm looking into merging multiple PDF files under more realistic memory/disk limits.  For example, when merging 400 1-page files, PDFBox thinks it needs 30G of space.  This is because it segments the cache limit across all the input sources plus the output file, so the output cache is limited to the same size as each input file's cache.  I've experimented with two easy modifications and one more involved modification.

  1.  Good: Split the cache in ½, give ½ to the output file, and segment the other ½ across the input files. (Still keeping them open until the end.)
  2.  Better: Split the cache in ½, give ½ to the output file and ½ to the current input file, and close each input file after merging it.
  3.  Best: Dynamically allocate 16-page (64K) chunks from memory or disk on demand, and release cache as documents are closed after merging.
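
To make approach 3 concrete, here is a minimal sketch of on-demand chunk accounting. It is not PDFBox code: the ChunkPool class, its method names, and the document IDs are hypothetical, and the 4K page size is only inferred from "16 pages = 64K". It tracks chunk counts and allocates no real buffers or scratch files.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical accounting-only model of approach 3; not part of PDFBox.
final class ChunkPool {
    static final int PAGE_SIZE = 4 * 1024;        // assumed 4K page, so 16 pages = 64K
    static final int CHUNK_SIZE = 16 * PAGE_SIZE; // one allocation unit

    private final long maxMemoryChunks;
    private final long maxDiskChunks;
    private long memoryChunksInUse;
    private long diskChunksInUse;
    // Per-document counts: [0] = memory chunks, [1] = disk chunks.
    private final Map<String, long[]> owned = new HashMap<>();

    ChunkPool(long maxMemoryBytes, long maxDiskBytes) {
        this.maxMemoryChunks = maxMemoryBytes / CHUNK_SIZE;
        this.maxDiskChunks = maxDiskBytes / CHUNK_SIZE;
    }

    /** Reserves one chunk for a document: memory first, then disk. */
    void allocate(String documentId) {
        long[] counts = owned.computeIfAbsent(documentId, id -> new long[2]);
        if (memoryChunksInUse < maxMemoryChunks) {
            memoryChunksInUse++;
            counts[0]++;
        } else if (diskChunksInUse < maxDiskChunks) {
            diskChunksInUse++;
            counts[1]++;
        } else {
            throw new IllegalStateException("cache limit exceeded");
        }
    }

    /** Returns all of a document's chunks to the pool when it is closed. */
    void release(String documentId) {
        long[] counts = owned.remove(documentId);
        if (counts != null) {
            memoryChunksInUse -= counts[0];
            diskChunksInUse -= counts[1];
        }
    }

    /** Total bytes currently reserved across memory and disk. */
    long bytesInUse() {
        return (memoryChunksInUse + diskChunksInUse) * CHUNK_SIZE;
    }
}
```

With a pool like this, the limit only has to cover the chunks that are live at one time, because release() hands an input document's chunks back as soon as that document is closed.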

All these approaches have reduced the memory limit requirements by 1-2 orders of magnitude.  While I realize this doesn't change the actual memory and disk space used, it makes the configured limits a reasonable estimate of the space actually used during the merge process.
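
A rough model of where the reduction comes from, as a sketch. The formulas are assumptions drawn from the description above, not PDFBox's actual accounting, and the 75 MiB output size is a made-up figure chosen only so that the even split lands near 30G for 400 inputs. The CacheBudget class and its method names are hypothetical.

```java
// Illustrative model only; the formulas and numbers are assumptions,
// not taken from the PDFBox source.
final class CacheBudget {

    /**
     * Current behavior as described: the limit is divided evenly across all
     * inputs plus the output, so the output gets limit / (inputCount + 1).
     * Returns the total limit needed for the output to get what it needs.
     */
    static long limitForEvenSplit(int inputCount, long outputBytesNeeded) {
        return (inputCount + 1L) * outputBytesNeeded;
    }

    /**
     * Approaches 1 and 2: half of the limit is reserved for the output.
     * Assumes the output cache is the binding constraint.
     */
    static long limitForHalfSplit(long outputBytesNeeded) {
        return 2L * outputBytesNeeded;
    }

    public static void main(String[] args) {
        int inputs = 400;
        long outputNeeded = 75L * 1024 * 1024; // hypothetical 75 MiB output cache
        long even = limitForEvenSplit(inputs, outputNeeded);
        long half = limitForHalfSplit(outputNeeded);
        System.out.printf("even split: %d MiB%n", even >> 20);
        System.out.printf("half split: %d MiB%n", half >> 20);
        System.out.printf("reduction:  %.1fx%n", (double) even / half);
    }
}
```

Under this model the required limit no longer scales with the number of input files, which is consistent with the 1-2 orders of magnitude mentioned above.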

I have one question.  Approaches #2 and #3 both close each input file right after it is merged, and I have seen no issues (in limited testing).  Is there a reason the current merge utility keeps all the input files open during the merge and only closes them at the end?  Closing each file after it is merged would save considerable cache space and reduce the number of open file handles as well.

Thank you,
David

Re: Questions about PDFMergerUtility

Posted by Tilman Hausherr <TH...@t-online.de>.
There's a bug in merging:
https://stackoverflow.com/questions/47140209/files-flattened-and-merged-with-pdfbox-are-sharing-common-cosstream
https://issues.apache.org/jira/browse/PDFBOX-3999

If the input documents don't have a structure tree, you can close each one early.

Tilman


