Posted to users@pdfbox.apache.org by Jörn Haferstroh <ha...@gmx.de> on 2013/12/07 01:01:40 UTC

Merging multiple PDF into HTTP output stream

Hi,

First, let me give credit to the PDFBox developers for this very 
usable tool. Please keep up the good work, guys!

I have a web application storing lots of PDF documents in a database. 
For easier bulk download and printing, I am using pdfbox to merge 
multiple PDF documents into one large PDF document for download. The 
destination stream of the merge is the HTTP output stream, so the merged 
PDF data goes directly to the requesting web client.

Today I learned from a "too many open files" error that PDFBox creates 
a temporary file for each source input stream and keeps it open until 
the end of the merge process (I tried to merge 1025 PDF sources into one 
PDF on a Linux box). Is this behaviour necessary, perhaps because of the 
PDF format? In any case, I was able to work around it by increasing the 
user's open file limit.
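
For anyone hitting the same limit, here is a minimal sketch of how the 
remaining file descriptor headroom could be checked from inside the JVM 
before starting a large merge. It assumes a HotSpot/OpenJDK JVM on a 
Unix-like system, where the com.sun.management extension bean is 
available; the class name FileDescriptorHeadroom is hypothetical and not 
part of PDFBox.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of PDFBox.
class FileDescriptorHeadroom {

    /**
     * Returns how many more file descriptors this process may open,
     * or -1 if the JVM does not expose that information
     * (for example on Windows or on a non-HotSpot JVM).
     */
    static long remaining() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            // Process limit minus descriptors currently in use.
            return unix.getMaxFileDescriptorCount()
                    - unix.getOpenFileDescriptorCount();
        }
        return -1;
    }
}
```

If the number of sources to merge is larger than the value returned, the 
merge is likely to fail in the way described above, so the request could 
be rejected or the limit raised beforehand.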

When does PDFBox write the first bytes to the merge output stream? 
Does it happen during the merge process, or only after the last source 
has been merged? In other words, does the requesting web client have to 
wait until all sources have been merged before the download starts?

Thanks for any information
Joern


Re: Merging multiple PDF into HTTP output stream

Posted by Jörn Haferstroh <ha...@gmx.de>.
Hi Maruan,

The temporary files are not a problem; I just wanted to know whether it 
is necessary to keep them open until the merge is finished. Your answer 
implies that they have to stay open, so be it.

Here is a distilled version of the code I am using:

private static void downloadMergedPDF(HttpServletResponse response,
     List<InputStream> documentList, String fileName)
         throws IOException, COSVisitorException {

     response.setContentType("application/pdf");
     response.setContentLength(-1);
     response.addHeader("Content-disposition",
             "attachment; filename=" + fileName);
     OutputStream output = response.getOutputStream();

     PDFMergerUtility merger = new PDFMergerUtility();
     for (InputStream document : documentList) {
         merger.addSource(document);
     }
     merger.setDestinationStream(output);
     merger.mergeDocuments();

     output.flush();
     output.close();
}

I would still like to know when the merger starts writing bytes to the 
output stream: during the merge, or only after the merge has finished? 
This matters to me for estimating how long the user has to wait for the 
download to begin.
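
One way to answer this empirically is to wrap the destination stream in 
a small filter that records when the first byte passes through. The 
following is a minimal sketch using only java.io; the class name 
FirstByteTimingStream and its method are hypothetical and not part of 
PDFBox.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper, not part of PDFBox: remembers when the first
// byte was written to the wrapped stream.
class FirstByteTimingStream extends FilterOutputStream {

    private final long createdNanos = System.nanoTime();
    private long firstWriteNanos = -1; // -1 means nothing written yet

    FirstByteTimingStream(OutputStream out) {
        super(out);
    }

    private void markFirstWrite() {
        if (firstWriteNanos < 0) {
            firstWriteNanos = System.nanoTime();
        }
    }

    @Override
    public void write(int b) throws IOException {
        markFirstWrite();
        out.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (len > 0) {
            markFirstWrite();
        }
        // Delegate the whole chunk instead of writing byte by byte.
        out.write(b, off, len);
    }

    /** Milliseconds from construction to the first byte, or -1 if none yet. */
    long millisUntilFirstByte() {
        return firstWriteNanos < 0
                ? -1
                : (firstWriteNanos - createdNanos) / 1000000L;
    }
}
```

In the method above, the wrapper would be created around 
response.getOutputStream() right before the sources are added and passed 
to merger.setDestinationStream(...). Logging millisUntilFirstByte() 
together with the total duration after mergeDocuments() returns shows 
whether output starts early or only near the end of the merge.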

Regards
Joern

On 07.12.2013 at 09:05, Maruan Sahyoun wrote:
> Hi Joern,
>
> You could do it completely in memory, but at the cost of memory consumption, as all files have to be kept until the merge finishes. So from my perspective, adjusting the open file limit is the better option.
>
> Maybe you can post a code snippet showing how you load the files and do the merging. There may be an easy way to improve that.
>
> BR
> Maruan Sahyoun


Re: Merging multiple PDF into HTTP output stream

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Joern,

You could do it completely in memory, but at the cost of memory consumption, as all files have to be kept until the merge finishes. So from my perspective, adjusting the open file limit is the better option.

Maybe you can post a code snippet showing how you load the files and do the merging. There may be an easy way to improve that.

BR
Maruan Sahyoun
