Posted to users@pdfbox.apache.org by "Stahle, Patrick" <pa...@te.com> on 2016/03/18 20:01:56 UTC

Strange performance problem with certain PDF files

Hi all,

I am running into a lot of strange performance issues with certain PDF files.

Background info:
The strange thing is that I can't reproduce this consistently across environments, although for a PDF generated in one particular environment the behaviour seems consistent. I do most of my development inside a VirtualBox virtual machine running Fedora. The PDF files I am having problems with never show performance issues when they sit on my virtual machine's local drive, but if I use a VirtualBox shared drive as the source / destination for the PDF, I see the problem. A co-worker working in a pure Windows environment also experiences the performance problem, and we are seeing the same issue on our dev Solaris servers.

The difference can be quite drastic. In my local environment, one of our 3D PDFs (12 MB) can be opened, stamped with some text, encrypted, and saved in around 8 seconds. Doing the same job against a VirtualBox shared drive or on our Solaris server takes minutes. In my co-worker's Windows environment it takes around 30 seconds. We have really only reproduced this consistently with the 12 MB 3D PDF. I have a much smaller PDF (non-3D, converted from MS Office) that shows a similar performance issue, but the times range from 200 ms locally to 8 seconds.

The one thing the two files have in common is that I see a lot of the following messages on the console.
Output from the 12 MB 3D PDF file:
:
:
1787 [main] DEBUG org.apache.pdfbox.pdfparser.PDFObjectStreamParser  - parsed=COSObject{13166, 0}

These messages seem to appear when the PDDocument is opened, and from what I can tell I get 13,166 of them for this example PDF.
The slowness does not start until the following line:
document.save(outputPDFStream);

With other PDFs, including some quite large ones, I see neither this performance issue nor those log messages.

I know this is not much to go on; I am working on narrowing this down to something more concrete and reproducible. But I thought I would send this out to see whether anyone has any ideas or has seen similar issues. Suggestions?

Thanks,
Patrick


RE: Strange performance problem with certain PDF files

Posted by "Stahle, Patrick" <pa...@te.com>.
Thanks Tilman, that makes sense.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Strange performance problem with certain PDF files

Posted by Tilman Hausherr <TH...@t-online.de>.
     public void save(File file) throws IOException
     {
         save(new BufferedOutputStream(new FileOutputStream(file)));
     }

so it is more efficient than

save(OutputStream output)

which just takes what it gets. See also 
https://issues.apache.org/jira/browse/PDFBOX-3121

Tilman
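
A minimal sketch of why the buffered overload can be cheaper, using only the JDK and no PDFBox classes. It counts how many write calls reach the underlying stream when many single bytes are written, with and without a BufferedOutputStream in between. The names WriteCountDemo and CountingOutputStream are hypothetical, and the assumption that the PDF writer emits many small writes is an illustration, not something confirmed in this thread.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Illustrative only: counts the write calls that reach the underlying stream. */
class WriteCountDemo {

    /** Hypothetical stand-in for a FileOutputStream; each call here models one trip to the OS. */
    static final class CountingOutputStream extends OutputStream {
        private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        int writeCalls = 0;

        @Override
        public void write(int b) {
            writeCalls++;
            sink.write(b);
        }

        @Override
        public void write(byte[] b, int off, int len) {
            writeCalls++;
            sink.write(b, off, len);
        }
    }

    /** Writes n single bytes (assumed write pattern) and returns the number of underlying write calls. */
    static int countUnderlyingWrites(int n, boolean buffered) {
        CountingOutputStream raw = new CountingOutputStream();
        OutputStream out = buffered ? new BufferedOutputStream(raw) : raw;
        try {
            for (int i = 0; i < n; i++) {
                out.write('x');
            }
            out.close(); // flushes whatever is still held in the buffer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.writeCalls;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("unbuffered write calls: " + countUnderlyingWrites(n, false));
        System.out.println("buffered write calls:   " + countUnderlyingWrites(n, true));
    }
}
```

If the caller has to use save(OutputStream), wrapping the FileOutputStream in a BufferedOutputStream before passing it in should give the same effect as the save(File) overload quoted above.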




RE: Strange performance problem with certain PDF files

Posted by "Stahle, Patrick" <pa...@te.com>.
Hi John / Tilman,

I have narrowed it down to how PDDocument.save() is called: the slowdown occurs when I pass in a FileOutputStream. If I pass in a java.io.File instead, the problem does not occur. Also, we have only been able to reproduce it with some larger PDF files, and it seems to happen only in certain environments. On my Linux virtual machine I have not been able to reproduce it at all; it does occur on Windows and on the Solaris server (3PAR drive cluster). I have some simple sample code that reproduces the problem, but I don't think I can send you the two PDF files I have at hand. One is a 3D PDF of ours (TE Classified) and the other, ironically, is the iText v1 manual in PDF form. The times are pretty drastic: on Windows, saving the 3D PDF takes about 3 seconds with the File overload vs. 29 seconds with the FileOutputStream. The iText manual is not as bad, at 2 vs. 20.

Anyway, we have a workaround: we changed our code to pass a java.io.File to PDFBox. If I can find a suitable PDF that reproduces the problem, I will send it your way.

Thanks,
Patrick
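
One way to check whether unbuffered writes alone account for such a gap on a given drive, without involving any PDF, is a small timing harness like the sketch below. It writes the same number of small chunks to a file once through a bare FileOutputStream and once through a BufferedOutputStream. The class name SaveTimingSketch, the file names, and the chunk count and size are all made up for illustration; they are not taken from the sample code mentioned above.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Illustrative only: times many small writes to a file, with and without buffering. */
class SaveTimingSketch {

    /** Writes `chunks` pieces of `chunkSize` bytes to `target` and returns the elapsed nanoseconds. */
    static long timeWrites(File target, int chunks, int chunkSize, boolean buffered) {
        byte[] chunk = new byte[chunkSize];
        long start = System.nanoTime();
        try (OutputStream raw = new FileOutputStream(target);
             OutputStream out = buffered ? new BufferedOutputStream(raw) : raw) {
            for (int i = 0; i < chunks; i++) {
                out.write(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Pass a directory on the drive under suspicion (for example a shared folder);
        // without an argument the JVM's temp directory is used.
        File dir = new File(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        File plain = new File(dir, "timing-unbuffered.bin");
        File wrapped = new File(dir, "timing-buffered.bin");
        try {
            long unbufferedNanos = timeWrites(plain, 200_000, 16, false);
            long bufferedNanos = timeWrites(wrapped, 200_000, 16, true);
            System.out.printf("unbuffered: %d ms, buffered: %d ms%n",
                    unbufferedNanos / 1_000_000, bufferedNanos / 1_000_000);
        } finally {
            plain.delete();
            wrapped.delete();
        }
    }
}
```

Running it once against a local directory and once against the shared or network drive keeps the environment as the only variable, which is the kind of isolation suggested earlier in the thread.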



Re: Strange performance problem with certain PDF files

Posted by John Hewson <jo...@jahewson.com>.
You need to isolate the problem; you’ve got too many variables to make any sense of it all. Get a reproducible problem on one non-virtualised JVM first.

— John
