
Posted to users@pdfbox.apache.org by Thorsten Schöning <ts...@am-soft.de> on 2017/11/08 11:38:18 UTC

Large memory footprint and long processing time for one page PDF

Hi all,

I'm seeing strange printing behaviour with Apache PDFBox and a PDF
containing only one page. When printing a completely different PDF
containing many more pages and much more text, I don't see that
behaviour.

The problem with that one particular PDF is that I'm not allowed to
share it publicly, so I would like to know 1. whether you think this is
a problem worth looking at and 2. whether someone is able to receive my
PDF and handle it reasonably confidentially, as has been suggested for
other bugs already[1]. I don't need an NDA or anything like that; the
file should just be deleted once it's most likely no longer needed. The
content is not even particularly sensitive.

The problem is that printing the file using PDFBox 2.0.3 results in
the Java process consuming around 3 GB of memory, with a processing
time of around 55 seconds. Using the newest PDFBox 2.0.8 instead,
memory consumption drops a bit to around 2.7 GB and processing time to
around 35 seconds. For comparison, printing other PDFs with e.g. 10
pages of text takes around 3 seconds with a memory footprint of about
215 MB.

Printing the problematic PDF with other applications like PDFPrint[2]
causes no problem at all, even if that app is configured to render an
image to print as well. Processing time is around 2 seconds and memory
footprint is maybe 60 MB. So in the end, I simply find the numbers for
PDFBox and that particular PDF unexpectedly high.

The PDF is created automatically from an RTF template: some app adds
pieces of information to the RTF template file and converts it to PDF
using some arbitrary PDF printer in Windows. The printing application
is MS Word 2010 or similar, which shouldn't matter much. The PDF looks
fine and opens OK in Adobe Reader, SumatraPDF and others, and can be
printed from there manually without the high numbers PDFBox is giving.

The command line used to print is the following:

> java -jar "C:\Users\[...]\pdfprint.jar" PrintPDF -silentPrint "C:\Users\[...]\0001-print5B7A1242.pdf"

I don't think that the problem is related to the version of Java used,
because I noticed the same behaviour almost a year ago with a different
Java version as well:

> C:\Users\[...]>java -version
> java version "1.8.0_152"
> Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
> Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)

So, is there any interest in having a more detailed look at the PDF?
Should I file a bug instead?

Thanks!

[1]: https://issues.apache.org/jira/browse/PDFBOX-3729?focusedCommentId=15945755&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15945755
[2]: http://www.verypdf.com/app/pdf-print-cmd/

Kind regards,

Thorsten Schöning

--
Thorsten Schöning E-Mail: Thorsten.Schoening@AM-SoFT.de
AM-SoFT IT-Systeme http://www.AM-SoFT.de/

Phone.............05151- 9468- 55
Fax...............05151- 9468- 88
Mobile.............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Managing Director: Andreas Muchow

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Large memory footprint and long processing time for one page PDF

Posted by Tilman Hausherr <TH...@t-online.de>.

Yes, I see it now: PrintPDF uses PDFPageable, which creates a
PDFPrintable, which uses Scaling.ACTUAL_SIZE and doesn't have a
parameter for that. Maybe that should be extended with an extra
parameter, which could also be exposed as an option in the command line
utility.

Tilman

On 09.11.2017 at 11:31, Thorsten Schöning wrote:
>
>>> BTW 2, it seems to me that the command line app is printing using
>>> Scaling.ACTUAL_SIZE while in your bug SHRINK_TO_SIZE is preferred.
> [...]
>> Sorry I don't understand.
> There's the following sentence in one of the comments in your provided
> bug report:
>
>> ACTUAL_SIZE usually clips unless you have a printer that has a
>> printable area equal to the full page size of the PDF (i.e. to print
>> A4 you'll need an A4+ printer). So yes, SHRINK_TO_SIZE is usually
>> what you want.
> So I would have expected that printing at the command line would
> default to SHRINK_TO_SIZE as well. But looking at the code, I have the
> feeling that this is not the case.
>
> Just wanted to mention it, might not be a problem and my print results
> are perfectly fine, not missing any content or such.
>
> Kind regards,
>
> Thorsten Schöning
>



Re: Large memory footprint and long processing time for one page PDF

Posted by Thorsten Schöning <ts...@am-soft.de>.

Hello Tilman Hausherr,
on Wednesday, 8 November 2017 at 20:12 you wrote:

> Well, you could also have run it without parameters to get the usage.

Simply didn't think of that when I had the docs in front of me. :-)

>> BTW 2, it seems to me that the command line app is printing using
>> Scaling.ACTUAL_SIZE while in your bug SHRINK_TO_SIZE is preferred.
[...]
> Sorry I don't understand.

There's the following sentence in one of the comments in your provided
bug report:

> ACTUAL_SIZE usually clips unless you have a printer that has a
> printable area equal to the full page size of the PDF (i.e. to print
> A4 you'll need an A4+ printer). So yes, SHRINK_TO_SIZE is usually
> what you want.

So I would have expected that printing at the command line would
default to SHRINK_TO_SIZE as well. But looking at the code, I have the
feeling that this is not the case.
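
To make the difference between the two modes concrete, here is a minimal sketch in plain Java. It is not PDFBox code: the `Mode` enum, the method names and the 18 pt margins are assumptions made up for illustration, and the real PDFPrintable logic handles more cases (rotation, centering, further scaling modes). Note that in the PDFBox 2.0 API the enum constant is, as far as I can tell, `Scaling.SHRINK_TO_FIT`; the thread refers to it as SHRINK_TO_SIZE.

```java
/** Simplified illustration of the scaling modes discussed above. Not PDFBox code. */
final class ScalingSketch {

    /** Hypothetical stand-in for the two modes named in this thread. */
    enum Mode { ACTUAL_SIZE, SHRINK_TO_FIT }

    /** Scale factor for a page of pageW x pageH when the printable area is areaW x areaH. */
    static double scale(Mode mode, double pageW, double pageH, double areaW, double areaH) {
        if (mode == Mode.ACTUAL_SIZE) {
            return 1.0; // never scaled, so anything outside the printable area is cut off
        }
        double fit = Math.min(areaW / pageW, areaH / pageH);
        return Math.min(1.0, fit); // shrink if needed, never enlarge
    }

    /** True if the scaled page does not fit into the printable area. */
    static boolean clips(Mode mode, double pageW, double pageH, double areaW, double areaH) {
        double s = scale(mode, pageW, pageH, areaW, areaH);
        // Small epsilon guards against floating-point rounding at exact fits.
        return pageW * s > areaW + 1e-9 || pageH * s > areaH + 1e-9;
    }

    public static void main(String[] args) {
        // A4 page (595 x 842 pt) on a printer with assumed 18 pt margins on every side.
        double areaW = 595 - 36;
        double areaH = 842 - 36;
        for (Mode m : Mode.values()) {
            System.out.println(m + ": scale=" + scale(m, 595, 842, areaW, areaH)
                    + ", clips=" + clips(m, 595, 842, areaW, areaH));
        }
    }
}
```

With any non-zero printer margin, ACTUAL_SIZE reports clipping for a full A4 page, which matches the quoted remark that you would need an "A4+" printer. Whether the clipped strip contains visible content depends on the document, which would explain print results that still look fine.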

Just wanted to mention it, might not be a problem and my print results
are perfectly fine, not missing any content or such.

Kind regards,

Thorsten Schöning




Re: Large memory footprint and long processing time for one page PDF

Posted by Tilman Hausherr <TH...@t-online.de>.

On 08.11.2017 at 19:25, Thorsten Schöning wrote:
> Hello Tilman Hausherr,
> on Wednesday, 8 November 2017 at 17:26 you wrote:
>
>> This is a known problem sometimes and there's a workaround (use a fixed dpi)
>> https://issues.apache.org/jira/browse/PDFBOX-3046
> Thanks, that was exactly what I was seeing. Printing using the "-dpi"
> argument with "300" on the shell dropped memory footprint to around
> 400 MB and processed the job in around 2 seconds.
>
>> java -jar "pdfprint.jar" PrintPDF -silentPrint "C:\[...]\0001-print5B7A1242.pdf" -dpi "300"
> Is there an easy way I could detect those PDFs before actually
> printing them to be able to provide "-dpi" only if necessary? Memory

Sadly, no.

> footprint without "-dpi 300" where not necessary is better for my test
> doc with 10 pages, around 225 to 600 MB. Seems to scale by page and my
> customer sometimes needs to print docs with 50 and more pages. Not
> that important, though, as most of his users have at least 8 GB of
> RAM.
>
> BTW, it seems that not all supported args by the command line app are
> documented yet. I only found out because I had a look at the source.
>
> https://pdfbox.apache.org/2.0/commandline.html#printpdf
> https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/PrintPDF.java#L38

Well, you could also have run it without parameters to get the usage.


@Maruan can you add it?

-border  Print with border
-dpi     Render into intermediate image with specific dpi and then print
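
For anyone calling the tool from another Java program, the following is a minimal sketch of assembling that command line with the options mentioned in this thread (-silentPrint, -border, -dpi). The class name, the method and the file names "pdfprint.jar" and "example.pdf" are placeholders, not part of PDFBox, and the argument order simply mirrors the command quoted in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds a PrintPDF command line like the one quoted in this thread. */
final class PrintPdfCommand {

    /**
     * @param jarPath path to the PDFBox app jar (placeholder name)
     * @param pdfPath PDF file to print
     * @param dpi     fixed rendering dpi, or 0 to leave "-dpi" out
     * @param border  whether to pass "-border"
     */
    static List<String> build(String jarPath, String pdfPath, int dpi, boolean border) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("PrintPDF");
        cmd.add("-silentPrint");
        if (border) {
            cmd.add("-border");
        }
        cmd.add(pdfPath);
        if (dpi > 0) {
            cmd.add("-dpi");
            cmd.add(Integer.toString(dpi));
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdfprint.jar", "example.pdf", 300, false);
        System.out.println(String.join(" ", cmd));
        // To actually run it: new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.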



>
> BTW 2, it seems to me that the command line app is printing using
> Scaling.ACTUAL_SIZE while in your bug SHRINK_TO_SIZE is preferred.
>
> https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/PrintPDF.java#L162
> https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/printing/PDFPageable.java#L165
> https://issues.apache.org/jira/browse/PDFBOX-3046?focusedCommentId=14980954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14980954

Sorry I don't understand.

>
>> Did you set the option mentioned here?
>> https://pdfbox.apache.org/2.0/getting-started.html
> Tried that as well just to be sure, but didn't change a thing. A
> combination of those and "-dpi" seems to even be a bit slower than
> "-dpi" alone, something around half a second. So "-dpi" seems to be
> the proper solution for me as well.

The next version will have it... you'll see the differences if CMYK or
ICC color spaces are used.

Tilman


>
> Kind regards,
>
> Thorsten Schöning
>

Re: Large memory footprint and long processing time for one page PDF

Posted by Thorsten Schöning <ts...@am-soft.de>.

Hello Tilman Hausherr,
on Wednesday, 8 November 2017 at 17:26 you wrote:

> This is a known problem sometimes and there's a workaround (use a fixed dpi)
> https://issues.apache.org/jira/browse/PDFBOX-3046

Thanks, that was exactly what I was seeing. Printing using the "-dpi"
argument with "300" on the shell dropped memory footprint to around
400 MB and processed the job in around 2 seconds.

> java -jar "pdfprint.jar" PrintPDF -silentPrint "C:\[...]\0001-print5B7A1242.pdf" -dpi "300"
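
As a rough plausibility check on why a fixed dpi can bound memory, here is a back-of-the-envelope sketch of the size of one uncompressed page raster. It rests on assumptions the thread does not confirm: that the footprint is dominated by a single intermediate image, that the page is A4 (about 595 x 842 points), and that each pixel takes 4 bytes. The class and method names are made up for illustration.

```java
/** Back-of-the-envelope size of one uncompressed page raster at a given dpi. */
final class RasterEstimate {

    /** PDF user space uses 72 points per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Bytes needed for a widthPt x heightPt page rendered at the given dpi. */
    static long estimateBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        long widthPx = Math.round(widthPt * dpi / POINTS_PER_INCH);
        long heightPx = Math.round(heightPt * dpi / POINTS_PER_INCH);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed A4 page and 4 bytes per pixel (e.g. an ARGB raster).
        for (int dpi : new int[] {72, 300, 600, 1200}) {
            long bytes = estimateBytes(595, 842, dpi, 4);
            System.out.printf("%4d dpi: %.1f MB%n", dpi, bytes / 1_000_000.0);
        }
    }
}
```

Under these assumptions a single page costs roughly 35 MB at 300 dpi and more than 500 MB at 1200 dpi, since the size grows with the square of the resolution. The real footprint also includes whatever the renderer and the printer driver allocate on top of that.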

Is there an easy way I could detect those PDFs before actually
printing them to be able to provide "-dpi" only if necessary? Memory
footprint without "-dpi 300" where not necessary is better for my test
doc with 10 pages, around 225 to 600 MB. Seems to scale by page and my
customer sometimes needs to print docs with 50 and more pages. Not
that important, though, as most of his users have at least 8 GB of
RAM.

BTW, it seems that not all supported args by the command line app are
documented yet. I only found out because I had a look at the source.

https://pdfbox.apache.org/2.0/commandline.html#printpdf
https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/PrintPDF.java#L38

BTW 2, it seems to me that the command line app is printing using
Scaling.ACTUAL_SIZE while in your bug SHRINK_TO_SIZE is preferred.

https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/PrintPDF.java#L162
https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/printing/PDFPageable.java#L165
https://issues.apache.org/jira/browse/PDFBOX-3046?focusedCommentId=14980954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14980954

> Did you set the option mentioned here?
> https://pdfbox.apache.org/2.0/getting-started.html

Tried that as well just to be sure, but didn't change a thing. A
combination of those and "-dpi" seems to even be a bit slower than
"-dpi" alone, something around half a second. So "-dpi" seems to be
the proper solution for me as well.

Kind regards,

Thorsten Schöning




Re: Large memory footprint and long processing time for one page PDF

Posted by Tilman Hausherr <TH...@t-online.de>.

Hello Thorsten,

This is a known problem that shows up sometimes, and there's a workaround (use a fixed dpi):
https://issues.apache.org/jira/browse/PDFBOX-3046

Did you set the option mentioned here?
https://pdfbox.apache.org/2.0/getting-started.html

You can send me the file to   tilman  at  snafu  dot  de   and I'll 
treat it with the confidentiality you want and delete it when done or 
when you want it.

Tilman



