Posted to users@pdfbox.apache.org by Matthias Pigulla <mp...@webfactory.de> on 2023/02/08 14:59:44 UTC

Overlay performance on huge (?) PDF files

Dear PDFBox users,

I am using PDFBox to place overlays on lots of different input files. In general, this works very well and reliably – thanks to everyone who worked on that!

However, there is one class of particularly awful input files, like the one at https://www.g-ba.de/downloads/40-268-7473/2021-03-18_ASV-RL_Anpassung-Appendizes-an-EBM_TrG.pdf.

That’s a more than 50 MB, 2000+ pages beast full of complex tables with lots of cells.

When I try to put a single-page PDF as an overlay on it with PDFBox 2.0.27, I have to start the JVM with e.g. 8 GB of heap memory, and it maxes out a CPU core on my machine for about 6 minutes. The maximum resident set size as reported by `time` is in the range of 2.4 GB. The result file is about four times the size of the input file.

With a snapshot build of 3.0.0, the max RSS does not seem to go above 1 GB, but processing had not finished after 15 minutes, at which point I aborted it. Regarding 3.0.0, I have seen the remarks at https://pdfbox.apache.org/3.0/migration.html#reduced-memory-usage, so I thought it might be worth a try. Probably the overlay will end up traversing all pages anyway, so that may not make a big difference.

My questions are:

- Is there anything I can do to make processing of such files faster or more efficient?
- What may be the reasons for the increase in output file size and can I do anything about it?

Thanks!
-mp.


AW: Overlay performance on huge (?) PDF files

Posted by Matthias Pigulla <mp...@webfactory.de>.
Thank you for the helpful answers and sorry for my late reply.

> Are you adding that overlay to every single page of that pdf? What is the
> purpose of that overlay? Maybe a rubber stamp is a better approach?

I was not aware of rubber stamps, to be honest. The requirement is to place identical remarks (like “document no longer valid”) on all pages, and it should not be (easily) possible for users to circumvent those when printing the document. Currently, the necessary overlays are generated in a separate step from FO with Apache FOP, so I have them as PDF files.
I will have to figure out whether this is possible with rubber stamps and how (or whether) I could create the necessary input objects from the same source.

> With regard to 3.0.0 you might have a look at the kind of input source. Have a look […]
… and …
> 3.0.0 creates such compressed object streams by default, so that the result
> size should be similar to the input size

Thank you for those suggestions, I will look into this.
-mp.


Re: Overlay performance on huge (?) PDF files

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 08.02.23 um 15:59 schrieb Matthias Pigulla:
> Dear PDFBox users,
> 
> I am using PDFBox to place overlays on lots of different input files. In general, this works very well and reliably – thank to everyone who worked for that!
> 
> However, there is one class of particularly awful input files, like the one at https://www.g-ba.de/downloads/40-268-7473/2021-03-18_ASV-RL_Anpassung-Appendizes-an-EBM_TrG.pdf.
> 
> That’s a more than 50 MB, 2000+ pages beast full of complex tables with lots of cells.
Are you adding that overlay to every single page of that PDF? What is the 
purpose of that overlay? Maybe a rubber stamp is a better approach?


> When I try to put a single-page PDF as overlay on it with PDFBox 2.0.27, I have to start the JVM with e. g. 8GB of heap memory, and it maxes out a CPU core on my machine for about 6 minutes. The maximum resident set size as reported by `time` is in the range of 2.4 GB. The result file is about four times the size of the input file.
> 
> With a snapshot build of 3.0.0, the max RSS seems not to go above 1 GB, but processing is not finished within 15 minutes (when I aborted). Regarding 3.0.0, I have seen the remarks at https://pdfbox.apache.org/3.0/migration.html#reduced-memory-usage, so I thought it might be worth a try. Probably the overlay will end up traversing all pages anyway, so that may not make a big difference.
> 
If you are adding the overlay to all pages, the parser more or less has to dig 
through the whole PDF.

> My questions are:
> 
> - Is there anything I can do to make processing of such files faster or more efficient?
Maybe it is a better approach to use a rubber stamp instead of an overlay. With 
regard to 3.0.0, you might have a look at the kind of input source. Have a look 
at the different implementations of org.apache.pdfbox.io.RandomAccessRead. The 
migration guide might give you some additional hints about the usage of the 
input source.

> - What may be the reasons for the increase in output file size and can I do anything about it?
I guess your input files are using compressed object streams. 2.0.x doesn't 
support the creation of those streams, so they are decompressed when adding 
the overlay. 3.0.0 creates such compressed object streams by default, so the 
result size should be similar to the input size.
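
To get a feeling for how much Flate compression can matter for many small, similar objects, here is a rough sketch. It does not use PDFBox at all, only java.util.zip, and the generated dictionaries are made-up stand-ins (the class name, the method names and the dictionary contents are assumptions for illustration, not taken from the file discussed in this thread). It simply compares the raw and deflated size of a large number of PDF-like dictionaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class ObjectStreamSizeDemo {

    // Build synthetic text resembling many small PDF dictionaries.
    // Hypothetical content, only meant to mimic highly repetitive objects.
    static byte[] syntheticObjects(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("<< /Type /StructElem /S /TD /P ")
              .append(i)
              .append(" 0 R /Pg ")
              .append(i % 2000)
              .append(" 0 R /K ")
              .append(i % 7)
              .append(" >>\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Flate-compress the bytes and return only the compressed length.
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] raw = syntheticObjects(100_000);
        int packed = deflatedSize(raw);
        System.out.printf("raw: %d bytes, deflated: %d bytes, ratio: %.1f%n",
                raw.length, packed, (double) raw.length / packed);
    }
}
```

The ratio printed for this synthetic data says nothing precise about a real document; it only shows why writing the same objects without compression can inflate the output noticeably.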

Andreas
> 
> Thanks!
> -mp.
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


AW: Overlay performance on huge (?) PDF files

Posted by Matthias Pigulla <mp...@webfactory.de>.
> In "show internal structure" mode I can show the tree, but if I click to
> expand "root" nothing happens anymore. It's either frozen or busy.
I have observed similar problems (hangs or even crashes) with other software when trying to analyze or optimize these documents. I guess this backs the hypothesis that something in the process of how these documents are being generated (a lot of people and tools involved) causes some kind of internal file corruption.
Do you have any suggestions on what to look out for?
Thank you!
-mp.

Re: Overlay performance on huge (?) PDF files

Posted by Tilman Hausherr <TH...@t-online.de>.
On 08.02.2023 15:59, Matthias Pigulla wrote:
> However, there is one class of particularly awful input files, like the one at https://www.g-ba.de/downloads/40-268-7473/2021-03-18_ASV-RL_Anpassung-Appendizes-an-EBM_TrG.pdf.
>
> That’s a more than 50 MB, 2000+ pages beast full of complex tables with lots of cells.

According to the log of PDFDebugger, it takes 30 seconds to load the PDF, 
but the real time is longer, probably because of the time needed to create 
the tree in PDFDebugger.

In "show internal structure" mode I can show the tree, but if I click to 
expand "root" nothing happens anymore. It's either frozen or busy.

Tilman

