Posted to users@pdfbox.apache.org by Reshmi Vikraman <re...@gmail.com> on 2022/10/13 08:52:58 UTC

PDFBox Performance Issue - Tagged pdfs

Hello All,

I have been using PDFBox 2.0.27 to generate accessible PDFs. The PDF
contains a table with several hundred rows and 6 columns, and each cell in
the table is added as marked content.

When the number of rows increases, I noticed a spike in response times.
Profiling the process showed that the main consumer of CPU time was the
call to begin marked content at
https://github.com/apache/pdfbox/blob/e72963ca5b283a87828ee731cd85c0b6baf1ff57/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDPageContentStream.java#L2302

Looking into this a bit further, the time is spent when the property list
is added to the resources object, where the code first checks whether the
property list already exists in the map before adding it. As the number of
properties in the map grows, this check takes more and more CPU time.
https://github.com/apache/pdfbox/blob/e72963ca5b283a87828ee731cd85c0b6baf1ff57/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java#L701
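
To illustrate why this check grows with the table, here is a minimal
sketch in plain Java. It is a model only, not PDFBox code: it assumes the
existence check is a linear scan over the values already in the map, and
the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a "check by value, then add" resource map.
final class PropertyLookupCost {

    /** Counts the value comparisons made while adding n distinct values. */
    static long comparisonsWithCheck(int n) {
        Map<String, Object> props = new LinkedHashMap<>();
        long comparisons = 0;
        for (int i = 0; i < n; i++) {
            Object value = new Object(); // each cell gets a fresh property list
            boolean found = false;
            // Assumed check: scan every existing value; a miss visits them all.
            for (Object existing : props.values()) {
                comparisons++;
                if (existing == value) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                props.put("Prop" + (i + 1), value);
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        for (int n : new int[] {600, 6000}) {
            System.out.println(n + " cells: " + comparisonsWithCheck(n) + " comparisons");
        }
    }
}
```

Under that assumption, adding n new values costs n(n-1)/2 comparisons in
total, so ten times as many cells means roughly a hundred times as much
work in the check.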

I also noticed that 2.0.27 updated the resources object to use a
LinkedHashMap instead of a SmallMap, which has greatly improved performance
in this area in our case. However, we are still looking to reduce response
times further.

Given that we are adding marked content as table cells, we need a new
property list for every marked content, so in our case it feels like the
check to verify whether the property list already exists is offering very
little value given the resources it consumes.

To work around this, I was looking at ways of bypassing this check and I
noticed there was a deprecated method
https://github.com/apache/pdfbox/blob/e72963ca5b283a87828ee731cd85c0b6baf1ff57/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDPageContentStream.java#L2287
which allowed us to pass in a COSName and manually add the property list to
the resources tree.
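
The idea behind this workaround can be sketched in plain Java as follows.
This is a model only, not PDFBox code: the class name, the "MC" prefix and
the counter are all hypothetical, and it assumes the caller can guarantee
that every generated name is unique.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of registering property lists under caller-chosen names.
final class UncheckedPropertyRegistry {
    private final Map<String, Object> props = new LinkedHashMap<>();
    private int nextId = 0;

    /** Registers a value under a fresh name without searching existing values. */
    String register(Object propertyList) {
        String name = "MC" + nextId++;   // unique because the counter only grows
        props.put(name, propertyList);   // a single hash insert, no scan
        return name;
    }

    int size() {
        return props.size();
    }

    public static void main(String[] args) {
        UncheckedPropertyRegistry registry = new UncheckedPropertyRegistry();
        for (int cell = 0; cell < 600; cell++) {
            registry.register(new Object());
        }
        System.out.println(registry.size() + " entries registered");
    }
}
```

The trade-off is that duplicates are no longer detected: registering the
same object twice produces two entries, which is acceptable here only
because every cell has its own property list anyway.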

I was wondering what the best way forward is for us in this case:
1. Request that the beginMarkedContentSequence method be un-deprecated, and
that some methods be exposed in PDResources to simplify adding resources to
the map, or
2. Request a new overloaded beginMarkedContent method that allows us to
pass in, for example, a boolean flag that skips the check for whether a
property list already exists in the resource tree.

As we work with stringent data controls, I am unable to share the PDF or
profiling details, so if you require any further details then please let me
know. Alternatively, I can raise a ticket on the JIRA tracker, but I just
wanted to check here first before doing that.

Many thanks for your help.