Posted to users@pdfbox.apache.org by Gunnar Brand <Gu...@interface-projects.de> on 2021/06/16 18:23:23 UTC

PDFBox rendering performance issues

Hi.

I am using PDFBox to render PDF files into images. There is one file that I use as a benchmark for every PDF library, and PDFBox has some problems with it (please note that almost all third-party PDF engines have issues with this file):

https://archive.org/details/AlfaWaffenkatalog1911

Good news: PDFBox renders the file perfectly!
Bad news: It takes forever to do so (16 seconds for the first page in PDFDebugger on my machine).
I asked myself why that is, identified and "fixed" several things, and got the time down to 6 seconds.
I started fixing these issues earlier this year, but I can't work on it all the time. (I noticed PDFBOX-5145, which was a good start but misses some things.)

The problem lies in the optimized nature of this file: it stores the white of the background, the black of the text, an image mask for the text, and the drawings separately. This is nothing new; I have a scan of a very old magazine that was optimized from 90 MB to 9 MB in a similar way (but with slight differences, so it loads in a second).

What you have is basically a low-res picture of white soup, a low-res picture of black soup, a very, very high-res single-bit image mask (say 10000*10000 pixels), and a bunch of normal-res images for the drawings.

The difference from the fast PDF is that the image mask is applied to the black soup image as a mask (the fast PDF renders it directly) and that the image mask is stored as JBIG2 instead of CCITTFax.
Since this happens without the final target image resolution in mind, apply mask works on the full 10000*10000 pixels.
(Memory requirements: 12 MB for the bitmask; 100 MB for the 8-bit mask, since luckily single-bit masks get expanded to only 8 bit while anything else turns into RGB; 400 MB for the picture; and one extra 400 MB since there is a pointless in-between image.)
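
As a rough check of these figures, here is a minimal sketch that computes the buffer sizes. It assumes a 10000*10000 mask, rows padded to whole bytes, and 4 bytes per pixel for the RGB picture. The class and method names are made up for illustration and are not part of PDFBox.

```java
final class MaskMemoryEstimate {

    /** Size in bytes of an uncompressed raster whose rows are padded to whole bytes. */
    static long bytesFor(long width, long height, int bitsPerPixel) {
        long bytesPerRow = (width * bitsPerPixel + 7) / 8;
        return bytesPerRow * height;
    }

    public static void main(String[] args) {
        long w = 10_000; // assumed mask width, as in the example above
        long h = 10_000; // assumed mask height

        System.out.printf("1-bit mask:     %d MB%n", bytesFor(w, h, 1) / 1_000_000);  // 12 MB
        System.out.printf("8-bit mask:     %d MB%n", bytesFor(w, h, 8) / 1_000_000);  // 100 MB
        System.out.printf("32-bit picture: %d MB%n", bytesFor(w, h, 32) / 1_000_000); // 400 MB
    }
}
```

With the extra in-between image, that adds up to roughly 900 MB for a single page.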

Things seen in apply mask:

  *   Scaling the image to the mask is very, very slow if you have a 10x scaling factor on each axis, a large target, and bicubic interpolation. Bilinear should be used in these cases (I used an area enlargement of 16 as the threshold, but it should probably also take the absolute number of pixels into account). This is a major performance gain (about 2 seconds instead of many more). Nearest neighbor is even faster (practically no time) but of course not an option.
  *   There is some wasteful image allocation happening (400 MB).
  *   The PDFBOX-5145 bulk copy works in a roundabout way that slows it down.
  *   It's possible to use direct alpha copying, which is even faster (optional).
  *   The softmask code could use integer math, which is twice as fast as float with negligible error (0.001%) (this is a bonus optimization).
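
Two of the points above can be sketched in a few lines. This is only an illustration, not the actual patch: the class and method names are hypothetical, the threshold comparison (>= 16) is an assumption, and the integer formula is the well-known rounding trick for dividing by 255, which may differ from the integer math meant above.

```java
import java.awt.RenderingHints;

final class MaskScalingSketch {

    /** Area enlargement at which bicubic is dropped in favour of bilinear (value from the list above). */
    static final double AREA_THRESHOLD = 16.0;

    /** Picks an interpolation hint from the ratio of target area to source area. */
    static Object chooseInterpolation(int srcW, int srcH, int dstW, int dstH) {
        double areaFactor = ((double) dstW * dstH) / ((double) srcW * srcH);
        return areaFactor >= AREA_THRESHOLD
                ? RenderingHints.VALUE_INTERPOLATION_BILINEAR
                : RenderingHints.VALUE_INTERPOLATION_BICUBIC;
    }

    /**
     * Integer-only equivalent of Math.round(alpha * mask / 255.0)
     * for alpha and mask in the range 0..255.
     */
    static int mulAlpha(int alpha, int mask) {
        int t = alpha * mask + 128;
        return (t + (t >> 8)) >> 8;
    }

    public static void main(String[] args) {
        // 10x per axis is a 100x area enlargement, so bilinear is chosen.
        boolean bilinear = chooseInterpolation(1000, 1000, 10_000, 10_000)
                == RenderingHints.VALUE_INTERPOLATION_BILINEAR;
        System.out.println("bilinear for 10x per axis: " + bilinear);
        System.out.println("mulAlpha(255, 128) = " + mulAlpha(255, 128));
    }
}
```

The returned hint would be passed to Graphics2D.setRenderingHint(RenderingHints.KEY_INTERPOLATION, ...) before drawing the scaled image.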

With this alone I shaved off almost half the time. I also looked at the mask reading part:

  *   from1bit() could be optimized a bit (it also fails to issue a warning and break out of the loop if subsampling is enabled).
  *   Reading the JBIG2 image in the JBIG2 library is very slow.


I understand that JBIG2 is far more complex than CCITTFax, but careful investigation showed that of 2 seconds, 0.5 were spent decoding the image itself (this number can be lower or higher depending on page complexity) and 1.5 converting the bitmap into a BufferedImage. I optimized those 1.5 seconds down to a few milliseconds.
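
One way such a conversion can be made cheap is to copy the packed bits straight into the backing array of a 1-bit BufferedImage instead of setting pixels one by one. The sketch below is not the actual change to the JBIG2 library. It assumes the decoder delivers rows packed MSB-first and padded to whole bytes, with 1 meaning black, and the class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.awt.image.IndexColorModel;

final class PackedBitmapToImage {

    /** Wraps a packed 1-bit bitmap (assumed: MSB first, byte-padded rows, 1 = black) in a BufferedImage. */
    static BufferedImage toImage(byte[] packed, int width, int height) {
        int stride = (width + 7) / 8; // bytes per row
        int needed = Math.multiplyExact(stride, height);
        if (packed.length < needed) {
            throw new IllegalArgumentException("bitmap too short: " + packed.length + " < " + needed);
        }

        // Palette: index 0 = white, index 1 = black.
        byte[] levels = {(byte) 0xFF, 0x00};
        IndexColorModel palette = new IndexColorModel(1, 2, levels, levels, levels);

        // TYPE_BYTE_BINARY with a two-entry palette stores one bit per pixel,
        // using the same MSB-first, byte-padded row layout as the input.
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY, palette);
        byte[] target = ((DataBufferByte) image.getRaster().getDataBuffer()).getData();

        // A single bulk copy replaces the per-pixel loop.
        System.arraycopy(packed, 0, target, 0, needed);
        return image;
    }

    public static void main(String[] args) {
        // 10x2 bitmap: pixels (0,0) and (9,0) are black, everything else is white.
        byte[] packed = {(byte) 0x80, 0x40, 0x00, 0x00};
        BufferedImage image = toImage(packed, 10, 2);
        System.out.printf("(0,0)=%08X (1,0)=%08X%n", image.getRGB(0, 0), image.getRGB(1, 0));
    }
}
```

If the decoder's row stride or bit order differs from these assumptions, a per-row copy or a bit reversal would be needed instead of the single arraycopy.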

If you are interested in any of this, I can clone the git repo and "implement" my changes there, so that you can pull whatever seems worthwhile back into the main repo.

(What I can already say is that it's probably not going to be 100% compliant with the formatting style: no leading tabs is one thing, but I can't guarantee the whitespace around curly-bracket lines or the absence of single-line if statements.)

Gunnar


Re: PDFBox rendering performance issues

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

We're always interested in optimizations. Please do one thing at a time;
I have regression tests that render over 1000 PDF files, so we find
problems early and can "blame" them on a specific optimization. Submit
your changes as .diff / .patch files. You can also do PRs. Re the code
formatting, I can do this automatically; however, it is important that
your change doesn't modify existing code, so we can see what change you made.

Tilman


