Posted to users@pdfbox.apache.org by bnncdv <bn...@gmail.com> on 2023/08/28 11:30:52 UTC

RandomAccessReadBuffer performance issues with inputStreams in 3.0

When migrating from 2.0 to 3.0 I noticed some operations were very slow,
mainly the Splitter tool. With a big-ish file it would take *a lot* more
memory/cpu (jdk8).

I believe the culprit is RandomAccessReadBuffer with InputStreams. It fully
reads the stream in 4KB chunks (not a problem in itself), however every time
createView(..) is called (on every PDPage access, I think) it calls a cloning
RARB constructor, and all of its ByteBuffer chunks are duplicate()'d, which for
bigger files with many pages means *tons* of wasted objects and calls (even
if the underlying buffer is the same). Simplifying that, for example by
reusing the parent bufferList rather than duplicating it, gives the expected
cpu/memory (I don't know the implications, though).
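
Roughly what I mean, as a simplified illustration (this is not the actual
PDFBox code; the class and method names below are made up for the example):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the pattern described above, not PDFBox source.
class ChunkedBuffer {
    final List<ByteBuffer> chunks;   // e.g. 4KB chunks read from the stream

    ChunkedBuffer(List<ByteBuffer> chunks) {
        this.chunks = chunks;
    }

    // Behaviour as I understand it: every view duplicates every chunk, so a
    // 100MB file (25,600 chunks) times 300 page accesses allocates millions
    // of short-lived ByteBuffer objects.
    ChunkedBuffer copyingView() {
        List<ByteBuffer> copy = new ArrayList<>(chunks.size());
        for (ByteBuffer chunk : chunks) {
            copy.add(chunk.duplicate());   // new object per chunk, same backing array
        }
        return new ChunkedBuffer(copy);
    }

    // The simplification I tried: share the parent list. No per-view
    // allocations, but views then share the chunks' position/limit state, so
    // each view must track its own read position (those are the implications
    // I'm unsure about).
    ChunkedBuffer sharedView() {
        return new ChunkedBuffer(chunks);
    }
}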

From simple observations Splitter seems to take x4 more cpu/heap. For
example I'd assume with a 100MB file of 300 pages (normal enough if you
deal with scanned docs) + inputstream: 100MB = 25600 chunks of 4KB * 300
pages = 7680000 objects created+gc'd in a short time, at least.

With smaller files (few pages) this isn't very noticeable, nor with
RandomAccessReadBufferedFile (different handling). Passing a pre-read
byte[] to RandomAccessReadBuffer works fine (minimal duplication).
RandomAccess.createBuffer(inputStream) in alpha3 was also fine but was removed
in beta1. Either way, I don't think the code should be copying/duplicating so
much and could be restructured, especially since the migration guide hints at
using RandomAccessReadBuffer for InputStreams.
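
For reference, the byte[] workaround I mean looks roughly like this (just a
sketch; it pre-reads the whole file myself and uses the
RandomAccessReadBuffer(byte[]) constructor instead of the InputStream one):

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdmodel.PDDocument;

public class ByteArrayLoad {
    public static void main(String[] args) throws Exception {
        // Read everything up front and hand PDFBox a single byte[] instead of
        // letting RandomAccessReadBuffer slice the InputStream into 4KB chunks.
        byte[] bytes = Files.readAllBytes(Paths.get("/tests/big.pdf"));
        try (PDDocument doc = Loader.loadPDF(new RandomAccessReadBuffer(bytes))) {
            // ... split / process as usual ...
        }
    }
}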

Also, for RARB it'd make more sense to read chunks as needed in read()
rather than all at once in the constructor, I think (faster metadata
querying). Incidentally, it may be useful to increase the default chunk size
(or allow users to set it) to reduce fragmentation, since it's going to
read the whole thing anyway and PDFs < 4KB aren't that common, I'd say.

(I don't have a publishable example at hand, but it can be easily replicated
by using the PDFMergerUtility and joining the same non-tiny PDF N times, then
splitting it.)
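
The recipe, roughly (just a sketch; it assumes the 3.0 PDFMergerUtility still
has addSource(File) and mergeDocuments(StreamCacheCreateFunction), and that
IOUtils.createMemoryOnlyStreamCache() exists; please double-check the exact
signatures):

import java.io.File;

import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class BuildBigTestPdf {
    public static void main(String[] args) throws Exception {
        // Join the same non-tiny PDF N times to get a big-ish test file,
        // then run the Splitter on the result.
        PDFMergerUtility merger = new PDFMergerUtility();
        for (int i = 0; i < 300; i++) {
            merger.addSource(new File("/tests/one-page-scan.pdf")); // hypothetical path
        }
        merger.setDestinationFileName("/tests/big.pdf");
        merger.mergeDocuments(IOUtils.createMemoryOnlyStreamCache());
    }
}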

Thanks.

Re: RandomAccessReadBuffer performance issues with inputStreams in 3.0

Posted by bnncdv <bn...@gmail.com>.
Thanks, I tried 3.0.1-SNAPSHOT and it does seem fixed.

Just in case here is a basic example (simplified cleanup/etc):

> InputStream is = new FileInputStream(new File("/tests/big.pdf"));
> PDDocument doc = ...;
> //  PDDocument.load(is); //2.0.x
> //  Loader.loadPDF(new RandomAccessReadBuffer(is)); //3.0.x
>
> List<PDDocument> docs = new Splitter().split(doc); //timings here
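
Fleshed out a bit, the measurement was basically this (timings via
System.nanoTime() and heap via Runtime, so only ballpark numbers):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public class SplitTiming {
    public static void main(String[] args) throws Exception {
        try (InputStream is = new FileInputStream("/tests/big.pdf");
             // 3.0.x; for 2.0.x use PDDocument.load(is) instead
             PDDocument doc = Loader.loadPDF(new RandomAccessReadBuffer(is))) {

            long t0 = System.nanoTime();
            List<PDDocument> docs = new Splitter().split(doc);   // timings measured here
            long millis = (System.nanoTime() - t0) / 1_000_000;

            long usedMb = (Runtime.getRuntime().totalMemory()
                    - Runtime.getRuntime().freeMemory()) / (1024 * 1024);
            System.out.println(docs.size() + " parts, ~" + millis + " ms, ~" + usedMb + " MB used");

            for (PDDocument d : docs) {
                d.close();
            }
        }
    }
}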

With a ~70MB PDF of 600 pages (created by joining a PDF with a full-page
image N times):
- 2.0.29: ~0.5 sec, ~300MB
- 3.0.0:  ~7 sec, ~3500MB
- 3.0.1:  ~0.9 sec, ~130MB

With a ~900MB PDF of 9600 pages (uncommon, but a real file sent by a client):
- 2.0.29: ~3.5 sec, ~3800MB
- 3.0.0:  out of memory exception after ~30 sec
- 3.0.1:  ~0.9 sec, ~330MB

Not exact timings, but good enough to compare (they would vary/increase after
handling the List, but that's not relevant here). The high CPU probably
depended on the Java/JDK version, since I assume it was linked to GC calls for
the extra objects, and the frequency etc. would vary per system, so it was
indirectly fixed.

***

Also, for 2.0 we typically use:
- PDDocument.load(is, MemoryUsageSetting.setupMixed(MAX_BYTES))
which seems to reduce/control memory a bit (at the cost of some CPU/etc.).
Does 3.0 have a direct equivalent? I tried things like:
- Loader.loadPDF(rarb, null, null, null,
MemoryUsageSetting.setupMixed(MAX_BYTES).streamCache)
but it doesn't seem to change much. 2.0 may be using ScratchFile internally,
but I'm not sure how to set that up in 3.0?
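
Written out, the 3.0 attempt is this (assuming the five-argument
Loader.loadPDF(RandomAccessRead, password, keyStore, alias,
StreamCacheCreateFunction) overload I found; MAX_BYTES is just an example
value):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdmodel.PDDocument;

public class MixedCacheLoad {
    // Example limit only: keep ~50MB in main memory before spilling elsewhere.
    private static final long MAX_BYTES = 50L * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        try (InputStream is = new FileInputStream("/tests/big.pdf");
             PDDocument doc = Loader.loadPDF(
                     new RandomAccessReadBuffer(is),
                     null,   // password
                     null,   // keyStore
                     null,   // alias
                     MemoryUsageSetting.setupMixed(MAX_BYTES).streamCache)) {
            System.out.println(doc.getNumberOfPages());
        }
    }
}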


Thanks.

Re: RandomAccessReadBuffer performance issues with inputStreams in 3.0

Posted by Andreas Lehmkühler <an...@lehmi.de.INVALID>.
On 28.08.23 13:30, bnncdv wrote:
> When migrating from 2.0 to 3.0 I noticed some operations were very slow,
> mainly the Splitter tool. With a big-ish file it would take *a lot* more
> memory/cpu (jdk8).
What exactly are you doing? I've tried to reproduce the issue and I've
been successful with regard to the memory footprint, but I can't confirm
the higher CPU usage.

I've split the PDF spec, a 32MB file with more than 1,300 pages, into
2-page PDFs, and I can't see any difference with regard to the CPU usage
whether I use a file or an input stream.

However, I was able to reproduce the regression with regard to the 
memory consumption and fixed/optimized it in [1]


> I believe the culprit is RandomAccessReadBuffer with InputStreams. It fully
> reads the stream in 4KB chunks (not a problem in itself), however every time
We have to do that, as we need random access to the file. 2.0.x does the same.

> createView(..) is called (on every PDPage access, I think) it calls a cloning
> RARB constructor, and all of its ByteBuffer chunks are duplicate()'d, which for
> bigger files with many pages means *tons* of wasted objects and calls (even
> if the underlying buffer is the same). Simplifying that, for example by
> reusing the parent bufferList rather than duplicating it, gives the expected
> cpu/memory (I don't know the implications, though).
> 
>  From simple observations Splitter seems to take x4 more cpu/heap. For
> example I'd assume with a 100MB file of 300 pages (normal enough if you
> deal with scanned docs) + inputstream: 100MB = 25600 chunks of 4KB * 300
> pages = 7680000 objects created+gc'd in a short time, at least.
> 
> With smaller files (few pages) this isn't very noticeable, nor with
> RandomAccessReadBufferedFile (different handling). Passing a pre-read
> byte[] to RandomAccessReadBuffer works fine (minimal duplication).
RandomAccessReadBufferedFile has a built-in cache to avoid too many
copies, see [1]

> RandomAccess.createBuffer(inputStream) in alpha3 was also fine but was removed
> in beta1. Either way, I don't think the code should be copying/duplicating so
> much and could be restructured, especially since the migration guide hints at
> using RandomAccessReadBuffer for InputStreams.
Alpha3 did the same as final version 3.0.0. The removed method was 
redundant.

> Also, for RARB it'd make more sense to read chunks as needed in read()
> rather than all at once in the constructor, I think (faster metadata
> querying). Incidentally, it may be useful to increase the default chunk size
> (or allow users to set it) to reduce fragmentation, since it's going to
> read the whole thing anyway and PDFs < 4KB aren't that common, I'd say.
We have to read all the data, as we need random access to the PDF. In many
cases one of the first steps is to jump to the end of the PDF to read the
cross reference table/stream.
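
For illustration (not the actual parser code, just the RandomAccessRead API
seen from the outside):

import org.apache.pdfbox.io.RandomAccessRead;
import org.apache.pdfbox.io.RandomAccessReadBuffer;

public class TailSeek {
    public static void main(String[] args) throws Exception {
        try (java.io.InputStream is = new java.io.FileInputStream("some.pdf");
             RandomAccessRead rar = new RandomAccessReadBuffer(is)) {
            // The parser jumps close to the end first, because that is where
            // the "startxref" offset of the cross reference table/stream
            // lives, so the whole stream has to be available for random access.
            long tail = Math.max(0, rar.length() - 2048);
            rar.seek(tail);
            byte[] window = new byte[(int) (rar.length() - tail)];
            int read = rar.read(window);
            System.out.println("read " + read + " bytes from the tail");
            // ... scan 'window' for "startxref", then seek() to that offset ...
        }
    }
}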

> (I don't have a publishable example at hand, but it can be easily replicated
> by using the PDFMergerUtility and joining the same non-tiny PDF N times, then
> splitting it.)
There has to be something special about your use case and/or PDF, as I
can't reproduce the CPU issue, see above.


Andreas

> Thanks.
> 


[1]  https://issues.apache.org/jira/browse/PDFBOX-5685

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org