Posted to dev@pdfbox.apache.org by "robo (Jira)" <ji...@apache.org> on 2021/05/17 12:10:00 UTC

[jira] [Commented] (PDFBOX-5068) OutOfMemory while signing large documents - continued

    [ https://issues.apache.org/jira/browse/PDFBOX-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346116#comment-17346116 ] 

robo commented on PDFBOX-5068:
------------------------------

[^issue5068.patch]

Dear PDFBox-Community,

I implemented a solution for the memory problem described here, which you can find in the attached git patch. Here is a short description of the problem and how I proceeded (more details below):

1. Currently, when saving a PDF incrementally, PDFBox loads all COS objects into memory (see COSWriter::prepareIncrement). As a result, memory usage grows in most cases linearly with the size of the PDF being saved, although only a small portion of the COS objects is actually needed. Furthermore, this "eager" loading of the COS objects sometimes has the undesirable side effect that signing takes unexpectedly long (I list an example of a 20MB PDF below that takes more than 12s).
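To illustrate point 1, here is a deliberately simplified sketch of the eager behavior: every key in the xref table is materialized up front, whether or not the incremental save needs it. The names (parse, loadAll, EagerLoadSketch) are illustrative stand-ins, not PDFBox's actual API; parse() plays the role of COSDocument.getObjectFromPool().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified sketch of the "eager" loading described above. All names are
// illustrative; parse() stands in for COSDocument.getObjectFromPool(),
// which in PDFBox decompresses and builds the object for a given key.
class EagerLoadSketch {
    static String parse(long key) {
        return "object-" + key;
    }

    // Every entry in the xref table is parsed immediately, so memory
    // grows with document size even if only a handful of objects
    // (e.g. the signature dictionary) are actually needed.
    static Map<Long, String> loadAll(Set<Long> xrefKeys) {
        Map<Long, String> objects = new HashMap<>();
        for (long key : xrefKeys) {
            objects.put(key, parse(key));
        }
        return objects;
    }
}
```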

2. The COSWriter::prepareIncrement method, in turn, uses two HashMaps to store the keys and the COSObjects themselves in a bidirectional fashion. This bidirectional use of paired maps occurs more than once within the COSWriter class, so I first extracted the duplicated code into an EagerCOSWriterObjectStorage class (subtyping an abstract COSWriterObjectStorage).
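The extracted storage from point 2 can be sketched roughly as follows. This is a hypothetical illustration, not the patch's exact code: the object-to-key direction uses an IdentityHashMap here because, as discussed further below, COSObjects lack an equals/hashCode implementation and are effectively compared by reference.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Rough sketch of the extracted bidirectional storage: the two maps that
// COSWriter previously kept in sync by hand, hidden behind one class.
// Names are illustrative and do not match the attached patch exactly.
class BidirectionalObjectStorage<K, V> {
    private final Map<K, V> keyToObject = new HashMap<>();
    // Identity semantics, since the stored objects are compared by
    // reference rather than semantically.
    private final Map<V, K> objectToKey = new IdentityHashMap<>();

    // Register both directions together so the maps cannot diverge.
    void put(K key, V object) {
        keyToObject.put(key, object);
        objectToKey.put(object, key);
    }

    V getObject(K key) { return keyToObject.get(key); }
    K getKey(V object) { return objectToKey.get(object); }
}
```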

3. Finally, I implemented a LazyCOSWriterObjectStorage that retrieves the COS Objects only when they are actually required and tested it with different PDF files (using CreateVisibleSignature2.java). The results are roughly as follows:

=======================

1.3GB PDF (text)
*{color:#de350b}eager 16.4s (-Xmx1600M){color}* [requires >= -Xmx1600M on my PC]
lazy 8.3s (-Xmx1600M), 9.4s (-Xmx85M) [requires >= -Xmx85M]

200MB PDF (scanned book, no OCR)
eager 7.6s (-Xmx230M) [requires >= -Xmx230M]
lazy 7.0s (-Xmx230M), 7.6s (-Xmx25M)

[https://www.bergundtal.ch/Data/Inhalte/PDF_BT_Jahreskatalog_2018_low.pdf]
50MB PDF (text)
eager 1.8s (-Xmx85M) [requires >= -Xmx85M]
lazy 1.4s (-Xmx85M), 1.6s (-Xmx25M)

[https://crossasia-books.ub.uni-heidelberg.de/xasia/reader/download/506/506-42-86246-2-10-20190822.pdf]
21MB PDF (text)
*{color:#de350b}eager 13s (-Xmx130M){color}* [requires >= -Xmx130M]
lazy 1.4s (-Xmx130M), 1.6s (-Xmx70M) [requires >= -Xmx70M]

[minimum.pdf, attached to this JIRA case]
31kB PDF (text)
eager 0.9s (-Xmx25M)
lazy 0.9s (-Xmx25M)

=======================
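The lazy retrieval benchmarked above can be sketched roughly as follows; this is a hypothetical simplification, and the names (LazyObjectStorage, resolver, loadedCount) are illustrative rather than the patch's actual API. The resolver stands in for a lookup against the already-parsed COSDocument.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Rough sketch of the lazy variant: an object is resolved only on first
// access, instead of the whole xref table being loaded up front in
// prepareIncrement. Names are illustrative, not the patch's actual API.
class LazyObjectStorage<K, V> {
    private final Map<K, V> resolved = new HashMap<>();
    private final Function<K, V> resolver;

    LazyObjectStorage(Function<K, V> resolver) {
        this.resolver = resolver;
    }

    // Only objects the incremental save actually touches are ever loaded,
    // so peak memory no longer scales with total document size.
    V getObject(K key) {
        return resolved.computeIfAbsent(key, resolver);
    }

    // How many objects have been materialized so far.
    int loadedCount() {
        return resolved.size();
    }
}
```

Since an incremental signing pass typically touches only the trailer, the catalog, and a few page objects, the number of materialized objects stays small regardless of how large the document is — which is consistent with the -Xmx figures above.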

More technical details and points of discussion:

• I aimed to provide a solution without refactoring too much of the current code; EagerCOSWriterObjectStorage should replicate the previous behavior 1:1. The problem is that the COS-related code seems hardly maintainable, and small changes may have unexpected results. One problem, for example, was that once a COSObject is dereferenced (COSObject::getObject), the result is cached inside the COSObject instance, presumably for optimization, but also because COSObjects (dereferenced or not) are compared by reference (instance identity) rather than semantically. This missing equals/hashCode implementation in COSObject was especially problematic, since COSObjects are used as keys in several HashMaps/HashSets. At first I assumed that a COSObjectKey could be used to identify a COSObject (there is also a COSObject::getKey). However, this was not the case, at least within the scope of COSWriter and the signing process; furthermore, the key of a COSObject is not always set, which is yet another problem. Since I am still not sure how to compare two COSObjects semantically (that is, in the way the code is currently implemented and used), I eventually solved this by implementing a COSObject::getObjectWithoutCaching and by storing a reference to the COSObject in the dereferenced object, rather than the other way around. A further problem is that the state of many COS-related classes is only defined at run-time, not at instance creation, which also makes the code hard to test (in part, it seems, only integration tests are possible). I know the code base is very old, and what I write here should by no means be understood as a critique. On the contrary, my intention is only to document the problems I encountered and how complicated it was (at least for me, as an outsider unfamiliar with the code) to make a change that I expected to be much simpler.
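The caching and identity pitfall described in the previous point can be illustrated with a minimal sketch. This is not PDFBox code; ObjectRef is a hypothetical stand-in for a reference wrapper like COSObject, and the method names mirror the ones mentioned above only for orientation.

```java
// Hypothetical sketch of the caching/identity pitfall. ObjectRef stands
// in for a reference wrapper like COSObject; it has no equals/hashCode,
// so two wrappers for the same indirect object are distinct map keys.
class ObjectRef {
    private final long key;
    private Object cached; // filled on first dereference

    ObjectRef(long key) { this.key = key; }

    // Mirrors the getObject() behavior described above: dereferencing
    // caches the result, which keeps large decompressed objects alive.
    Object getObject() {
        if (cached == null) {
            cached = resolve();
        }
        return cached;
    }

    // Mirrors the proposed getObjectWithoutCaching(): dereference without
    // pinning the result in memory (reuse the cache only if already set).
    Object getObjectWithoutCaching() {
        return cached != null ? cached : resolve();
    }

    private Object resolve() { return "resolved-" + key; }
}
```

With reference equality, two ObjectRef instances for the same key land in a HashSet as two separate entries — which is why using such wrappers as map keys is fragile.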
 
 • Please consider the attached patch a work in progress. I did not include the partial tests on which COSWriterObjectStorage is based, and I have not added any documentation yet; I will add both after the final clean-up. Since everything appears to work fine with the LazyCOSWriterObjectStorage alone, it would be great if the EagerCOSWriterObjectStorage could be dropped altogether. That would also remove the need to inject the type of COSWriterObjectStorage from the outside, which adds complexity to the interface (and is why I have not fully implemented that part yet). However, I am not too sure whether the current tests actually cover most of the cases. There are also many things that could be improved but would probably induce more far-reaching changes. For example, it would be great if the COSDocument could be set at construction time. Furthermore, COSObject::getObject is still called from some other places in the code, so some caching still takes place. Finally, there is also a COSObjectPool that already encapsulates a bidirectional key-object map; it did not really fit what I intended to do, but the two could possibly be reconciled.
 
 • In any case, I think resolving this issue would be a substantial improvement. I am generally skeptical of optimizations, but this memory problem affects more than just edge cases. In fact, what I did here was not an optimization but the removal of an apparently unnecessary optimization (caching) that caused the performance and memory problems in the first place.

I am looking forward to your suggestions and comments.

> OutOfMemory while signing large documents - continued
> -----------------------------------------------------
>
>                 Key: PDFBOX-5068
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5068
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Signing
>    Affects Versions: 2.0.23
>            Reporter: Ralf Hauser
>            Priority: Major
>         Attachments: RandomAccessReadBufferDiag.java, issue5068.patch, minimum.pdf
>
>
> Continuation of PDFBOX-2512
>  
> in COSWriter.prepareIncrement(), for the test case, cosDoc.getXrefTable().keySet() has size 5925. For each of these keys, cosDoc.getObjectFromPool() gets an object that is not just referencing some part of the input document, but duplicates it (which is unavoidable in the case where they are decompressed with FlateFilter - albeit this could possibly be done "lazily")
> -Xmx20m  746/5925
>  -Xmx25m 1615/5925
>  -Xmx30m 2800/5925
>  -Xmx40m 3872/5925
>  -Xmx55m 5773/5925
> With 60m, it gets them all, but dies later with less telling
>    java.lang.OutOfMemoryError: GC overhead limit exceeded
>  
> This assumes the patch of PDFBOX-5067 already in place - or using CreateVisibleSignature2.java as starting point



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
