You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Christian Appl (Jira)" <ji...@apache.org> on 2021/08/26 08:56:00 UTC

[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

    [ https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405055#comment-17405055 ] 

Christian Appl edited comment on PDFBOX-5263 at 8/26/21, 8:55 AM:
------------------------------------------------------------------

Hmmm - it currently really is rather a POC and it still has a lot of issues... I will try to develop and suggest an updated version, which (besides other optimizations) does not require the initial usage of COSUpdateInfoList.
 Currently this initial iteration does access and load all objects of the COSTree, which does somewhat annihilate the Version 3 JIT loading logic. That class really should not exist and I will try to eliminate it.

*Just for transparency:*
 I somehow have to solve this issue one way or another and will have to develop a complete solution. I will share my results here but as this is a modification to the basics of PDFBox I can see why caution and prejudice is advised against my changes.

*Premises:*
 - This solution would have to use setters! Constructors of COS and PD structures must not be touched.
 - The preexisting "hasToBeUpdated" logic, shall work as currently implemented and shall not require to be changed.
 - Tests may not be changed and must resolve as is.
 - Objects shall never be forced to load, just to solve all this. (Contradicting the JIT parsing logic is forbidden.)

*Current work in progress and basic ideas for optimization:*
 The following would be my ideas how all this could be accomplished - this still is a raw concept and still does fail to do the trick (work in progress), but I still want to share it, for feedback and/or denial of that contribution/solution.
 (Also - somewhere along the line there is an error in my concept... still have to fix that - this works but still produces unstable results for dereferenced COSObjects.)
 - The COSDocument shall offer a "COSDocumentState" via a getter. That state may be altered at any time, and shall allow processing classes to indicate whether a document is currently being initialized and whether object modifications shall be collected or not.
 - COSBase objects shall have the ability to store the COSDocument they belong to, via getters and setters:
 !image-2021-08-26-09-52-33-567.png!

 - Parent structures shall inform their children, if a root COSDocument has been set and shall propagate such information throughout the COSTree, whenever a new child is being added:
 !image-2021-08-26-09-54-24-897.png!
 Example usage via COSDictionary#setItem():
 !image-2021-08-26-10-00-07-383.png!

 - Also parent structures shall inform their children, when a new base document has been set:
 Example usage via COSDictionary#setDocument():
 !image-2021-08-26-10-02-08-003.png!
 - Resulting in: The update states of objects can be set in accordance with the document state:
 Example usage via COSUpdateInfoBundle (from the given patch):
 !image-2021-08-26-10-03-47-940.png!

 - The COSBase of COSObjects shall only then be updated, if that object is dereferenced, therefore delaying initialization in accordance with the JIT logic (this still is causing issues - but that is the raw idea.)
*Edit:* Found the issue "copyStateTo" is erroneous and does fail the tests as it is, therefore it has been replaced.
 !image-2021-08-26-10-06-42-925.png!

 - And finally the first node being told the COSDocument, is the trailer itself via:
 COSDocument#setTrailer():
 !image-2021-08-26-10-09-12-698.png!
 Which will then cause iterative calls to "setDocument" whenever a child is added to another COS structure (i.e. if a Parser is iteratively loading a document).

This could be used by PDFParser like:
 !image-2021-08-26-10-12-19-265.png!
 The Loader would remain untouched and the document would only be iterated as far as a parser (or user) requires - only unpacking required structures at the end, when the document is actually written.


was (Author: capsvd):
Hmmm - it currently really is rather a POC and it still has a lot of issues... I will try to develop and suggest an updated version, which (besides other optimizations) does not require the initial usage of COSUpdateInfoList.
Currently this initial iteration does access and load all objects of the COSTree, which does somewhat annihilate the Version 3 JIT loading logic. That class really should not exist and I will try to eliminate it.

*Just for transparency:*
 I somehow have to solve this issue one way or another and will have to develop a complete solution. I will share my results here but as this is a modification to the basics of PDFBox I can see why caution and prejudice is advised against my changes.

*Premises:*
- This solution would have to use setters! Constructors of COS and PD structures must not be touched.
- The preexisting "hasToBeUpdated" logic, shall work as currently implemented and shall not require to be changed.
- Tests may not be changed and must resolve as is.
- Objects shall never be forced to load, just to solve all this. (Contradicting the JIT parsing logic is forbidden.)

*Current work in progress and basic ideas for optimization:*
The following would be my ideas how all this could be accomplished - this still is a raw concept and still does fail to do the trick (work in progress), but I still want to share it, for feedback and/or denial of that contribution/solution.
(Also - somewhere along the line there is an error in my concept... still have to fix that - this works but still produces unstable results for dereferenced COSObjects.)
- The COSDocument shall offer a "COSDocumentState" via a getter. That state may be altered at any time, and shall allow processing classes to indicate whether a document is currently being initialized and whether object modifications shall be collected or not.
- COSBase objects shall have the ability to store the COSDocument they belong to, via getters and setters:
!image-2021-08-26-09-52-33-567.png!

- Parent structures shall inform their children, if a root COSDocument has been set and shall propagate such information throughout the COSTree, whenever a new child is being added:
!image-2021-08-26-09-54-24-897.png!
Example usage via COSDictionary#setItem():
!image-2021-08-26-10-00-07-383.png!


- Also parent structures shall inform their children, when a new base document has been set:
Example usage via COSDictionary#setDocument():
!image-2021-08-26-10-02-08-003.png!
- Resulting in: The update states of objects can be set in accordance with the document state:
Example usage via COSUpdateInfoBundle (from the given patch):
!image-2021-08-26-10-03-47-940.png!

- The COSBase of COSObjects shall only then be updated, if that object is dereferenced, therefore delaying initialization in accordance with the JIT logic (this still is causing issues - but that is the raw idea.)
!image-2021-08-26-10-06-42-925.png!

- And finally the first node being told the COSDocument, is the trailer itself via:
COSDocument#setTrailer():
!image-2021-08-26-10-09-12-698.png!
Which will then cause iterative calls to "setDocument" whenever a child is added to another COS structure (i.e. if a Parser is iteratively loading a document).

This could be used by PDFParser like:
!image-2021-08-26-10-12-19-265.png!
The Loader would remain untouched and the document would only be iterated as far as a parser (or user) requires - only unpacking required structures at the end, when the document is actually written.

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-5263
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5263
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel, Writing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Christian Appl
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: Enhanced_incremental_saving_PDFBox3.patch, image-2021-08-23-14-55-24-077.png, image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is rather a POC, than an actual suggestion for direct inclusion. (For reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an incremental PDF) is , that in between signatures the document changed and the additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, he would find, that most of the time made changes are completly ignored, when calling "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching constructors in "COSWriter", this would require, to identify the "path" of all made changes and one would need to set the "needToBeUpdated" flag of all elements of that path.
> *But:*
> As documented one would have to have exact understanding of what he did and how the PDF standard does implement this, he would have to identify said structures and the more complex the changes were, the more tedious this would become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole path must be informed that it required an update.
> Resulting in unnecessary large increments, as not all ancestors might actually have changed.
> e.g. If one added an image to a preexisting page of the document - the contentstream, the resources of the page and the page dictionary would have changed. But the "pages" array and all it's ancestors would not have changed a bit, but still would have to be informed and included.
> *Assumptions that lead to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element did not change. It's descendants still could have changed!
> - Externally managing an object´s update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know and manage by themselves whether they were freshly created or altered
> (e.g.: A COSDictionary should be able to remember, that a setter had been called).
> - If "COSUpdateInfo" objects were self aware and would solve this by themselves, it would not be necessary anymore to set update states manually.
> *Problems:*
> The first and obvious problem is, that the initial loading of a document is creating and altering new COS structures and we obviously don't want objects to observe and remember those changes. An object that is created during document initialization must be treated as preexisting.
> However: COSBase is not context aware - it does know it's descendants, but neither does it know it's parent, nor does it know it's root.
> If it was, that actually would present the optimal solution, as in that case the Object could ask it's root for the current load state and therefore would be able to ignore said changes caused by the initial loading of a document.
> But it is not. (My opinion is - it should be! But more on that later.)
> Therefore a a helper named COSUpdateInfoList was implemented, which was capable of finding COSUpdateInfo objects in a COS structure, and that allowed resetting their update state after loading was completed.
> *Description of the patch:*
> The patch implements selfaware COSUpdateInfo objects, which the COSWriter has been adapted to process. PDFBox therefore is capable of monitoring changes in realtime and to automatically include altered structures in an incremental save of the document, therefore creating increments (or an increment), that a signature would sign.
> *Result:*
> Using this patch documents could be created:
> incrementally adding pages, adding contents to pages, adding annotations, altering structures, removing structures.
> As far as has been initially tested the resulting documents were valid, viewable in a reader and the objects overwritten in increments seemed correct.
> *But -* *Caveat:*
> This patch does introduce atleast one ugly class (most likely you will be able to point out more, that could be optimized :)) and that is "COSUpdateInfoList" - as already explained: In my opinion such a class should not exist, the COSUpdateInfo objects should be context aware and should be capable of regulating their own behaviour.
> Whenever the alternatives are to either manage an object externally, or to "teach" an object to solve problems autonomously, I will tend to prefer the latter... but I did not dare to do that.
> This would require, that either further constructors or setters would have to be introduced for COSBase objects, that allowed setting parent/root/context for the object.
> Which would result in further massive changes for using applications and PDFBox itself - as all instantiations of COSBase objects (PDObjects) would have to be adapted.
> However: I would prefer if COSBase objects actually were context aware.
> But as stated... I did not dare to touch it and instead chose the ugly workarround, that would introduce yet another iteration over the whole COSDocument structure.
> Eliminating COSUpdateInfoList would be preferable!
> *Suggestion:*
> As PDFBox 3 is already changing how documents and objects are handled, I would suggest, that also COSBase objects should be made context and selfaware in PDFBox 3.
> This would allow simplifying handling COS objects using PDFBox and it would allow for an easier and automized handling of incremental saving.
> *Usage example:*
> The following "pseudo code" (actually using simplified Helper classes) demonstrates the intended usage:
> !image-2021-08-23-14-55-24-077.png!
> *As always:* Thank you very much for your work and support! I hope this suggestion is to your liking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org