Posted to dev@pdfbox.apache.org by "Jonathan (JIRA)" <ji...@apache.org> on 2019/06/24 12:53:00 UTC

[jira] [Commented] (PDFBOX-4567) Contribution of PDF Linearization

    [ https://issues.apache.org/jira/browse/PDFBOX-4567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871183#comment-16871183 ] 

Jonathan commented on PDFBOX-4567:
----------------------------------

Just wondering, are you interested in integrating this? The algorithm is working well; it's just a matter of exposing the API in a way that makes sense for you.

Right now, we do the following:
 * We use our own parser, proposed in PDFBOX-4542, to avoid having to load all object streams into memory. I still have to check whether your recent implementation of the on-demand parser makes this superfluous, but I'm afraid it doesn't. We need to catalogue and rearrange all objects contained in the PDF, so we walk through the entire PDF object tree, which in my understanding will trigger a complete parse.
 * After parsing the document, we pass the COSDocument instance to a class called PDFLinearizer. At this point it doesn't matter which parser was used. This class applies a reimplementation of the QPDF linearization algorithm. First, we use a PDFOptimizer to determine the order of the objects and to flatten out inherited attributes; then we write the entire PDF and do the length calculations. In contrast to QPDF, we do not write everything twice (once into an empty pipe and a second time to a real file). Instead, we subclassed COSWriter, mainly giving it the ability to reset itself while handing us the content it has written. This way we avoid rewriting the great bulk of the PDF file: we can store most objects in a data structure and write them later. The only parts that really need to be written twice are the cross-reference streams and the linearization dictionary.
 * PDFLinearizer returns a WrittenObjectStore, which can serialize itself to a PDF file. We use this mechanism because we never actually write the file; instead, we store it in an object that emulates an InputStream, transparently resolving the references created by our parser modification. This is a rather special use case of ours. For the public API we could add two serializers: one that writes the file to disk and another that creates a COSDocument/PDDocument.
 * Like QPDF, we don't support outline or thumbnail hint tables. They shouldn't be too hard to implement, though, as all the information is already there.
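
To illustrate the "serialize once, rewrite only the offset-dependent parts" idea from the list above, here is a minimal Java sketch. It is not the actual PDFLinearizer or WrittenObjectStore code and uses no PDFBox classes. The class BufferedObjectStoreSketch, its methods and the header format are hypothetical, and the header is a stand-in for the linearization dictionary, not valid PDF syntax. The assumption is that the offset-dependent part is padded to a fixed width, so filling in the real values does not shift any object.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not PDFBox or QPDF code: each object is serialized
// once and kept as bytes; only the offset-dependent header is regenerated.
class BufferedObjectStoreSketch {
    // Fixed-width fields keep the header length constant, so object offsets
    // stay valid once the real values are filled in.
    private static final String HEADER_FORMAT = "%%HDR length=%010d first=%010d\n";

    private final List<byte[]> objects = new ArrayList<>();

    // Stores an already-serialized object and returns its index.
    int add(String serializedObject) {
        objects.add(serializedObject.getBytes(StandardCharsets.ISO_8859_1));
        return objects.size() - 1;
    }

    // Builds the header bytes for the given values.
    static byte[] header(long totalLength, long firstObjectOffset) {
        return String.format(Locale.ROOT, HEADER_FORMAT, totalLength, firstObjectOffset)
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    // The header length does not depend on the values it carries.
    long headerLength() {
        return header(0, 0).length;
    }

    // Byte offset at which the object with the given index will start.
    long offsetOf(int index) {
        long offset = headerLength();
        for (int i = 0; i < index; i++) {
            offset += objects.get(i).length;
        }
        return offset;
    }

    long totalLength() {
        return offsetOf(objects.size());
    }

    // Writes the header with its final values, then the buffered objects.
    void writeTo(OutputStream out) throws IOException {
        out.write(header(totalLength(), offsetOf(0)));
        for (byte[] object : objects) {
            out.write(object);
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedObjectStoreSketch store = new BufferedObjectStoreSketch();
        store.add("1 0 obj\n<< /Type /Catalog >>\nendobj\n");
        int second = store.add("2 0 obj\n<< /Type /Pages >>\nendobj\n");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        store.writeTo(out);
        System.out.println("second object starts at byte " + store.offsetOf(second));
        System.out.println("bytes written: " + out.size());
    }
}
```

In this sketch the object bytes are produced exactly once; all offsets are computed from the buffered lengths before anything reaches the output, which is the property that lets a second full write pass be skipped.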

Regardless of whether you wish to put the work into integrating this, I would appreciate an answer.

> Contribution of PDF Linearization
> ---------------------------------
>
>                 Key: PDFBOX-4567
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4567
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Utilities, Writing
>            Reporter: Jonathan
>            Priority: Major
>
> We've finally gotten the approval to publish our PDF linearization. How should we do it? I thought about publishing our current source as a fork of PDFBox on GitHub; then we could discuss the best way to integrate the module into your API.


