Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2020/01/27 19:01:00 UTC

[jira] [Resolved] (PDFBOX-4569) Implement an ondemand Parser

     [ https://issues.apache.org/jira/browse/PDFBOX-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-4569.
----------------------------------------
    Resolution: Fixed

I guess we are done here so far. Any further optimization should have its own ticket.

+Summary+

The parser starts by reading all cross-reference information and creates the trailer object holding the root dictionary. All other objects are read on demand, following these steps:

* create a COSObjectKey for the object number
* get the COSObject for the COSObjectKey by calling COSDocument#getObjectFromPool
* COSObject#getObject dereferences the COSBase we are looking for
* the interface ICOSParser was introduced to decouple COSObject and the parser used to dereference the object
* COSParser implements the interface and does the parsing
* the COSBase object is cached in COSObject for further use
* objects within an object stream are dereferenced one by one

All of this is done automagically so that the end user doesn't have to change anything to use the on demand parser.
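
The steps above can be sketched in plain Java. This is only a simplified model of the pattern, not the PDFBox code: ObjectKey, LazyObject, Dereferencer, ObjectPool and CountingParser are hypothetical stand-ins for COSObjectKey, COSObject, ICOSParser, COSDocument and COSParser, and a String stands in for the dereferenced COSBase.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for COSObjectKey: object number plus generation number. */
final class ObjectKey {
    final long number;
    final int generation;

    ObjectKey(long number, int generation) {
        this.number = number;
        this.generation = generation;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ObjectKey)) {
            return false;
        }
        ObjectKey key = (ObjectKey) other;
        return number == key.number && generation == key.generation;
    }

    @Override
    public int hashCode() {
        return Objects.hash(number, generation);
    }
}

/** Hypothetical stand-in for ICOSParser: all a lazy object knows about the parser. */
interface Dereferencer {
    String dereference(ObjectKey key);
}

/** Hypothetical stand-in for COSObject: parses on first access, then caches. */
final class LazyObject {
    private final ObjectKey key;
    private final Dereferencer parser;
    private String value;    // cached result of the first dereference
    private boolean loaded;  // separate flag so a null result is cached too

    LazyObject(ObjectKey key, Dereferencer parser) {
        this.key = key;
        this.parser = parser;
    }

    String getObject() {
        if (!loaded) {
            value = parser.dereference(key);
            loaded = true;
        }
        return value;
    }
}

/** Hypothetical stand-in for COSDocument: hands out one LazyObject per key. */
final class ObjectPool {
    private final Map<ObjectKey, LazyObject> pool = new HashMap<>();
    private final Dereferencer parser;

    ObjectPool(Dereferencer parser) {
        this.parser = parser;
    }

    LazyObject getObjectFromPool(ObjectKey key) {
        return pool.computeIfAbsent(key, k -> new LazyObject(k, parser));
    }
}

/** Hypothetical stand-in for COSParser: counts calls so the laziness is visible. */
final class CountingParser implements Dereferencer {
    int parseCalls = 0;

    @Override
    public String dereference(ObjectKey key) {
        parseCalls++;
        return "parsed object " + key.number + " " + key.generation;
    }
}
```

Fetching a LazyObject from the pool does not trigger any parsing; only the first getObject() call does, and later calls for the same key return the cached value.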

+Some important details+
* less memory consumption if one doesn't need all objects, e.g. text extraction doesn't need to read image information
* no performance regression so far; the initial loading is much faster, but the parser needs more time to load the objects on demand if the number of objects to be processed is nearly the same in both cases (on demand vs. old parser)
* the more objects are needed/loaded, the smaller the memory savings, as all objects are cached and in the end the memory footprint is nearly the same

+Some findings for further optimizations+
I've tried to deactivate the caching of objects within COSObject. Instead of storing them I've simply reloaded the objects. That doesn't work, as there may be changes made to the loaded objects which are reverted when reloading them. IMHO the main cause of this effect is the fact that the two layers (COS and PD) are glued together into one layer which doesn't support such changes. One idea could be to really separate both layers by creating PD objects from COS objects without using them for storage and drop the COS objects afterwards. That would be a huge effort.
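
The reverted-changes effect can be reproduced with a small model. This is a sketch under assumptions, not PDFBox code: ReloadableObject and ReloadDemo are hypothetical names, a Map stands in for a COS dictionary, and parseFromFile() simulates a parser that always returns the state stored in the file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical lazy holder with a switch for caching; not a PDFBox class. */
final class ReloadableObject {
    private final Supplier<Map<String, String>> loader;
    private final boolean cache;
    private Map<String, String> cached;

    ReloadableObject(Supplier<Map<String, String>> loader, boolean cache) {
        this.loader = loader;
        this.cache = cache;
    }

    Map<String, String> get() {
        if (!cache) {
            // No caching: every access re-parses and yields a fresh copy.
            return loader.get();
        }
        if (cached == null) {
            cached = loader.get();
        }
        return cached;
    }
}

final class ReloadDemo {
    /** Simulates parsing a dictionary: always yields the state stored in the file. */
    static Map<String, String> parseFromFile() {
        Map<String, String> dict = new HashMap<>();
        dict.put("Title", "original");
        return dict;
    }

    /** Edits the dictionary through one access and reads it back through another. */
    static String editThenRead(boolean cache) {
        ReloadableObject object = new ReloadableObject(ReloadDemo::parseFromFile, cache);
        object.get().put("Title", "edited");
        return object.get().get("Title");
    }
}
```

With caching the second access sees the edit; without it the edit was applied to a throwaway copy and the second access sees the file state again.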

I've tried to use memory mapped files as input but stumbled upon our scratch file implementation. IMHO we have to drop/change that first if we want to support memory mapped files in combination with on demand parsing.
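
For reference, this is roughly what memory-mapped random access looks like with the plain JDK. It is only a sketch: MappedInput and readAt are hypothetical names, it says nothing about how this would plug into the PDFBox input classes, and it does not address the scratch file issue.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: reads a byte range of a file through a read-only mapping. */
final class MappedInput {
    static String readAt(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map only the requested region; the OS pages the data in on access.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] bytes = new byte[length];
            buffer.get(bytes);
            // ISO-8859-1 maps every byte to exactly one char.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

An on-demand parser could use such a mapping to jump straight to the offset recorded in the cross-reference table for the object it needs.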



> Implement an ondemand Parser
> ----------------------------
>
>                 Key: PDFBOX-4569
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4569
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: PDFBOX-1084.pdf
>
>
> There is a need to replace the big bang parser with an ondemand parser



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
