Posted to dev@pdfbox.apache.org by "Christian Appl (Jira)" <ji...@apache.org> on 2021/09/23 14:09:00 UTC
[jira] [Comment Edited] (PDFBOX-4952) PDF compression - object stream creation

    [ https://issues.apache.org/jira/browse/PDFBOX-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419226#comment-17419226 ] 

Christian Appl edited comment on PDFBOX-4952 at 9/23/21, 2:08 PM:
------------------------------------------------------------------

The following cannot be reproduced using PDFBox trunk, so it is an issue on my side. Well... where did I go wrong?

-We found another issue and have not found an explanation yet.-

-*When compressing the document: [^problematic.pdf]*-

-*Different applications and their reactions:*-
 - -Adobe DC refuses to open the document (EC 14 "There was an error opening this document. There was a problem reading this document." Detail: "Expected a dict object.")-
 - -PDFBox Debugger as well as Chrome have no issue displaying the document properly, and at first glance the object structure is identical.-
 - -Firefox however (pdfjs I assume) reports "Invalid top-level pages dictionary."-
 - -PDDocument loads the document without issues.-

-"Basic" PDF parsers seem to have no issue whatsoever processing the resulting PDF. But some object that should not be stored in an object stream seems to have been included in one.-

-*Workaround:*-
 -If the document is first saved with PDFBox without compression, a subsequent compressed save produces a fine and unproblematic result.-

-*What I assumed and tried up to now:*-
 -The original document already contains an object stream. I assumed that compressed saving might have problems processing existing object streams (a major flaw, if that were the case), so I saved a document repeatedly using compressed saving. That turned out fine and did not cause any issues.-

-*Conclusion:*-
 -In its original form the PDF contains some structure that results in an erroneous compression: a structure that PDFBox is capable of removing, but that the object streaming logic cannot process correctly.-

-I have not yet tried whether this can be reproduced using the current PDFBox trunk; I will do so now and report back.-
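
For reference on the remark above that some object seems to end up in an object stream although it should not: the PDF specification (ISO 32000-1, section 7.5.7) lists objects that shall not be stored in an object stream, namely stream objects, objects with a generation number other than zero, the document's encryption dictionary, and the object holding the Length value of an object stream. The following is only a minimal sketch of that rule in plain Java. The class, enum and method names are hypothetical and are not part of PDFBox, and nothing here claims that one of these rules is what went wrong with the attached document.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch only: hypothetical names, not PDFBox API.
final class ObjectStreamEligibility {

    /** Coarse classification of an indirect object, just enough for the rules below. */
    enum Kind {
        DICTIONARY,
        ARRAY,
        OTHER_NON_STREAM,
        STREAM,
        ENCRYPTION_DICTIONARY,
        OBJECT_STREAM_LENGTH
    }

    // Kinds that ISO 32000-1, 7.5.7 excludes from object streams regardless of generation.
    private static final Set<Kind> NEVER_COMPRESSED =
            EnumSet.of(Kind.STREAM, Kind.ENCRYPTION_DICTIONARY, Kind.OBJECT_STREAM_LENGTH);

    /**
     * Returns true if an indirect object of the given kind and generation number
     * is allowed inside an object stream according to the rules listed above.
     * Linearized files have additional restrictions that this sketch ignores.
     */
    static boolean mayGoIntoObjectStream(Kind kind, int generation) {
        if (generation != 0) {
            return false; // only generation 0 objects may be compressed
        }
        return !NEVER_COMPRESSED.contains(kind);
    }
}
```

A writer could call such a check for every indirect object before assigning it to an object stream, and write the rejected ones as ordinary top-level objects instead.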


was (Author: capsvd):
We found another issue and have not found an explanation yet.

*When compressing the document: [^problematic.pdf]*

*Different applications and their reactions:*
- Adobe DC refuses to open the document (EC 14 "There was an error opening this document. There was a problem reading this document." Detail: "Expected a dict object.")
- PDFBox Debugger as well as Chrome have no issue displaying the document properly, and at first glance the object structure is identical.
- Firefox however (pdfjs I assume) reports "Invalid top-level pages dictionary."
- PDDocument loads the document without issues.

"Basic" PDF parsers seem to have no issue whatsoever processing the resulting PDF. But some object that should not be stored in an object stream seems to have been included in one.

*Workaround:*
If the document is first saved with PDFBox without compression, a subsequent compressed save produces a fine and unproblematic result.

*What I assumed and tried up to now:*
The original document already contains an object stream. I assumed that compressed saving might have problems processing existing object streams (a major flaw, if that were the case), so I saved a document repeatedly using compressed saving. That turned out fine and did not cause any issues.

*Conclusion:*
In its original form the PDF contains some structure that results in an erroneous compression: a structure that PDFBox is capable of removing, but that the object streaming logic cannot process correctly.

I have not yet tried whether this can be reproduced using the current PDFBox trunk; I will do so now and report back.

> PDF compression - object stream creation
> ----------------------------------------
>
>                 Key: PDFBOX-4952
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4952
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>    Affects Versions: 2.0.21
>            Reporter: Christian Appl
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: 102_Spot_to_CMYK_X1a.pdf, 102_Spot_to_CMYK_X1a_unc_BAD-3.0.0.pdf, 102_Spot_to_CMYK_X1a_unc_GOOD-2.0.22.pdf, image-2020-09-07-09-47-30-172.png, image-2020-09-07-10-05-15-631.png, image-2021-08-17-10-07-33-682.png, image-2021-08-17-10-10-21-418.png, image-2021-08-17-10-21-00-352.png, image-2021-08-17-10-24-44-999.png, image-2021-08-17-10-56-48-431.png, problematic.pdf
>
>
> I implemented a basic starting point for PDF compression, based on PDFBox 2.0.22-SNAPSHOT.
> I want to use this ticket to ask whether you would be interested in such a feature and in merging it into PDFBox.
> This is a sort of POC: it implements only some very basic functionality, which surely can and must be extended further, and it comes with only very basic, simplistic unit tests.
>  However, it is able to reduce the size of the resulting documents and creates object streams as defined in the PDF reference manual.
> *What it currently does:*
>  It provides the bundling and compression of objects into object streams -and further applies simple content compression to a small selection of contents-.
> -To realize content compression, it provides a simple interface and abstract class for "ContentCompressor"s which search a document for specific content, that could be compressed and do compress that contents.-
> -Currently two content compressors exist:-
>  -_ImageCompressor_-
>  -Searches for simple images, that could be compressed using DCT.-
> -_UnencodedStreamCompressor_-
>  -Searches the document for yet unencoded streams and applies a Flate compression where necessary.-
> -Both compressors can be parameterized using a centralized "CompressParameters" instance which is passed to a new "saveCompressed" method of PDDocument.-
> The compression is based on, modifies and is realized by a set of extensions for the "COSWriter" class. Basically it organizes the objects that are passed to the COSWriter into object streams -and applies content optimization where necessary and possible-.
> Currently this does support encryption, but does not support linearization of the compressed documents.
> *Caveat:*
>  If this feature is interesting to you, then I would not expect you to simply merge this fork into 2.0.22. I am expecting that you would like to have some details and concepts changed and am ready to implement changes that would be required for this to work to your liking.
> *POC:*
>  4 resulting documents can be found in "target/test-output/compression" when "COSDocumentCompressionTest" is run.
> *The Pull request can be found on Github at:*
>  [https://github.com/apache/pdfbox/pull/86]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
