Posted to dev@pdfbox.apache.org by "Timo Boehme (JIRA)" <ji...@apache.org> on 2012/05/21 23:20:41 UTC
[jira] [Resolved] (PDFBOX-1098) Wrong implemented stream reader

     [ https://issues.apache.org/jira/browse/PDFBOX-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme resolved PDFBOX-1098.
---------------------------------

       Resolution: Duplicate
    Fix Version/s: 1.7.0
         Assignee: Timo Boehme

Solved in PDFBOX-1299 for PDFParser and directly specified length (a length specified via an indirect object is not reliable for the sequential parser). NonSequentialPDFParser does not have this problem.
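
The idea of trusting a directly specified /Length can be sketched roughly as follows. This is an illustration only, not the PDFBOX-1299 patch: the class LengthBasedStreamReader and its methods readStreamBody and endstreamFollows are hypothetical names, and the length is assumed to have been parsed from the stream dictionary as a direct integer already.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the PDFBox API.
class LengthBasedStreamReader {

    /** Reads exactly declaredLength bytes, ignoring any keywords inside the data. */
    public static byte[] readStreamBody(InputStream in, int declaredLength) throws IOException {
        byte[] body = new byte[declaredLength];
        int off = 0;
        while (off < declaredLength) {
            int n = in.read(body, off, declaredLength - off);
            if (n < 0) {
                throw new EOFException("stream data shorter than declared length");
            }
            off += n;
        }
        return body;
    }

    /** Skips end-of-line bytes, then checks that the keyword "endstream" follows. */
    public static boolean endstreamFollows(InputStream in) throws IOException {
        int c = in.read();
        while (c == '\r' || c == '\n') {
            c = in.read();
        }
        for (byte b : "endstream".getBytes(StandardCharsets.US_ASCII)) {
            if (c != b) {
                return false; // the declared length cannot be trusted
            }
            c = in.read();
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stream data that itself contains the keyword, followed by the real terminator.
        String data = "ab endstream cd";
        byte[] raw = (data + "\nendstream").getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(raw);

        byte[] body = readStreamBody(in, data.length());
        System.out.println(new String(body, StandardCharsets.US_ASCII));
        System.out.println(endstreamFollows(in));
    }
}
```

If endstreamFollows returns false, a parser would presumably fall back to scanning for the keyword, since the declared length was wrong.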
                
> Wrong implemented stream reader
> -------------------------------
>
>                 Key: PDFBOX-1098
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1098
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Reporter: Thomas Chojecki
>            Assignee: Timo Boehme
>            Priority: Critical
>              Labels: ZLIB
>             Fix For: 1.7.0
>
>
> The BaseParser#readUntilEndStream(OutputStream) method parses streams incorrectly. [1]
> The method reads a stream until it reaches the keyword "endstream" and ignores the Length value in the stream dictionary. This implementation breaks nearly every PDF document that has another PDF embedded inside a stream [2].
> The encoders used to compress streams can be block-based (like FlateDecode, which is the most common). If compressing a block of data would not save space, the encoder leaves that block uncompressed and marks it as such, so a stream can contain both compressed and uncompressed parts. If someone embeds a PDF document that itself contains streams inside a stream, the encoder will leave most parts of the embedded document uncompressed. Such parts can contain plain text like "endstream" or other critical keywords that cause the parser to stop.
> So we need to read the full stream length given in the dictionary and not look for the "endstream" keyword until that many bytes have been read.
> The current stream parser causes a ZIPException with the message "Unexpected end of ZLIB input stream".
> A sample PDF and a patch are coming soon.
> [1] PDF 32000-1:2008 -> 7.3.8.2 Stream Extent
> [2] PDF 32000-1:2008 -> 7.11.4 Embedded File Streams
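
The failure mode described in the report can be reproduced in a small, hedged sketch. It is not PDFBox code: the class StoredBlockDemo and its helpers are hypothetical, and Deflater.NO_COMPRESSION is used here only to force stored (uncompressed) deflate blocks, standing in for what an encoder does with blocks that do not shrink. The "endstream" keyword of the embedded fragment then appears verbatim in the encoded bytes, and cutting the data at that keyword leaves the inflater with incomplete input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class for illustration; not part of the PDFBox API.
class StoredBlockDemo {

    /** Deflates input with NO_COMPRESSION, so the data is emitted in stored blocks. */
    public static byte[] deflateStored(byte[] input) {
        Deflater d = new Deflater(Deflater.NO_COMPRESSION);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    /** Returns the index of the first occurrence of needle in haystack, or -1. */
    public static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Inflates the first len bytes of data and returns whatever could be recovered. */
    public static byte[] inflate(byte[] data, int len) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(data, 0, len);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0 && (inf.needsInput() || inf.needsDictionary())) {
                break; // input ended before the zlib stream did
            }
            out.write(buf, 0, n);
        }
        inf.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // A fragment of an embedded PDF that carries its own stream keywords.
        byte[] embedded = "4 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] encoded = deflateStored(embedded);

        // A keyword-scanning reader would stop here, inside the encoded data.
        int cut = indexOf(encoded, "endstream".getBytes(StandardCharsets.US_ASCII));
        System.out.println("keyword found inside encoded data: " + (cut >= 0));

        byte[] full = inflate(encoded, encoded.length);
        byte[] truncated = inflate(encoded, cut);
        System.out.println("length-based read recovers all bytes: "
                + (full.length == embedded.length));
        System.out.println("keyword-based read loses bytes: "
                + (truncated.length < embedded.length));
    }
}
```

With InflaterInputStream instead of a bare Inflater, the truncated case surfaces as an exception about an unexpected end of the ZLIB input stream, which matches the symptom in the report.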

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira