Posted to dev@pdfbox.apache.org by "Timo Boehme (JIRA)" <ji...@apache.org> on 2012/05/21 16:08:41 UTC

[jira] [Closed] (PDFBOX-1299) BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs

     [ https://issues.apache.org/jira/browse/PDFBOX-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme closed PDFBOX-1299.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0

The new NonSequentialPDFParser always uses the length information, but the sequential PDFParser did not, because in most cases the length is an indirect object whose value is provided later in the file. In some cases, however, the length is already available when the stream is parsed (as in the attached example), and for these cases I applied your patch with a small modification: the length is used only if it is defined directly. For an indirect object we do not know whether it will be revised later on (PDFParser has not read the xref table at this point), so we could read the wrong number of bytes. This limitation should rarely matter in practice, since an indirectly defined length typically comes after the stream object.
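
For readers following along, the decision described above can be sketched roughly as follows. This is only an illustration and not the code committed to PDFBox: the class StreamLengthSketch, the LengthEntry stand-in, and readStreamBody are hypothetical names, and the input is modeled as a plain byte array rather than a parser stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; these names are hypothetical and not PDFBox API.
class StreamLengthSketch {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** Stand-in for a /Length entry: a direct number or an indirect reference. */
    static final class LengthEntry {
        final boolean direct;
        final int value; // only meaningful when direct is true

        private LengthEntry(boolean direct, int value) {
            this.direct = direct;
            this.value = value;
        }

        static LengthEntry ofDirect(int value) { return new LengthEntry(true, value); }
        static LengthEntry ofIndirect() { return new LengthEntry(false, -1); }
    }

    /** Returns the stream body that begins at offset start in data. */
    static byte[] readStreamBody(byte[] data, int start, LengthEntry length) {
        if (length != null && length.direct) {
            // Direct length: trust it and copy exactly that many bytes.
            if (length.value < 0 || length.value > data.length - start) {
                throw new IllegalArgumentException("declared length exceeds available data");
            }
            return Arrays.copyOfRange(data, start, start + length.value);
        }
        // Indirect or missing length: fall back to scanning for the keyword.
        int end = indexOf(data, ENDSTREAM, start);
        if (end < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        return Arrays.copyOfRange(data, start, end);
    }

    /** Index of the first occurrence of pattern at or after from, or -1. */
    private static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

With a direct length the body is copied verbatim even if it contains the keyword; with an indirect or missing length the sketch behaves like the old scan and can still stop early.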

Fixed in rev. 1341030; thanks for the contribution.
                
> BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs
> -------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1299
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1299
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Michael McCandless
>            Assignee: Timo Boehme
>             Fix For: 1.7.0
>
>         Attachments: PDFBOX-1299.patch, Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf
>
>
> The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
> copying bytes to the output, stopping once it sees "endstream".
> The problem with this approach is that sometimes the stream data itself
> contains "endstream", causing readUntilEndOfStream to stop too early.
> This can legitimately happen when the stream is an embedded PDF; I'll
> attach a test PDF showing this.
> However, the stream dict declares the stream length (in bytes)...  so
> it seems like we should be respecting that length (if present) and
> simply copy over that many bytes, instead of scanning the stream bytes
> for endstream?  This should be a lot faster too...
> I imagine we always scan so that we are more robust if the length is
> missing/invalid?  Is that why this method was used?  (I don't know the
> history here...).  If so, maybe we can have an option to use
> the declared stream length if present.
> I have a patch to use the declared stream length (if present), and it enables
> at least this test PDF to correctly parse.
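
The failure mode described in the report can be reproduced with a small self-contained sketch. It does not use PDFBox or the attached test PDF: EndstreamScanDemo, readByScanning, and readByLength are hypothetical helpers written for illustration, and the "embedded PDF" is a hand-made byte string.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; these helpers are hypothetical and not PDFBox API.
class EndstreamScanDemo {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** Copies bytes from start up to the first "endstream" keyword. */
    static byte[] readByScanning(byte[] data, int start) {
        int end = indexOf(data, ENDSTREAM, start);
        if (end < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        return Arrays.copyOfRange(data, start, end);
    }

    /** Copies exactly length bytes from start, ignoring the content. */
    static byte[] readByLength(byte[] data, int start, int length) {
        if (length < 0 || length > data.length - start) {
            throw new IllegalArgumentException("declared length exceeds available data");
        }
        return Arrays.copyOfRange(data, start, start + length);
    }

    /** Index of the first occurrence of pattern at or after from, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The payload imitates an embedded PDF that has its own endstream keyword.
        String payload = "%PDF-1.4\n1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n%%EOF\n";
        String header = "<< /Length " + payload.length() + " >>\nstream\n";
        byte[] data = (header + payload + "\nendstream\nendobj\n")
                .getBytes(StandardCharsets.ISO_8859_1);
        int start = header.length();

        byte[] scanned = readByScanning(data, start);
        byte[] exact = readByLength(data, start, payload.length());
        System.out.println("payload bytes:   " + payload.length());
        System.out.println("scanned bytes:   " + scanned.length);
        System.out.println("by-length bytes: " + exact.length);
    }
}
```

In this sketch the scan stops at the inner keyword and returns a truncated body, while the length-based copy returns the whole payload.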
