Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2015/04/17 18:42:59 UTC

[jira] [Created] (PDFBOX-2762) remove parseCOSStream() call from PDFStreamParser

Tilman Hausherr created PDFBOX-2762:
---------------------------------------

             Summary: remove parseCOSStream() call from PDFStreamParser
                 Key: PDFBOX-2762
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2762
             Project: PDFBox
          Issue Type: Task
          Components: Parsing
    Affects Versions: 2.0.0
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 2.0.0


The following code is in PDFStreamParser:
{code}
if (c == '<')
{
    COSDictionary pod = parseCOSDictionary();
    skipSpaces();
    if ((char)pdfSource.peek() == 's')
    {
        retval = parseCOSStream( pod );
    }
    else
    {
        retval = pod;
    }
}
{code}
This is incorrect. PDFStreamParser is for content streams, and a content stream cannot contain stream objects. The spec requires that "All streams shall be indirect objects", and an "indirect object" is something between obj and endobj. Indirect objects, however, are not allowed in content streams: "Indirect objects and object references shall not be permitted at all". So parseCOSStream() will never be called. Thus the new code will be
{code}
if (c == '<')
{
    retval = parseCOSDictionary();
}
{code}
To be sure, I ran my own test set and the digitalcorpora set (250000 files) and checked whether parseCOSStream() is ever called from PDFStreamParser. It isn't. How did this incorrect code end up there? I don't know, but it has been there since 2002:
http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/src/org/pdfbox/pdfparser/PDFStreamParser.java?revision=1.1&view=markup
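
A check of this kind can be sketched without PDFBox. The class below is a hypothetical helper, not part of PDFBox and not the instrumentation used for the test run above. It assumes the content stream is already decoded into a byte array, and it is a plain text scan that does not skip string literals or inline image data, so it can report false positives on such content. Unlike the original code, which only peeks at the single character 's', it looks for the full keyword "stream" after a closing ">>".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a PDFBox class.
final class ContentStreamCheck
{
    // White-space characters as defined for PDF syntax.
    private static boolean isPdfWhitespace(char c)
    {
        return c == 0x00 || c == 0x09 || c == 0x0A
            || c == 0x0C || c == 0x0D || c == 0x20;
    }

    // Returns true if any dictionary end ">>" is followed,
    // after optional white space, by the keyword "stream".
    static boolean dictFollowedByStreamKeyword(byte[] content)
    {
        // ISO-8859-1 maps every byte to exactly one char.
        String s = new String(content, StandardCharsets.ISO_8859_1);
        int pos = 0;
        while ((pos = s.indexOf(">>", pos)) >= 0)
        {
            int i = pos + 2;
            while (i < s.length() && isPdfWhitespace(s.charAt(i)))
            {
                i++;
            }
            if (s.startsWith("stream", i))
            {
                return true;
            }
            pos += 2;
        }
        return false;
    }

    public static void main(String[] args)
    {
        // A dictionary operand as it can appear in a content stream.
        byte[] marked = "/Span <</MCID 0>> BDC\nBT /F1 12 Tf (Hi) Tj ET\nEMC"
                .getBytes(StandardCharsets.ISO_8859_1);
        // A stream object, which belongs in the file body, not in a content stream.
        byte[] embedded = "<</Length 5>>\nstream\nhello\nendstream"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(dictFollowedByStreamKeyword(marked));   // false
        System.out.println(dictFollowedByStreamKeyword(embedded)); // true
    }
}
```

Running such a scan over the decoded content streams of a test corpus and counting the hits would give a rough, parser-independent cross-check of the result.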

Why do I care about this? It is related to a mailing list post by Andrea Vacondio, who mentioned that there are several versions of parseCOSStream(), so I'm trying to clean up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
