Posted to dev@pdfbox.apache.org by "Adam Nichols (Commented) (JIRA)" <ji...@apache.org> on 2012/01/01 11:39:30 UTC

[jira] [Commented] (PDFBOX-1000) Conforming parser

    [ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178138#comment-13178138 ] 

Adam Nichols commented on PDFBOX-1000:
--------------------------------------

Lexer: I like the idea of keeping the code as small and independent as possible.

XRef Streams: cool, glad to see that's been added!

Modes of parsing: Strict mode is good because it helps developers of other products ensure that their output conforms.  It may also help stop unknown attacks, since the parser will just bail with an error message when it gets a malformed PDF (that doesn't help with flaws in the format itself, but then again not much will help there).  Relaxed parsing is also a nice option, since people expect the software to "just work" even if there are small errors in the file.  I don't like the idea of trying to clone what Adobe Acrobat does.  Its behavior varies with each version of the PDF spec (at a minimum), is much more complex than necessary, has been plagued by security problems, and offers no advantage over the strict/relaxed modes.  I'd rather do what's right (throw an exception if a PDF is non-conforming) or what's popular (parse anything in the best way we know how), with the choice left to the person using the library.
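
To make the strict/relaxed split concrete, here is a minimal sketch of how a single violation hook could drive both modes.  Everything in it (ParseMode, ModeAwareParser, the header check as the example rule) is hypothetical and not taken from the attached patch; it only assumes that the "%PDF-" header is expected at offset 0.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; none of these types exist in PDFBox.
enum ParseMode { STRICT, RELAXED }

class ModeAwareParser {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private final ParseMode mode;
    private final List<String> warnings = new ArrayList<>();

    ModeAwareParser(ParseMode mode) {
        this.mode = mode;
    }

    /** STRICT: fail immediately. RELAXED: record the problem and carry on. */
    private void violation(String message) throws IOException {
        if (mode == ParseMode.STRICT) {
            throw new IOException("Non-conforming PDF: " + message);
        }
        warnings.add(message);
    }

    /** Returns the offset of the "%PDF-" header; any offset other than 0 is a violation. */
    int findHeader(byte[] data) throws IOException {
        for (int i = 0; i + HEADER.length <= data.length; i++) {
            if (startsWithAt(data, i)) {
                if (i != 0) {
                    violation("header at offset " + i + ", expected 0");
                }
                return i;
            }
        }
        // There is nothing to recover from here, so both modes fail.
        throw new IOException("No %PDF- header found");
    }

    private static boolean startsWithAt(byte[] data, int offset) {
        for (int j = 0; j < HEADER.length; j++) {
            if (data[offset + j] != HEADER[j]) {
                return false;
            }
        }
        return true;
    }

    /** Problems tolerated so far in RELAXED mode. */
    List<String> warnings() {
        return warnings;
    }
}
```

The point of routing every check through one method is that the caller picks the policy once, and no Acrobat-specific special cases leak into the individual checks.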

Please make sure to include references to the spec when relevant.  For example, I'm not aware of anything which says "startxref is expected to be within the last 1024 bytes."  I'd expect that to be the case normally, but if the xref table is very large, I could imagine it sometimes not holding.
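
One way to avoid hard-coding a 1024-byte limit is to search a tail window that grows until the keyword is found.  The following is only a sketch under my own assumptions: StartXrefLocator, the doubling strategy, and the 1 GiB cap are all hypothetical, not something from the patch or the spec.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name and the window-doubling policy are illustrative only.
class StartXrefLocator {
    private static final byte[] KEYWORD = "startxref".getBytes(StandardCharsets.US_ASCII);
    private static final long MAX_WINDOW = 1L << 30; // arbitrary cap so the tail fits in an array

    /** Returns the number that follows the last "startxref" keyword in the file. */
    static long findXrefOffset(RandomAccessFile file, int initialWindow) throws IOException {
        long length = file.length();
        long limit = Math.min(length, MAX_WINDOW);
        long window = Math.min(Math.max(initialWindow, KEYWORD.length), limit);
        while (true) {
            // Read only the last `window` bytes of the file.
            byte[] tail = new byte[(int) window];
            file.seek(length - window);
            file.readFully(tail);
            int pos = lastIndexOf(tail, KEYWORD);
            if (pos >= 0) {
                return parseOffset(tail, pos + KEYWORD.length);
            }
            if (window == limit) {
                throw new IOException("startxref not found");
            }
            window = Math.min(window * 2, limit); // widen the search and retry
        }
    }

    private static int lastIndexOf(byte[] data, byte[] pattern) {
        for (int i = data.length - pattern.length; i >= 0; i--) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    private static long parseOffset(byte[] data, int from) throws IOException {
        int i = from;
        while (i < data.length
                && (data[i] == ' ' || data[i] == '\r' || data[i] == '\n' || data[i] == '\t')) {
            i++;
        }
        int start = i;
        long value = 0;
        while (i < data.length && data[i] >= '0' && data[i] <= '9') {
            value = value * 10 + (data[i] - '0');
            i++;
        }
        if (i == start) {
            throw new IOException("no offset after startxref");
        }
        return value;
    }
}
```

A strict mode could still reject files where the first window misses, provided the window size it insists on comes with a spec reference.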

My circumstances have changed drastically since I last worked on this (in June), so I can't dedicate nearly as much time as I could before.  However, I'm still interested in following the progress and helping out when and where I can.  On the brighter side, I should now be able to make sure that all the PDFs I use can be committed as JUnit test cases.  If there are any small things related to the conforming parser which need to be done, feel free to mention them either here or on the developer mailing list, so I know where I can jump in and help if I get some free time.
                
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and the trailer[1].  Once these are read, it will read in the xref table so it can locate other objects and revisions.  This also allows skipping objects which have been rendered obsolete (per the xref table)[2], and it allows the minimum amount of information to be read when the file is loaded, with subsequent information loaded if and when it is requested.  This is all laid out in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new classes to accommodate lazy reading, which is a very different paradigm from the existing parser.  Using separate classes, with changes to existing classes kept to a minimum, will also keep regression bugs out of the PDDocument and BaseParser classes.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular object"
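
The lazy-reading idea in the description above can be sketched as a table that stores only byte offsets up front and reads an object the first time it is asked for.  LazyObjectTable and its methods are hypothetical names, not the classes in the attached patch; the sketch also assumes xref sections are registered newest first, so that an offset from an older revision never replaces a newer one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical illustration; T stands in for whatever parsed-object type is used.
class LazyObjectTable<T> {
    private final Map<Integer, Long> offsets = new HashMap<>();
    private final Map<Integer, T> loaded = new HashMap<>();
    private final LongFunction<T> readAtOffset;
    private int loadCount = 0;

    /** readAtOffset parses the object stored at a given byte offset. */
    LazyObjectTable(LongFunction<T> readAtOffset) {
        this.readAtOffset = readAtOffset;
    }

    /** Registers an xref entry; assumes newest sections are added first, so older duplicates are ignored. */
    void addEntry(int objectNumber, long offset) {
        offsets.putIfAbsent(objectNumber, offset);
    }

    /** Reads the object on first access and caches it; returns null for unknown object numbers. */
    T get(int objectNumber) {
        T cached = loaded.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = offsets.get(objectNumber);
        if (offset == null) {
            return null;
        }
        T value = readAtOffset.apply(offset);
        loaded.put(objectNumber, value);
        loadCount++;
        return value;
    }

    /** Number of objects actually read so far. */
    int loadCount() {
        return loadCount;
    }
}
```

Opening a document then costs one pass over the xref data, and obsolete revisions of an object are never read at all.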
