Posted to dev@pdfbox.apache.org by "Maruan Sahyoun (Commented) (JIRA)" <ji...@apache.org> on 2011/12/31 13:00:31 UTC
[jira] [Commented] (PDFBOX-1000) Conforming parser

    [ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177992#comment-13177992 ] 

Maruan Sahyoun commented on PDFBOX-1000:
----------------------------------------

Just to let you know about the (slow) progress I'm making.

I've decided to split the parsing into two parts. A (new) Lexer reads the file and returns individual tokens (Number, Comment, NameObject, DictionaryStart, ArrayStart ...) together with their type. The Lexer is driven by the ConformingParser, which looks for specific tokens when parsing particular parts of the PDF. The reason for the split is to reduce the amount of code in the individual classes and to let the ConformingParser deal with higher-level objects. The tokens carry the raw data, e.g. a Hex String is delivered as is. The ConformingParser has to do the interpretation, as I wanted to keep the semantics within the parser.
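
To illustrate the split, here is a minimal sketch of a lexer that only classifies tokens and hands back their raw text. It is not the code attached to this issue: the names TokenType, Token and SketchLexer are made up for the example, it reads from a String instead of a RandomAccessFile, and literal strings in parentheses are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token kinds, loosely following the names used in the message.
enum TokenType { NUMBER, COMMENT, NAME, DICT_START, DICT_END, ARRAY_START, ARRAY_END, HEX_STRING, KEYWORD }

final class Token {
    final TokenType type;
    final String raw; // raw text as found in the input, no interpretation

    Token(TokenType type, String raw) {
        this.type = type;
        this.raw = raw;
    }
}

final class SketchLexer {
    private static final String DELIMITERS = "()<>[]{}/%";
    private final String in;
    private int pos;

    SketchLexer(String in) {
        this.in = in;
    }

    // Convenience helper: lex the whole input into a list.
    static List<Token> tokenize(String input) {
        SketchLexer lexer = new SketchLexer(input);
        List<Token> tokens = new ArrayList<>();
        for (Token t = lexer.next(); t != null; t = lexer.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    // Returns the next token, or null at end of input.
    Token next() {
        while (pos < in.length() && isWhitespace(in.charAt(pos))) pos++;
        if (pos >= in.length()) return null;
        int start = pos;
        switch (in.charAt(pos)) {
            case '%': // comment runs to the end of the line
                while (pos < in.length() && in.charAt(pos) != '\n' && in.charAt(pos) != '\r') pos++;
                return token(TokenType.COMMENT, start);
            case '[':
                pos++;
                return token(TokenType.ARRAY_START, start);
            case ']':
                pos++;
                return token(TokenType.ARRAY_END, start);
            case '<':
                if (peek(1) == '<') {
                    pos += 2;
                    return token(TokenType.DICT_START, start);
                }
                while (pos < in.length() && in.charAt(pos) != '>') pos++;
                if (pos < in.length()) pos++; // include the closing '>'
                return token(TokenType.HEX_STRING, start); // delivered as is
            case '>':
                if (peek(1) == '>') {
                    pos += 2;
                    return token(TokenType.DICT_END, start);
                }
                pos++;
                return token(TokenType.KEYWORD, start);
            case '/':
                pos++;
                while (pos < in.length() && isRegular(in.charAt(pos))) pos++;
                return token(TokenType.NAME, start);
            default:
                while (pos < in.length() && isRegular(in.charAt(pos))) pos++;
                if (pos == start) pos++; // unhandled delimiter such as '(': consume one char
                String raw = in.substring(start, pos);
                boolean number = raw.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
                return new Token(number ? TokenType.NUMBER : TokenType.KEYWORD, raw);
        }
    }

    private Token token(TokenType type, int start) {
        return new Token(type, in.substring(start, pos));
    }

    private char peek(int n) {
        return pos + n < in.length() ? in.charAt(pos + n) : '\0';
    }

    private static boolean isWhitespace(char c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    private static boolean isRegular(char c) {
        return !isWhitespace(c) && DELIMITERS.indexOf(c) < 0;
    }
}
```

A parser built on top of this would call next() and decide what a token means in context, for example decoding the hex digits of a HEX_STRING token only when it actually needs the string value.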

The Lexer is ready with its base functionality and will be extended as work on the ConformingParser continues. Currently it can only read from a RandomAccessFile; this needs to be changed later on, but I wanted to move forward with the ConformingParser first.

At the high level the ConformingParser is kept as Adam started to develop it, but as I visit the individual functions I'm changing them to use the Lexer. I've also already changed some of the parameters from int to long, e.g. for the byte offset in the xref table, as this is defined to hold up to 10 digits, in line with PDFBOX-1196.
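
A small sketch of why the offset needs a long: a 10-digit decimal field can hold values up to 9999999999, which is larger than Integer.MAX_VALUE (2147483647). The class and method names below are made up for the example and are not part of the patch.

```java
final class XrefOffsetSketch {

    // Parses the 10-digit byte offset at the start of a classic xref table
    // entry such as "0000000017 00000 n".
    static long parseOffset(String entryLine) {
        return Long.parseLong(entryLine.substring(0, 10));
    }

    // True if the offset could still be stored in an int without overflow.
    static boolean fitsInInt(long offset) {
        return offset <= Integer.MAX_VALUE;
    }
}
```

With an int, Integer.parseInt("9999999999") would throw a NumberFormatException, so files larger than 2 GB could not be addressed.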

The XrefEntry class has been extended to deal with regular Xref entries as well as Xref Stream entries i.e. the different properties are reflected in the class. This can be extended later to be usable when writing a PDF if the need arises.
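
As a rough illustration of what "different properties" means here, the sketch below models the three entry kinds that ISO 32000-1 describes for cross-reference streams (free, in use at a byte offset, compressed inside an object stream); classic table entries only use the first two. This is not the attached XrefEntry.java; the class name, field names and factory methods are assumptions made for the example.

```java
final class XrefEntrySketch {

    enum Kind { FREE, IN_USE, COMPRESSED }

    final Kind kind;
    final int objectNumber;
    final long byteOffset;        // only meaningful for IN_USE
    final int generation;         // FREE and IN_USE; compressed objects are generation 0
    final int objectStreamNumber; // only meaningful for COMPRESSED
    final int indexInStream;      // only meaningful for COMPRESSED

    private XrefEntrySketch(Kind kind, int objectNumber, long byteOffset,
                            int generation, int objectStreamNumber, int indexInStream) {
        this.kind = kind;
        this.objectNumber = objectNumber;
        this.byteOffset = byteOffset;
        this.generation = generation;
        this.objectStreamNumber = objectStreamNumber;
        this.indexInStream = indexInStream;
    }

    // Entry for an object stored directly in the file at the given offset.
    static XrefEntrySketch inUse(int objectNumber, long byteOffset, int generation) {
        return new XrefEntrySketch(Kind.IN_USE, objectNumber, byteOffset, generation, -1, -1);
    }

    // Entry for a free (deleted or never used) object number.
    static XrefEntrySketch free(int objectNumber, int generation) {
        return new XrefEntrySketch(Kind.FREE, objectNumber, -1L, generation, -1, -1);
    }

    // Entry for an object stored inside an object stream (xref streams only).
    static XrefEntrySketch compressed(int objectNumber, int objectStreamNumber, int indexInStream) {
        return new XrefEntrySketch(Kind.COMPRESSED, objectNumber, -1L, 0,
                objectStreamNumber, indexInStream);
    }
}
```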
                
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and trailer[1].  Once this is read, it will read in the xref table so it can locate other objects and revisions.  This also allows skipping objects which have been rendered obsolete (per the xref table)[2].  It also allows the minimum amount of information to be read when the file is loaded, and then subsequent information will be loaded if and when it is requested.  This is all laid out in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new classes in order to accommodate the lazy reading which is a very different paradigm from the existing parser.  Using separate classes will also eliminate the possibility of regression bugs from making their way into the PDDocument or BaseParser classes.  Changes to existing classes will be kept to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular object"
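
The "read from the end" step in the description above can be sketched roughly as follows. This assumes the caller has already read the last block of the file into a byte array; the class and method names are hypothetical and do not come from the attached patch.

```java
import java.nio.charset.StandardCharsets;

final class StartXrefSketch {

    // Given the last bytes of a PDF file, returns the byte offset that follows
    // the final "startxref" keyword, i.e. where the xref section begins.
    static long findStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes stay aligned.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IllegalArgumentException("no %%EOF marker in tail");
        }
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IllegalArgumentException("no startxref keyword before %%EOF");
        }
        String number = text.substring(keyword + "startxref".length(), eof).trim();
        return Long.parseLong(number);
    }
}
```

A real implementation would then seek to the returned offset, read the xref section and trailer there, and load individual objects only when they are requested.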

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira