Posted to dev@pdfbox.apache.org by "Maruan Sahyoun (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2012/01/11 18:04:39 UTC

[jira] [Issue Comment Edited] (PDFBOX-1000) Conforming parser

    [ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184195#comment-13184195 ] 

Maruan Sahyoun edited comment on PDFBOX-1000 at 1/11/12 5:03 PM:
-----------------------------------------------------------------

Continuing the work on the parser: maybe someone more experienced with PDFBox can help me map the basic PDF objects, as documented in ISO 32000, to the COS model classes in PDFBox.

Comment [ISO 32000-1:2008: 7.2.3] -> none?
Boolean [ISO 32000-1:2008: 7.3.2] -> COSBoolean?
Number [ISO 32000-1:2008: 7.3.3] -> COSReal, COSInteger?
Literal String [ISO 32000-1:2008: 7.3.4.2] -> COSString?
Hex String [ISO 32000-1:2008: 7.3.4.3] -> COSString?
Name Object [ISO 32000-1:2008: 7.3.5] -> COSName?
Keyword [ISO 32000-1:2008: 7.3] (the spec doesn't define this as a type of its own, only as part of some other types) -> none?
Array Objects [ISO 32000-1:2008: 7.3.6] -> COSArray?
Dictionary Objects [ISO 32000-1:2008: 7.3.7] -> COSDictionary?
Stream Objects [ISO 32000-1:2008: 7.3.8] -> COSStream?
Null Object [ISO 32000-1:2008: 7.3.9] -> COSNull?
Indirect Objects [ISO 32000-1:2008: 7.3.10] ?
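
To make the tentative mapping above easier to discuss, here is a minimal sketch of how a tokenizer might pick a COS class from the first characters of a token. Everything in it is an assumption for illustration: CosTypeSketch and cosTypeFor are hypothetical names, not PDFBox API, and the method only returns class names as strings so that it runs without PDFBox on the classpath. Streams and indirect objects cannot be recognised from a single token, so they are deliberately left out.

```java
class CosTypeSketch {

    /**
     * Returns the name of the COS class tentatively proposed for a single
     * token, or "none" where the list above has no mapping (comments, keywords).
     * Hypothetical helper for discussion only, not part of PDFBox.
     */
    static String cosTypeFor(String token) {
        if (token.startsWith("%")) return "none";              // comment, 7.2.3
        if (token.equals("true") || token.equals("false")) {
            return "COSBoolean";                               // 7.3.2
        }
        if (token.equals("null")) return "COSNull";            // 7.3.9
        if (token.startsWith("/")) return "COSName";           // 7.3.5
        if (token.startsWith("(")) return "COSString";         // literal string, 7.3.4.2
        if (token.startsWith("<<")) return "COSDictionary";    // 7.3.7, must be tested before "<"
        if (token.startsWith("<")) return "COSString";         // hex string, 7.3.4.3
        if (token.startsWith("[")) return "COSArray";          // 7.3.6
        if (token.matches("[+-]?\\d+")) return "COSInteger";   // 7.3.3, integer form
        if (token.matches("[+-]?(\\d+\\.\\d*|\\.\\d+)")) {
            return "COSReal";                                  // 7.3.3, real form such as 4. or -.002
        }
        return "none";                                         // keywords such as obj, endobj, stream, R
    }

    public static void main(String[] args) {
        String[] samples = {"true", "42", "-.002", "/Type", "(text)", "<4E6F>", "<<", "[", "null", "endobj"};
        for (String s : samples) {
            System.out.println(s + " -> " + cosTypeFor(s));
        }
    }
}
```

The only ordering that matters is testing "<<" before "<", since a dictionary opener would otherwise be taken for a hex string.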

What are the other classes in o.a.pdfbox.cos for?

If wanted, I can also go ahead and add some comments from the ISO spec to the documentation of the o.a.pdfbox.cos classes.
                
      was (Author: msahyoun):
    Continuing the work on the parser: maybe someone more experienced with PDFBox can help me map the basic PDF objects, as documented in ISO 32000, to the COS model classes in PDFBox.

Comment [ISO 32000-1:2008: 7.2.3] -> none?
Boolean [ISO 32000-1:2008: 7.3.2]
Number [ISO 32000-1:2008: 7.3.3] -> COSReal, COSInteger?
Literal String [ISO 32000-1:2008: 7.3.4.2] -> COSString?
Hex String [ISO 32000-1:2008: 7.3.4.3] -> COSString?
Name Object [ISO 32000-1:2008: 7.3.5] -> COSName?
Keyword [ISO 32000-1:2008: 7.3] (the spec doesn't define this as a type of its own, only as part of some other types) -> none?
Array Objects [ISO 32000-1:2008: 7.3.6] -> COSArray?
Dictionary Objects [ISO 32000-1:2008: 7.3.7] -> COSDictionary?
Stream Objects [ISO 32000-1:2008: 7.3.8] -> COSStream?
Null Object [ISO 32000-1:2008: 7.3.9] -> COSNull?
Indirect Objects [ISO 32000-1:2008: 7.3.10] ?

What are the other classes in o.a.pdfbox.cos for?

If wanted, I can also go ahead and add some comments from the ISO spec to the documentation of the o.a.pdfbox.cos classes.
                  
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and trailer[1].  Once this is read, it will read in the xref table so it can locate other objects and revisions.  This also allows skipping objects which have been rendered obsolete (per the xref table)[2].  It also allows the minimum amount of information to be read when the file is loaded, and then subsequent information will be loaded if and when it is requested.  This is all laid out in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new classes in order to accommodate the lazy reading which is a very different paradigm from the existing parser.  Using separate classes will also eliminate the possibility of regression bugs from making their way into the PDDocument or BaseParser classes.  Changes to existing classes will be kept to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular object"
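
As an illustration of the "read from the end" step in the issue description quoted above, here is a minimal sketch that locates the xref offset from the tail of a file. It is not the code from the attached patch: TailReaderSketch, findStartXref and the 1024-byte tail window are assumptions made for this example, and it works on an in-memory byte array to stay self-contained. A lazy parser would more likely seek to the end of a RandomAccessFile instead.

```java
import java.nio.charset.StandardCharsets;

class TailReaderSketch {

    /** Number of trailing bytes to inspect; an arbitrary choice for this sketch. */
    private static final int TAIL_WINDOW = 1024;

    /**
     * Returns the byte offset written between the last "startxref" keyword
     * and the last "%%EOF" marker, or -1 if either is missing or the value
     * in between is not a number. Hypothetical helper, not part of PDFBox.
     */
    static long findStartXref(byte[] pdf) {
        int tailLen = Math.min(pdf.length, TAIL_WINDOW);
        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost in decoding.
        String tail = new String(pdf, pdf.length - tailLen, tailLen, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        // Search backward from the EOF marker for the keyword that precedes it.
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String number = tail.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\ntrailer\n<< /Size 1 >>\nstartxref\n9\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("xref offset: " + findStartXref(pdf));
    }
}
```

With the offset in hand, the parser can jump straight to the xref table and trailer, and load other objects only when they are requested.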
