Posted to dev@pdfbox.apache.org by "Timo Boehme (Created) (JIRA)" <ji...@apache.org> on 2012/01/02 14:46:30 UTC

[jira] [Created] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Non-sequential PDF parser + PATCH
---------------------------------

                 Key: PDFBOX-1199
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1199
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 1.6.0
            Reporter: Timo Boehme


Currently, PDF parsing is done sequentially, which causes problems with stream parsing and with skipping unused content. The solution is a conforming parser that first reads the XREF tables, uses that information to parse only the required objects, and uses the length information for stream parsing. A completely new implementation of such a parser is being worked on in PDFBOX-1000. While that parser will be the long-term solution, a short-term solution based on existing code would be desirable. A first, incomplete solution was presented in PDFBOX-1104.
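
To illustrate the first step of such a parser: the position of the XREF information is announced by the 'startxref' keyword near the end of the file. The following is only a minimal sketch of locating it, not the code from the attached patch. The class name StartXrefLocator, its method names and the 2048 byte tail size are assumptions made for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: finds the offset announced by the
// last "startxref" keyword so that parsing can begin at the XREF information.
final class StartXrefLocator {
    private static final String KEYWORD = "startxref";
    // Assumed size of the file tail to inspect; chosen for this sketch only.
    private static final int TAIL_SIZE = 2048;

    /** Reads the tail of the file and returns the startxref offset, or -1 if none is found. */
    static long readStartXref(RandomAccessFile file) throws IOException {
        long fileLength = file.length();
        int tailLength = (int) Math.min(TAIL_SIZE, fileLength);
        byte[] tail = new byte[tailLength];
        file.seek(fileLength - tailLength);
        file.readFully(tail);
        return parseStartXref(tail);
    }

    /** Extracts the decimal offset that follows the last "startxref" keyword in the given bytes. */
    static long parseStartXref(byte[] tail) {
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int keywordPos = text.lastIndexOf(KEYWORD);
        if (keywordPos < 0) {
            return -1;
        }
        int pos = keywordPos + KEYWORD.length();
        // Skip the whitespace/line break between keyword and number.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int digitsStart = pos;
        while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        if (pos == digitsStart) {
            return -1; // keyword present but no offset after it
        }
        return Long.parseLong(text.substring(digitsStart, pos));
    }
}
```

A caller would seek to the returned offset and read the XREF table or XREF stream there. A return value of -1 would be a reason to fall back to sequential parsing.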

Starting from PDFBOX-1104, I have implemented an 'as much as possible' conforming parser, called the 'non-sequential parser', which handles all PDF documents (even inlined, with object streams etc.). The parser is a subclass of PDFParser and can be used as a drop-in replacement for it; it overrides the parse and getPage methods. The only restriction is currently that a file has to be specified instead of an input stream. To read the file efficiently and use it with the existing object parsing code, I developed a RandomAccessBufferedFileInputStream, which combines InputStream operations with seek operations and caches the data it has read.
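
The idea behind such a stream can be sketched as follows. This is not the attached RandomAccessBufferedFileInputStream, only a simplified illustration of combining InputStream reads with seek and a cache of pages already read. The class name SeekableCachedFileInputStream, the page size and the cache limit are assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: an InputStream over a file that supports seek()
// and keeps recently read pages in a small LRU cache.
class SeekableCachedFileInputStream extends InputStream {
    private static final int PAGE_SIZE = 4096; // assumed page size
    private static final int MAX_PAGES = 256;  // assumed cache limit

    private final RandomAccessFile raf;
    private final long length;
    private long position = 0;

    // Access-ordered map that evicts the least recently used page.
    private final Map<Long, byte[]> pages =
        new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > MAX_PAGES;
            }
        };

    SeekableCachedFileInputStream(File file) throws IOException {
        raf = new RandomAccessFile(file, "r");
        length = raf.length();
    }

    /** Moves the read position; the next read() continues from here. */
    void seek(long newPosition) throws IOException {
        if (newPosition < 0) {
            throw new IOException("negative seek position: " + newPosition);
        }
        position = newPosition;
    }

    long getPosition() {
        return position;
    }

    // Returns the page from the cache, loading it from the file on a miss.
    private byte[] page(long pageIndex) throws IOException {
        byte[] page = pages.get(pageIndex);
        if (page == null) {
            long start = pageIndex * PAGE_SIZE;
            page = new byte[(int) Math.min(PAGE_SIZE, length - start)];
            raf.seek(start);
            raf.readFully(page);
            pages.put(pageIndex, page);
        }
        return page;
    }

    @Override
    public int read() throws IOException {
        if (position >= length) {
            return -1;
        }
        byte[] page = page(position / PAGE_SIZE);
        int value = page[(int) (position % PAGE_SIZE)] & 0xFF;
        position++;
        return value;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (position >= length) {
            return -1;
        }
        // Serve at most the remainder of the current page per call.
        byte[] page = page(position / PAGE_SIZE);
        int pageOffset = (int) (position % PAGE_SIZE);
        int n = Math.min(len, page.length - pageOffset);
        System.arraycopy(page, pageOffset, b, off, n);
        position += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        raf.close();
        pages.clear();
    }
}
```

Because the stream is still an InputStream, existing code that reads objects from a stream can be pointed at any offset by calling seek first, and repeated jumps to the same region are served from the cache.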

To use NonSequentialPDFParser, small changes and additions to existing classes are needed: changing some methods/fields in PDFParser from private to protected, adding parsing of stream object information from XREF streams, storing this information in and retrieving it from XrefTrailerResolver (object ids are stored negated in order to distinguish them from offsets), and allowing the offset in PushBackInputStream to be reset. None of these changes alters the behavior of the current parser. Another requirement is the long offset patch (PDFBOX-1196), which is not included in the patch set provided here.
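
The negated object ids can be pictured with the following minimal sketch. It is not the XrefTrailerResolver code; the class XrefEntries and its method names are made up for this example, and the real key and value types may differ. It assumes that valid file offsets and object stream numbers are always positive, so the sign of the stored value is enough to tell the two cases apart.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of keeping both kinds of XREF entries in one map:
// a positive value is a byte offset in the file, a negative value is the
// negated object number of the object stream that contains the object.
final class XrefEntries {
    private final Map<Long, Long> entries = new HashMap<>();

    /** Registers an object that is stored directly in the file at the given offset. */
    void putFileOffset(long objectNumber, long byteOffset) {
        entries.put(objectNumber, byteOffset);
    }

    /** Registers an object that lives inside an object stream; the stream's number is stored negated. */
    void putObjectStreamRef(long objectNumber, long objectStreamNumber) {
        entries.put(objectNumber, -objectStreamNumber);
    }

    boolean isInObjectStream(long objectNumber) {
        Long value = entries.get(objectNumber);
        return value != null && value < 0;
    }

    /** Returns the number of the containing object stream. */
    long objectStreamNumber(long objectNumber) {
        Long value = entries.get(objectNumber);
        if (value == null || value >= 0) {
            throw new IllegalArgumentException("object " + objectNumber + " is not in an object stream");
        }
        return -value;
    }

    /** Returns the byte offset of an object stored directly in the file. */
    long fileOffset(long objectNumber) {
        Long value = entries.get(objectNumber);
        if (value == null || value <= 0) {
            throw new IllegalArgumentException("object " + objectNumber + " has no file offset");
        }
        return value;
    }
}
```

With this convention a parser first checks the sign: for a positive value it seeks to that offset, for a negative value it first loads the object stream with the negated number and then takes the object from it.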

The provided parser currently works in forceParsing=false mode, so a parsing error results in an IOException. In most cases this shouldn't be a problem: in my use cases, exceptions typically occur when trying to parse unused content or streams, which no longer cause problems with this new parser. In my setup I use the new parser first and, if a parsing error occurs, fall back to the sequential parser (a bit like Acrobat does if the XREF information is buggy):

try {
    // try the (mostly) standard-conforming non-sequential parser first
    doc = PDDocument.loadNonSeq( PDF_FILE, raBuf );
    handleDocument(doc);
} catch ( IOException ioe ) {
    // retry with the sequential parser and forced parsing
    doc = PDDocument.load( new FileInputStream(PDF_FILE), raBuf, true );
    handleDocument(doc);
}

For me this new parser works very well on large document collections and is a big step towards parsing all documents that common PDF tools also accept. While its behavior is nearly conforming, there is still a need for a clean, 'real' conforming parser. For instance, since the underlying object structure has no access to the parser, all objects have to be parsed before they can be used, including objects that might not be needed at all. Another step that would normally be unnecessary is copying the content of a stream; since we work on a file with random access, there would be no need for it. However, this parser should fill the gap until a full-featured and clean conforming parser is available.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Posted by "Maruan Sahyoun (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178405#comment-13178405 ] 

Maruan Sahyoun commented on PDFBOX-1199:
----------------------------------------

I had a quick look at the changes and I think this is a very good step forward. The new parsing of the xref should resolve a lot of current issues, as should many of the other changes. As I'm currently working on PDFBOX-1000, maybe we could have a quick chat about how to combine our efforts.
                

        

[jira] [Updated] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Posted by "Timo Boehme (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-1199:
--------------------------------

    Attachment: NonSequentialPDFParser.java
                RandomAccessBufferedFileInputStream.java
                2012-03-12_NonSeqParser_neededChanges.patch

The missing decryption support has now been added to the parser. The parser has been tested on several thousand documents and could now be committed.
I have attached the current parser (which will get some cleanup/reformatting before committing) and the required changes/additions, in particular a small restructuring of the security handler classes so that the parser can decrypt single objects directly (most of the 'decrypt' preparation code is moved out into a separate method).

If no one objects, I will commit the code in small steps: (1) add the RandomAccessBufferedFileInputStream together with a small change in PushBackInputStream, (2) add the changes to the security handler classes, (3) add the parser.
                

        

[jira] [Updated] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Posted by "Timo Boehme (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-1199:
--------------------------------

    Attachment: RandomAccessBufferedFileInputStream.java

file input stream implementation which caches read data and allows seek operations
                

        

[jira] [Updated] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Posted by "Timo Boehme (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-1199:
--------------------------------

    Attachment: NonSequentialPDFParser.java

the new non-sequential PDF parser
                

        

[jira] [Resolved] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Posted by "Timo Boehme (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme resolved PDFBOX-1199.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0

With commit revisions 1311015, 1311016, 1311018 and 1311020, the new parser has been added to PDFBox. Simply call the new PDDocument.loadNonSeq method to use this parser. If parsing fails (e.g. because of a wrong startxref), one can fall back to the standard parser via the other load methods.
                

        

[jira] [Assigned] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Posted by "Timo Boehme (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme reassigned PDFBOX-1199:
-----------------------------------

    Assignee: Timo Boehme
    

        

[jira] [Updated] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Posted by "Timo Boehme (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-1199:
--------------------------------

    Attachment: 2012-01-02_NonSequentialParser.patch

Changes to existing classes needed in order to use the new parser (please apply PDFBOX-1196 before this patch set).
                
> Non-sequential PDF parser + PATCH
> ---------------------------------
>
>                 Key: PDFBOX-1199
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1199
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>         Attachments: 2012-01-02_NonSequentialParser.patch, NonSequentialPDFParser.java, RandomAccessBufferedFileInputStream.java
>
>
