Posted to dev@pdfbox.apache.org by Ad...@swmc.com on 2010/12/07 19:14:51 UTC

Conforming parser

I'm trying to write a conforming parser, which should help out with 
various issues, and I'm hoping that someone can help me understand the PDF 
spec so I can get this done exactly to the specifications.

I noticed that 7.5.5 of ISO 32000-1:2008 says the startxref location is 
the byte offset in "the decoded stream".  It seems strange that it 
would be the *decoded* position if the first thing to do is to skip to the 
end of the file and read the EOF flag, xref location and trailer info. 
Does this mean that the expected process would be to read and decode the 
entire stream and write it to a temp file (or hold it in memory) before 
skipping to the end, reading the EOF flag, etc.?
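
The end-of-file lookup described above can be sketched as follows. This is a minimal sketch, not PDFBox code: `StartXrefFinder`, `findStartXref` and the 1024-byte tail size are names and values made up for illustration, and it assumes the startxref number can be applied directly as an offset into the file as stored on disk, which is exactly the point in question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class StartXrefFinder {
    // Assumption: the trailer keywords sit within the last 1024 bytes.
    private static final int TAIL_SIZE = 1024;

    /** Returns the number that follows the last "startxref", or -1 if absent. */
    static long findStartXref(RandomAccessFile file) throws IOException {
        long length = file.length();
        int tailLength = (int) Math.min(TAIL_SIZE, length);
        byte[] tail = new byte[tailLength];
        file.seek(length - tailLength);
        file.readFully(tail);

        // ISO-8859-1 maps one byte to one char, so indexes stay byte-accurate.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        // Last "startxref" that begins at or before the end-of-file marker.
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String number = text.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

Only the tail of the file is read, so nothing has to be decoded or held in memory before the offset is known. Seeking to the returned offset and checking for the xref keyword is a quick way to test the assumption on a given document.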

If this is correct, I'll just read in the File/InputStream/URL/URI/etc. 
and decode/write it to a RandomAccess object.  This should keep memory 
usage low since I'll be working off the RandomAccess object, so a 500MB 
PDF won't require 500MB of memory (and I have dealt with PDFs this large).
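
A minimal sketch of the spooling idea, using only the JDK. `TempFileSpooler` and `spool` are hypothetical names, `java.io.RandomAccessFile` stands in for whatever RandomAccess object the parser would really use, and the bytes are copied as-is (no decoding step is shown).

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox.
final class TempFileSpooler {
    /** Copies the stream to a temp file so it can be read with random access. */
    static File spool(InputStream in) throws IOException {
        File temp = File.createTempFile("pdf-spool", ".tmp");
        temp.deleteOnExit();
        // Files.copy streams through a small buffer, so heap use stays
        // roughly constant regardless of the document size.
        Files.copy(in, temp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return temp;
    }
}
```

The returned file can then be opened with `new RandomAccessFile(file, "r")` and read with `seek`, so only the parts that are actually visited are pulled into memory.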

Finally, as a test, I ran WriteDecodedDoc on my test document and then I 
expected the xref table to match up, but it still wasn't pointing to the 
location I expected.  Is there any existing code in PDFBox which would 
help me read/decode/write a PDF?

Any other suggestions, words of warning, etc.?  Like, how should I deal 
with violations of the spec?  Log and ignore, throw exception, have an 
object which deals with exceptions on a case-by-case basis?  It'd be 
pretty cool to have an object which would be smart enough to look and see 
"Read: '%%EO'; Expected: '%%EOF'" and not throw an exception, but if it 
were "Read: 'obj 49 0'; Expected: '%%EOF'" it might throw an exception. 
But I'm not going to go through the work of doing all that unless people 
will actually find it useful.  Maybe the conforming PDF parser could just 
throw an exception for non-conforming documents and then fall back to the 
PDFParser?  I'm looking for input from the community here.  Let me know 
what you think.
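
The case-by-case object could start out as small as this. It is only a sketch: `ViolationPolicy`, `Action` and `decide` are hypothetical names, and the rule used here (tolerate a token that is a truncated prefix of what was expected, reject anything else) is just one possible policy.

```java
// Hypothetical policy object, not part of PDFBox.
final class ViolationPolicy {
    enum Action { LOG_AND_CONTINUE, FAIL }

    /**
     * Decides how to react when the parser read one token but expected another.
     * A non-empty prefix of the expected token is treated as a truncation.
     */
    static Action decide(String read, String expected) {
        if (!read.isEmpty() && expected.startsWith(read)) {
            return Action.LOG_AND_CONTINUE;
        }
        return Action.FAIL;
    }
}
```

With this rule, `decide("%%EO", "%%EOF")` yields `LOG_AND_CONTINUE`, while `decide("obj 49 0", "%%EOF")` yields `FAIL`.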

---- 
Thanks,
Adam



- FHA 203b; 203k; HECM; VA; USDA; Conventional 
- Warehouse Lines; FHA-Authorized Originators 
- Lending and Servicing in over 45 States 
www.swmc.com   -  www.simplehecmcalculator.com   
Visit  www.swmc.com/resources   for helpful links on Training, Webinars, Lender Alerts and Submitting Conditions  

This email and any content within or attached hereto from Sun West Mortgage Company, Inc. is confidential and/or legally privileged. The information is intended only for the use of the individual or entity named on this email. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this email information is strictly prohibited, and that the documents should be returned to this office immediately by email. Receipt by anyone other than the intended recipient is not a waiver of any privilege. Please do not include your social security number, account number, or any other personal or financial information in the content of the email. Should you have any questions, please call (800) 453 7884.  

Re: Conforming parser

Posted by Ad...@swmc.com.
+1: non-conforming PDFs should still be parseable (which is why I like 
the idea of trying the ConformingPDFParser first and, if we find the 
document is non-conforming, falling back to the PDFParser).
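
A sketch of that fallback, with stand-in types only. `Parser`, `NonConformingException` and `parseWithFallback` are hypothetical and say nothing about the real ConformingPDFParser or PDFParser APIs.

```java
import java.io.IOException;

// Hypothetical types for illustration; none of these exist in PDFBox.
final class FallbackParsing {
    /** Stand-in for any parser that turns raw bytes into a result. */
    interface Parser<T> {
        T parse(byte[] data) throws IOException;
    }

    /** What a strict parser would throw on a spec violation. */
    static final class NonConformingException extends IOException {
        NonConformingException(String message) {
            super(message);
        }
    }

    /** Tries the strict parser first; only a conformance failure triggers the fallback. */
    static <T> T parseWithFallback(byte[] data, Parser<T> strict, Parser<T> lenient)
            throws IOException {
        try {
            return strict.parse(data);
        } catch (NonConformingException e) {
            // Other IOExceptions (e.g. a disk error) are not retried.
            return lenient.parse(data);
        }
    }
}
```

Catching only the dedicated exception type keeps genuine I/O failures from being silently retried with the lenient parser.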

Reading and unreading should not be an issue with the ConformingPDFParser 
because the file will not be processed from top to bottom; it will be 
read via random access.

I will be sure to take a look at PDFBOX-908 to see how the new parser 
handles these problems.

---- 
Thanks,
Adam







Re: Conforming parser

Posted by "martijn.list" <ma...@gmail.com>.
I'm sorry I cannot help you with the startxref issue but I have some
thoughts about parsing non-conforming PDFs.

> Any other suggestions, words of warning, etc.?  Like, how should I
> deal with violations of the spec?

I think it's important to gracefully handle non-conforming PDFs. Currently
PDFBox cannot handle certain PDFs that can be read by most PDF
readers. PDFBox should imho try its best to cope with PDF errors if
forceParsing is enabled.

I have added a JIRA entry
(https://issues.apache.org/jira/browse/PDFBOX-908) which contains some
patches to make PDFBox parse a large batch of commercial ebooks. I have
added a couple of example PDFs to the JIRA entry that try to mimic the
problems I found in real-life ebooks. The example PDFs cannot always be
opened by Acrobat because I created them by hand in a text editor.
The problems they replicate were copied from PDFs that could be
opened by Acrobat.

What I think is important is that in case of an exception, the parser
should not unread the data. If data is unread when an exception occurs,
the parser can get stuck in an infinite loop (for example,
test-integer-too-large.pdf results in an infinite loop on current PDFBox).
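
The point about not unreading can be illustrated with a toy token loop. This is a sketch under stated assumptions, not the PDFBox parser: `ProgressOnError`, `readToken` and `readIntegers` are hypothetical, and the input is reduced to tokens separated by spaces or newlines.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy reader, not part of PDFBox.
final class ProgressOnError {
    /** Reads one token delimited by spaces or newlines; null at end of input. */
    static String readToken(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c == ' ' || c == '\n') {
            c = in.read();
        }
        if (c < 0) {
            return null;
        }
        StringBuilder token = new StringBuilder();
        while (c >= 0 && c != ' ' && c != '\n') {
            token.append((char) c);
            c = in.read();
        }
        return token.toString();
    }

    /** Collects every token that parses as an int; bad tokens are skipped. */
    static List<Integer> readIntegers(PushbackInputStream in) throws IOException {
        List<Integer> values = new ArrayList<>();
        String token;
        while ((token = readToken(in)) != null) {
            try {
                values.add(Integer.parseInt(token));
            } catch (NumberFormatException e) {
                // Deliberately no in.unread(...) here: pushing the bad token
                // back would hand the same bytes to the next iteration, and
                // the loop would never advance past them.
            }
        }
        return values;
    }
}
```

Because a failed token stays consumed, every iteration moves the stream forward, so an oversized integer costs one skipped value rather than a hang.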

Kind regards,

Martijn Brinkers




-- 
Djigzo open source email encryption