Posted to dev@pdfbox.apache.org by Ad...@swmc.com on 2010/08/18 21:36:50 UTC

Parsing issues & duplicate objects

I'm trying to find a solution to the problem of documents that contain
multiple objects with the same object ID and generation (revision) number.
I have some documents that cause a NullPointerException (NPE), so they
cannot be merged.  I realize these files are out of spec, but Adobe Reader
opens and renders them just fine.  So (non-technical) people figure that if
Adobe Reader can read them, why can't our software deal with them?

I found some code in COSObject::setObject() which seems to take a crack at 
solving this, but it's all commented out.  I uncommented it hoping it 
would magically solve my problems, but there was no such luck.  Does 
anyone know who wrote that code so I can collaborate with them (SVN 
history didn't have anything)?
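
To make the duplicate problem concrete, here is a minimal sketch of one
policy a lenient reader could apply: when two definitions share the same
object number and generation, keep the one that appears later in the file.
This is only an assumption about what a tolerant reader might do.  It is
not the commented-out code in COSObject::setObject(), and the class name
DuplicateObjectIndex and its methods are hypothetical, not PDFBox API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical index of object definitions, keyed by object number and
 * generation. Assumed policy: the definition registered last wins.
 */
final class DuplicateObjectIndex {

    // Key: "objectNumber generation"; value: byte offset of the kept definition.
    private final Map<String, Long> offsets = new LinkedHashMap<>();
    private int duplicates = 0;

    private static String key(long objectNumber, int generation) {
        return objectNumber + " " + generation;
    }

    /** Records a definition; a later one with the same key replaces the earlier one. */
    public void register(long objectNumber, int generation, long byteOffset) {
        Long previous = offsets.put(key(objectNumber, generation), byteOffset);
        if (previous != null) {
            duplicates++; // counted so the caller can log or report the conflict
        }
    }

    /** Returns the offset of the kept definition, or null if the object is unknown. */
    public Long offsetOf(long objectNumber, int generation) {
        return offsets.get(key(objectNumber, generation));
    }

    /** Number of definitions that overwrote an earlier one. */
    public int duplicateCount() {
        return duplicates;
    }
}
```

Registering offsets in file order, for example (12, 0) at 150 and again at
9000, leaves the index pointing at 9000 with one duplicate counted.  Whether
"last wins" is the right rule for a given broken file is an open question.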

According to Neil[1], the best thing to do would be to rewrite the parser.
I'm not opposed to rewriting the parser if that will solve my issue, but I
need to understand how it currently works and how it should work before I
can take on something like that.  I noticed that section 7.5.5 (File
Trailer) of the PDF spec says "Conforming readers should read a PDF file
from its end." and I'm pretty sure PDFParser::parse() doesn't do that.
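
As a rough illustration of what "reading from the end" means, the sketch
below reads only the tail of a file, finds the last "startxref" keyword,
and parses the byte offset that follows it (the position of the
cross-reference section).  The class name StartXrefLocator, its methods,
and the 2048-byte tail size are all assumptions made for this example.
None of it is PDFBox API or a description of how PDFParser works.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: locates the offset recorded after the last "startxref". */
final class StartXrefLocator {

    private static final String KEYWORD = "startxref";
    private static final int TAIL_SIZE = 2048; // assumed; large enough for typical trailers

    /** Parses the offset following the last "startxref" in the given bytes, or -1. */
    public static long parseStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to one char, so string indexes match byte indexes.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int idx = text.lastIndexOf(KEYWORD);
        if (idx < 0) {
            return -1;
        }
        int i = idx + KEYWORD.length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++; // skip the end-of-line marker after the keyword
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no digits after it
        }
        return Long.parseLong(text.substring(start, i));
    }

    /** Reads at most TAIL_SIZE bytes from the end of the file and parses them. */
    public static long findStartXref(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            int n = (int) Math.min(length, TAIL_SIZE);
            byte[] tail = new byte[n];
            raf.seek(length - n);
            raf.readFully(tail);
            return parseStartXref(tail);
        }
    }
}
```

A reader built this way would then seek to the returned offset and use the
cross-reference data there to decide which object definitions are current,
instead of scanning the file front to back.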

Does anyone think fixing this in COSObject would be any faster than
rewriting the parser?

The documents I have are all confidential, so unfortunately I can't share 
them, but there are some other[1] issues[2] which seem to be somewhat 
related.  I'm going to keep looking for a file I can get approved for 
release so I can upload it to JIRA with an exact stacktrace and 
everything.

[1] https://issues.apache.org/jira/browse/PDFBOX-569
[2] https://issues.apache.org/jira/browse/PDFBOX-720

---- 
Thanks,
Adam


