Posted to dev@pdfbox.apache.org by Andreas Lehmkühler <an...@lehmi.de> on 2010/08/19 09:48:18 UTC

Re: Parsing issues & duplicate objects

Hi

> I'm trying to find some solution to the problem of documents which have 
> multiple objects with the same object ID and revision number.  I have some 
> documents which cause NPE and hence the documents can not be merged.  I 
> realize these are out of spec, but when the files are opened with Adobe 
> Reader, they are rendered just fine.  So (non-technical) people figure if 
> Adobe Reader can read it, why can't our software deal with it?
That's a very popular argument among (non-technical) people ... 

> I found some code in COSObject::setObject() which seems to take a crack at 
> solving this, but it's all commented out.  I uncommented it hoping it 
> would magically solve my problems, but there was no such luck.  Does 
> anyone know who wrote that code so I can collaborate with them (SVN 
> history didn't have anything)?
It seems to be pre-Apache. I guess Ben wrote that piece of code.

> According to Neil[1], the best thing to do would be to rewrite the parser. 
>  I'm not beyond rewriting the parser if that will solve my issue.  But I 
> need to understand how it currently works and how it should work before I 
> can take on something like that.  I noticed that section 7.5.5 (File 
> Trailer) of the PDF spec says "Conforming readers should read a PDF file 
> from its end." and I'm pretty sure PDFParser::parse() doesn't do that.
I guess those PDFs aren't out of spec; they just contain incremental updates.
Those updates (added, deleted or changed content) are appended to the end
of the document, and each update brings its own XRef section for the objects
it adds or replaces. If you want to get the most recent version of your PDF,
you have to start at the end to determine if there are any updates or not.
Have a look at section 7.5.6 (Incremental Updates) for further details.
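
To illustrate what starting at the end means in practice: the last
startxref keyword in a file is followed by the byte offset of the most
recent XRef section. The Java sketch below is not PDFBox code. The class
and method names are hypothetical, and it assumes the whole file has
already been read into a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class StartXrefFinder {

    /**
     * Returns the number that follows the last "startxref" keyword,
     * or -1 if the keyword or the number is missing.
     */
    static long findLastStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indices are the same as byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        // Skip the whitespace / end-of-line between keyword and number.
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (start == i) {
            return -1;
        }
        return Long.parseLong(text.substring(start, i));
    }
}
```

From that offset a reader would parse the XRef section and its trailer, then
follow the trailer's /Prev entry back to the previous XRef section, if any.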

> Anyone think looking at the COSObject will be any faster than rewriting 
> the parser?
A few weeks ago I invested some time in incremental updates but I didn't find
a solution. I think our parser isn't that bad and it should be possible to improve
it to handle incremental updates.

> The documents I have are all confidential, so unfortunately I can't share 
> them, but there are some other[1] issues[2] which seem to be somewhat 
> related.  I'm going to keep looking for a file I can get approved for 
> release so I can upload it to JIRA with an exact stacktrace and 
> everything.
> 
> [1] https://issues.apache.org/jira/browse/PDFBOX-569
> [2] https://issues.apache.org/jira/browse/PDFBOX-720
Especially [2] gives some good pointers.

BR
Andreas Lehmkühler


Re: Parsing issues & duplicate objects

Posted by Ad...@swmc.com.
Thanks for the response Andreas.  I read through the section of the PDF 
spec on the trailer and xref table, and I think I understand it all. 
However, I'm coming across a situation where I don't know how it's 
supposed to be parsed.  It's a file which has been incrementally updated 
one time.  The xref table for the original document says that object 2, 
generation 0 should be 130 bytes into the file.  It is, and here's the 
definition:
2 0 obj<</Type/Pages/Kids[3 0 R 4 0 R]/Count 17>>
endobj

However, the second xref table says that 2, 0 is at byte offset 775205. 
This is also true and here's what I find at that location:
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R 4 0 R ]
/Count 17
/Parent 218 0 R
>>
endobj

Now, in this case, they're both Pages and the Kids and the Count match, so 
it should not make much difference in how it's handled, but my question 
is: had they been different, how should that have been handled?  I can say 
for certain that the current PDFBox code takes the first one it finds 
(technically, it is overwritten with the second one, but then 
PDFParser::resolveConflicts() puts it back).  I have a hunch the last one 
is the one we really want, as that should have everything the old one had 
plus the latest revisions and possibly more items in the dictionary.
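
For reference, section 7.5.6 describes the rule for this case: an entry in
a later XRef section overrides the entry for the same object in an earlier
one. A minimal Java sketch of that rule follows. It is not the PDFBox
implementation; XrefResolver and its string keys are hypothetical, and it
assumes the XRef sections have already been parsed into maps from
"object generation" to byte offset, listed newest first (the order you get
by following /Prev from the last trailer).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class XrefResolver {

    /**
     * Merges XRef sections given newest first. The first offset seen for
     * a key wins, so an incremental update overrides the original entry.
     */
    static Map<String, Long> resolve(List<Map<String, Long>> sectionsNewestFirst) {
        Map<String, Long> effective = new LinkedHashMap<>();
        for (Map<String, Long> section : sectionsNewestFirst) {
            for (Map.Entry<String, Long> entry : section.entrySet()) {
                // Older sections only fill in objects the newer ones did not mention.
                effective.putIfAbsent(entry.getKey(), entry.getValue());
            }
        }
        return effective;
    }
}
```

With the file above, the update section maps "2 0" to 775205 and the
original maps it to 130, so the resolved offset for "2 0" is 775205.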

I think the issue I'm facing is unrelated to the parser not starting at 
the bottom and following the xref tables.  I found out what is happening 
with my file...  document.dereferenceObjectStreams(); (line 207 of 
PDFParser.java) overwrites the correct data with some junk which doesn't 
belong there.  If I comment out this line, everything works fine.  I 
believe the proper solution will be to not overwrite existing nodes, but 
only to add new ones.  This is what is done when a duplicate ID is found 
outside of a stream (technically it's overwritten and then restored by the 
conflict resolver, but the effect is the same).  I hope to have enough 
time to make that change and test it out tomorrow.  I'll make a JIRA issue 
once I get to the bottom of it.
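
Roughly, the change I have in mind looks like the Java sketch below. It
is only an illustration of "add, don't overwrite": ObjectPool is a
made-up stand-in for the document's object table, and the object bodies
are plain strings rather than real COS objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the document's object table, not PDFBox code.
final class ObjectPool {
    private final Map<String, String> objects = new HashMap<>();

    /**
     * Stores the object only if that number/generation pair is not
     * defined yet. Returns true if it was stored, false if an existing
     * definition was kept.
     */
    boolean addIfAbsent(int objectNumber, int generation, String body) {
        return objects.putIfAbsent(key(objectNumber, generation), body) == null;
    }

    /** Returns the stored body, or null if the object is unknown. */
    String get(int objectNumber, int generation) {
        return objects.get(key(objectNumber, generation));
    }

    private static String key(int objectNumber, int generation) {
        return objectNumber + " " + generation;
    }
}
```

Objects parsed from the file body would go in first, and anything coming
out of an object stream afterwards would only fill the gaps.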

---- 
Thanks,
Adam







