Posted to dev@pdfbox.apache.org by Andreas Lehmkühler <an...@lehmi.de> on 2010/08/19 09:48:18 UTC
Re: Parsing issues & duplicate objects
Hi
> I'm trying to find some solution to the problem of documents which have
> multiple objects with the same object ID and revision number. I have some
> documents which cause NPE and hence the documents can not be merged. I
> realize these are out of spec, but when the files are opened with Adobe
> Reader, they are rendered just fine. So (non-technical) people figure if
> Adobe Reader can read it, why can't our software deal with it?
That's a very popular argument among (non-technical) people ...
> I found some code in COSObject::setObject() which seems to take a crack at
> solving this, but it's all commented out. I uncommented it hoping it
> would magically solve my problems, but there was no such luck. Does
> anyone know who wrote that code so I can collaborate with them (SVN
> history didn't have anything)?
It seems to be pre-Apache. I guess Ben wrote that piece of code.
> According to Neil[1], the best thing to do would be to rewrite the parser.
> I'm not beyond rewriting the parser if that will solve my issue. But I
> need to understand how it currently works and how it should work before I
> can take on something like that. I noticed that section 7.5.5 (File
> Trailer) of the PDF spec says "Conforming readers should read a PDF file
> from its end." and I'm pretty sure PDFParser::parse() doesn't do that.
I guess those PDFs aren't out of spec, they just contain incremental updates.
Those updates (added, deleted or changed content) are appended to the end
of the document, and each update brings its own XRef section to describe them.
If you want to get the most recent version of your PDF, you have to start at the
end to determine if there are any updates or not. Have a look at section
7.5.6 (Incremental Updates) for further details.
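To make "start at the end" concrete, here is a minimal sketch, not PDFBox code: it finds the last startxref keyword in a file held in memory and returns the byte offset that follows it. The class and method names (StartXrefLocator, findLastStartXref) are made up for illustration, and it assumes the whole file fits in a byte array. A real reader would then parse the XRef section at that offset and follow the trailer's /Prev entry back to the older sections.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
final class StartXrefLocator {
    private static final String KEYWORD = "startxref";

    /** Returns the offset written after the last "startxref" keyword, or -1 if none is found. */
    static long findLastStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int pos = text.lastIndexOf(KEYWORD);
        if (pos < 0) {
            return -1;
        }
        int i = pos + KEYWORD.length();
        // Skip the end-of-line characters between the keyword and the number.
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        // Collect the decimal digits of the offset.
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no number after it
        }
        return Long.parseLong(text.substring(start, i));
    }
}
```

For a file with one incremental update, this returns the offset of the newest XRef section, since the update's startxref is the last one in the file.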
> Anyone think looking at the COSObject will be any faster than rewriting
> the parser?
A few weeks ago I invested some time in incremental updates but I didn't find
a solution. I think our parser isn't that bad and it should be possible to improve
it to handle incremental updates.
> The documents I have are all confidential, so unfortunately I can't share
> them, but there are some other[1] issues[2] which seem to be somewhat
> related. I'm going to keep looking for a file I can get approved for
> release so I can upload it to JIRA with an exact stacktrace and
> everything.
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-569
> [2] https://issues.apache.org/jira/browse/PDFBOX-720
Especially [2] gives some good pointers.
BR
Andreas Lehmkühler
Re: Parsing issues & duplicate objects
Posted by Ad...@swmc.com.
Thanks for the response, Andreas. I read through the sections of the PDF
spec on the trailer and xref table, and I think I understand it all.
However, I'm coming across a situation where I don't know how it's
supposed to be parsed. It's a file which has been incrementally updated
one time. The xref table for the original document says that object 2,
generation 0 should be 130 bytes into the file. It is, and here's the
definition:
2 0 obj<</Type/Pages/Kids[3 0 R 4 0 R]/Count 17>>
endobj
However, the second xref table says that 2, 0 is at byte offset 775205.
This is also true and here's what I find at that location:
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R 4 0 R ]
/Count 17
/Parent 218 0 R
>>
endobj
Now, in this case, they're both Pages and the Kids and the Count match, so
it should not make much difference how it's handled, but my question
is: had they been different, how should that have been handled? I can say
for certain that the current PDFBox code takes the first one it finds
(technically, it is overwritten with the second one, but then
PDFParser::resolveConflicts() puts it back). I have a hunch the last one
is the one we really want, as that should have everything the old one had,
plus the latest revisions and possibly more items in the dictionary.
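Here is a small sketch of the "last one wins" behaviour I have in mind. It is not how PDFBox does it today; the class XrefMerger and the "objectNumber generation" string keys are just assumptions to keep the example short. Each xref section is modelled as a map from object key to byte offset, and the sections are applied oldest first so that entries from a later update replace earlier ones.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; not part of PDFBox.
final class XrefMerger {
    /**
     * Merges xref sections given oldest first. Keys look like "2 0"
     * (object number, generation); values are byte offsets into the file.
     * An entry in a later section replaces the same key from an earlier one.
     */
    static Map<String, Long> merge(List<Map<String, Long>> sectionsOldestFirst) {
        Map<String, Long> effective = new LinkedHashMap<>();
        for (Map<String, Long> section : sectionsOldestFirst) {
            effective.putAll(section);
        }
        return effective;
    }
}
```

With the file above, the original section maps "2 0" to 130 and the update maps it to 775205, so the merged table points at 775205 and the older definition is never read.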
I think the issue I'm facing is unrelated to the parser not starting at
the bottom and following the xref tables. I found out what is happening
with my file: the call to document.dereferenceObjectStreams() (line 207 of
PDFParser.java) overwrites the correct data with junk which doesn't
belong there. If I comment out this line, everything works fine. I
believe the proper solution will be to not overwrite existing nodes, but
only to add new ones. This is what is done when a duplicate ID is found
outside of a stream (technically it's overwritten and then restored by the
conflict resolver, but the effect is the same). I hope to have enough
time to make that change and test it out tomorrow. I'll make a JIRA issue
once I get to the bottom of it.
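Roughly, the change I'm planning looks like the sketch below. This is a simplified stand-in, not the actual PDFBox classes: ObjectPool and addIfAbsent are names I made up, and the objects are plain strings rather than COS objects. The point is only that an object coming out of an object stream gets added when its key is unused and is otherwise ignored.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the document's object pool; not part of PDFBox.
final class ObjectPool {
    private final Map<String, String> objects = new HashMap<>();

    /** Adds the object only if the key is unused; returns true when it was added. */
    boolean addIfAbsent(String key, String value) {
        // putIfAbsent returns null when there was no previous mapping.
        return objects.putIfAbsent(key, value) == null;
    }

    /** Returns the stored object for the key, or null if there is none. */
    String get(String key) {
        return objects.get(key);
    }
}
```

If "2 0" is already in the pool when the object streams are dereferenced, a second "2 0" from a stream is skipped and the existing data stays intact.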
----
Thanks,
Adam