Posted to dev@pdfbox.apache.org by Martin Tappler <ma...@gmail.com> on 2014/08/07 11:44:24 UTC

Support for hybrid-references (also discussed in: PDFBox and implementation, state reg. PDF Spec)

Hi,

I am looking at PDF files at COS level and I found that the current
implementation of the non-sequential parser does not provide support for
hybrid cross references (which was discussed before in this mailing list).

Look at the PDF file at [1] for instance. It contains a structure tree,
which is hidden in the hybrid-reference file (actually such an example
is also described in the PDF reference section 3.4.7. under
"Compatibility with applications that do not support PDF 1.5."). The
root of the structure tree is the object with object number 28 and
generation number 0 and is contained in an object stream, which is only
referenced in the cross reference stream, which is not parsed by the
current implementation.
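
As a rough illustration of where the pointer to the hidden cross
reference stream lives: in a hybrid-reference file the trailer that
follows the classic table carries an extra key (spelled /XRefStm in the
PDF reference) whose value is the byte offset of the stream. The sketch
below pulls that offset out of a trailer dictionary given as text. The
class name, the method name and the sample trailer are made up for
illustration; this is not PDFBox code, and a real parser would read the
dictionary with its COS parser rather than with a regular expression.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class TrailerScanSketch {

    // The key name followed by whitespace and a non-negative integer offset.
    private static final Pattern XREF_STM = Pattern.compile("/XRefStm\\s+(\\d+)");

    /** Returns the byte offset stored under /XRefStm, or empty if the key is absent. */
    static OptionalLong findXRefStm(String trailerText) {
        Matcher m = XREF_STM.matcher(trailerText);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1)))
                        : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Invented trailer; the offsets are placeholders, not taken from the file at [1].
        String trailer = "<< /Size 30 /Root 1 0 R /Prev 1234 /XRefStm 5678 >>";
        System.out.println(findXRefStm(trailer));
    }
}
```

A parser that never looks for this key stops at the table, which matches
the behavior described above: objects listed only in the stream are
never found.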

I used version 1.8.6 from the Maven repository and also the latest
source from trunk to reproduce this behavior.

However, I came up with a fix which works for me and which should not
break anything. After parsing the cross reference table and the trailer,
the trailer should be checked for an "XRefStm" entry. If this entry is
present, the stream at the given offset should be parsed using
parseXrefObjStream, but with the offset of the cross reference table as
argument (this ensures that the resolving process works as expected).
Parsing the stream replaces the information just parsed (table and
trailer) in the XrefTrailerResolver, so that information should first be
saved in temporary variables. Afterwards, the information from the cross
reference stream is updated with the saved trailer and cross reference
table information. According to the PDF spec, this should not be needed,
but it makes the parsing more robust, since there might be files that
store information in the table but not in the stream. This ensures
that no information is lost.
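
The steps above can be sketched roughly as follows. This is a
simplified, self-contained model and not the attached patch: the class
HybridXrefSketch, the Section type, the resolve method and the sample
values are all invented for illustration, and locations are plain
strings instead of real xref entries. It assumes the table map holds
only in-use entries, and it lets table entries overwrite stream entries
on conflict, which is one reading of "updated with the old trailer and
the cross reference table information".

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the proposal; these are not the PDFBox classes.
final class HybridXrefSketch {

    /** One parsed cross reference section: trailer entries plus object number -> location. */
    static final class Section {
        final Map<String, Object> trailer = new LinkedHashMap<>();
        final Map<Integer, String> locations = new LinkedHashMap<>();
    }

    /**
     * Checks the table's trailer for XRefStm, loads the stream section found at
     * that offset, then updates it with the saved table and trailer information.
     */
    static Section resolve(Section table, Map<Long, Section> streamsByOffset) {
        Object xrefStm = table.trailer.get("XRefStm");
        if (!(xrefStm instanceof Long)) {
            return table; // not a hybrid-reference section
        }
        Section stream = streamsByOffset.get((Long) xrefStm);
        if (stream == null) {
            return table; // nothing parseable at that offset
        }
        Section merged = new Section();
        // Start from the stream ...
        merged.trailer.putAll(stream.trailer);
        merged.locations.putAll(stream.locations);
        // ... then update with the saved table data (assumption: table wins on conflict).
        merged.trailer.putAll(table.trailer);
        merged.locations.putAll(table.locations);
        return merged;
    }

    public static void main(String[] args) {
        Section table = new Section();
        table.trailer.put("Root", "1 0 R");
        table.trailer.put("XRefStm", 5000L);            // invented offset
        table.locations.put(1, "byte offset 15");       // catalog, only in the table

        Section stream = new Section();
        stream.locations.put(28, "object stream 30, index 0"); // invented stream number

        Map<Long, Section> streams = new HashMap<>();
        streams.put(5000L, stream);

        System.out.println(resolve(table, streams).locations);
    }
}
```

After the merge both object 1 (from the table) and object 28 (from the
stream) can be looked up, which is the "no information is lost"
property described above.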

Please find patches for the fix attached. I hope they are useful.

Best regards,
Martin Tappler

[1] http://bewerbung.fh-kaernten.at/fileadmin/Anleitung-PDF-erstellen.pdf

Re: Support for hybrid-references (also discussed in: PDFBox and implementation, state reg. PDF Spec)

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 08.08.2014 08:57, Martin Tappler wrote:
> On 08/07/2014 06:57 PM, John Hewson wrote:
>> Thanks, that's very useful!
> Great!
>>
>> My only concern would be:
>>
>>> According to the PDF spec, this should not be needed,
>>> but makes the parsing more robust, since there might be files, which
>>> store information in the table, but not in the stream.
>>
>> We generally don't implement hypothetical aspects of non-spec PDFs without an example file.
>
> Using my test files, I just checked if it is really unnecessary, i.e. if
> parsing without adding cross reference table information leads to the
> same result. It is actually necessary, see for instance the PDF file at
> [1]. Since it seems to be from Adobe, I most probably have
> misinterpreted the PDF spec, sorry. The PDF file is linearized and
> contains an XRefStm entry in the first trailer, and the corresponding
> cross reference stream contains only references to objects inside an
> object stream, while the catalog reference is only contained in the
> cross reference table.
You were right in the first place. The cross reference stream may contain some
data which is already present in the xref table, but there could be some data
which is only available through the stream, e.g. the offsets of objects within
object streams.
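
To make the "only available through the stream" part concrete, here is a
minimal sketch of how cross reference stream entries are laid out
according to the PDF reference: each row has three fields whose byte
widths come from the /W array, and the first field is the entry type
(0 = free, 1 = uncompressed object at a byte offset, 2 = object stored
inside an object stream, given as stream number plus index). Only type 2
rows can describe compressed objects, and a classic xref table has no
equivalent. The class and method names and the sample bytes are
hypothetical; this is not the PDFBox implementation, and it assumes the
stream data has already been decoded (for example inflated).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder for illustration only; not part of PDFBox.
final class XrefStreamEntriesSketch {

    /** Reads a big-endian unsigned integer of the given width; width 0 yields 0. */
    static long readField(byte[] data, int pos, int width) {
        long value = 0;
        for (int i = 0; i < width; i++) {
            value = (value << 8) | (data[pos + i] & 0xFF);
        }
        return value;
    }

    /** Splits decoded stream data into rows of {type, field2, field3}. */
    static List<long[]> decode(byte[] data, int[] w) {
        int rowLength = w[0] + w[1] + w[2];
        if (rowLength <= 0) {
            throw new IllegalArgumentException("/W must describe at least one byte per row");
        }
        List<long[]> rows = new ArrayList<>();
        for (int pos = 0; pos + rowLength <= data.length; pos += rowLength) {
            // A zero-width type field means the type defaults to 1.
            long type = (w[0] == 0) ? 1 : readField(data, pos, w[0]);
            long field2 = readField(data, pos + w[0], w[1]);
            long field3 = readField(data, pos + w[0] + w[1], w[2]);
            rows.add(new long[] {type, field2, field3});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented data for /W [1 2 1]:
        //   row 1: type 2, object stream number 30, index 0
        //   row 2: type 1, byte offset 1000, generation 0
        byte[] data = {0x02, 0x00, 0x1E, 0x00, 0x01, 0x03, (byte) 0xE8, 0x00};
        for (long[] row : decode(data, new int[] {1, 2, 1})) {
            System.out.println(row[0] + " " + row[1] + " " + row[2]);
        }
    }
}
```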

I've added support for hybrid-references and we'll see if there is some room
for improvements regarding redundant reading of object ids.

@Martin: Thanks for the sample and the input.

> Best regards,
> Martin

BR
Andreas Lehmkühler

>
> [1]
> http://partners.adobe.com/public/developer/en/xml/AdobeXMLFormsSamples.pdf
>
>>
>> -- John

Re: Support for hybrid-references (also discussed in: PDFBox and implementation, state reg. PDF Spec)

Posted by Martin Tappler <ma...@gmail.com>.

On 08/07/2014 06:57 PM, John Hewson wrote:
> Thanks, that's very useful!
Great!
> 
> My only concern would be:
> 
>> According to the PDF spec, this should not be needed,
>> but makes the parsing more robust, since there might be files, which
>> store information in the table, but not in the stream.
> 
> We generally don't implement hypothetical aspects of non-spec PDFs without an example file.

Using my test files, I just checked if it is really unnecessary, i.e. if
parsing without adding cross reference table information leads to the
same result. It is actually necessary, see for instance the PDF file at
[1]. Since it seems to be from Adobe, I most probably have
misinterpreted the PDF spec, sorry. The PDF file is linearized and
contains an XRefStm entry in the first trailer, and the corresponding
cross reference stream contains only references to objects inside an
object stream, while the catalog reference is only contained in the
cross reference table.

Best regards,
Martin

[1]
http://partners.adobe.com/public/developer/en/xml/AdobeXMLFormsSamples.pdf

> 
> -- John
> 

Re: Support for hybrid-references (also discussed in: PDFBox and implementation, state reg. PDF Spec)

Posted by John Hewson <jo...@jahewson.com>.

Thanks, that's very useful!

My only concern would be:

> According to the PDF spec, this should not be needed,
> but makes the parsing more robust, since there might be files, which
> store information in the table, but not in the stream.

We generally don't implement hypothetical aspects of non-spec PDFs without an example file.

-- John


Re: Support for hybrid-references (also discussed in: PDFBox and implementation, state reg. PDF Spec)

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi

> Martin Tappler <ma...@gmail.com> wrote on 7 August 2014 at 11:44:
>
> Please find patches for the fix attached. I hope they are useful.
Thanks for the contribution!

I haven't had a deeper look yet, but yes, I guess it'll be useful, as I
already stumbled upon that missing feature in conjunction with
PDFBOX-2250 [1].

That said, I'll take care of it.

> Best regards,
> Martin Tappler

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-2250