Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2011/06/23 19:21:47 UTC

[jira] [Issue Comment Edited] (PDFBOX-1016) Specification conform xref/trailer parsing + Fix

    [ https://issues.apache.org/jira/browse/PDFBOX-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053982#comment-13053982 ] 

Andreas Lehmkühler edited comment on PDFBOX-1016 at 6/23/11 5:20 PM:
---------------------------------------------------------------------

I added the patch as proposed in revision 1138995, but I had to add some small changes and an improvement:

- reformatted some code
- added some documentation
- added some Java 5 stuff (generics)
- using more COSName constants
- fixed PDFParser to avoid an NPE when parsing documents that contain unreadable content after %%EOF (the original patch broke our extraction test: data-000001.pdf)

@Timo
Thanks for the contribution

@Thomas
Thanks for the additional test

> Specification conform xref/trailer parsing + Fix
> ------------------------------------------------
>
>                 Key: PDFBOX-1016
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1016
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.6.0
>
>         Attachments: COSDocument.diff, PDFParser.diff, PDFXrefStreamParser.diff, XrefTrailerResolver.java, XrefTrailerResolver.java
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> PDFBox currently reads xref tables/trailers and XRef objects without using the startxref or 'Prev' information. As a result, data that is no longer active gets applied, which leads to wrong objects being used or to parsing exceptions because old trailer settings no longer apply. This happens especially with updated PDF documents, where changes are simply appended and old objects/xref entries remain but are no longer referenced. My last patch (PDFBOX-1014) tried to solve this for a specific case, but it was based on assumptions that do not hold in every case.
> The specification-compliant way is to read the last startxref, which points to the last xref object, which itself may reference further xref objects using the 'Prev' attribute.
> I have written a fix that works the standard way and can fall back to the old behavior if startxref is wrong or missing. The fix tries to be as unobtrusive as possible. A new class (o.a.p.pdfparser.XrefTrailerResolver) is filled with all xref table/trailer and XRef object data. After the document is parsed (and the last startxref is read), this class creates the xref table and trailer using the startxref and 'Prev' information. Besides this new class, there are small changes to PDFParser and COSDocument.
> This bugfix/improvement should bring PDFBox a good step closer to conforming to the PDF specification, especially as long as the new specification-conforming parser project is not finished.
> This bugfix supersedes the fix from PDFBOX-1014.
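
For readers following the description above, the resolution step can be sketched roughly as follows. This is only an illustration of the idea (collect every xref section during parsing, then follow startxref and 'Prev' to decide which entries are active). It is not the attached XrefTrailerResolver: the class name XrefChainSketch, its methods and the exact fallback order are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch; not the XrefTrailerResolver attached to this issue. */
class XrefChainSketch {

    /** One xref section: object number -> byte offset, plus an optional 'Prev'. */
    static final class Section {
        final Map<Integer, Long> entries = new TreeMap<Integer, Long>();
        Long prev; // byte offset of the previous xref section, or null if none
    }

    // Every section seen while parsing, keyed by its byte offset in the file.
    private final TreeMap<Long, Section> sections = new TreeMap<Long, Section>();

    /** Returns the section recorded at the given offset, creating it if needed. */
    Section sectionAt(long offset) {
        Section s = sections.get(offset);
        if (s == null) {
            s = new Section();
            sections.put(offset, s);
        }
        return s;
    }

    /** Builds the active xref table once the last startxref value is known. */
    Map<Integer, Long> resolve(Long startxref) {
        Deque<Section> chain = new ArrayDeque<Section>();
        Set<Long> seen = new HashSet<Long>(); // guards against 'Prev' cycles
        Long pos = startxref;
        while (pos != null && sections.containsKey(pos) && seen.add(pos)) {
            Section s = sections.get(pos);
            chain.push(s); // older sections end up in front of newer ones
            pos = s.prev;
        }
        // Assumed fallback: if startxref is missing or points nowhere,
        // apply all sections in file order.
        Iterable<Section> order = chain.isEmpty() ? sections.values() : chain;
        Map<Integer, Long> active = new TreeMap<Integer, Long>();
        for (Section s : order) {
            active.putAll(s.entries); // later (newer) sections override earlier ones
        }
        return active;
    }
}
```

With this layout, a section that is still present in the file but not reachable from the last startxref through 'Prev' contributes nothing to the resolved table. It only comes into play in the fallback path.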

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira