Posted to dev@pdfbox.apache.org by "Adam Nichols (JIRA)" <ji...@apache.org> on 2010/08/26 19:49:56 UTC

[jira] Resolved: (PDFBOX-798) Better handle out of spec PDFs

     [ https://issues.apache.org/jira/browse/PDFBOX-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols resolved PDFBOX-798.
---------------------------------

    Resolution: Fixed

Committed in revision 989843

> Better handle out of spec PDFs
> ------------------------------
>
>                 Key: PDFBOX-798
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-798
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>         Environment: 32-bit Windows Vista, Java 1.5, HEAD tag of PDFBox
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>             Fix For: 1.3.0
>
>         Attachments: PDFBOX-798.patch
>
>
> I came across another out-of-spec issue which causes PDFBox to crash.  Here's the object:
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endstream
> endobj
> There are several issues here.  The MediaBox array has no closing right square bracket, there's no ">>" to end the dictionary, and there's an "endstream" stuck in there for no apparent reason.  I found this in the wild, but I do not know whether it's a bug in the creation program, data corruption, or something else.  I do know that Adobe Reader parses it without crashing.  Since this is not a conforming PDF, the result is undefined, so crashing (which is what PDFBox eventually does when it tries to process the next object in the file) is technically acceptable behavior.
> However, I'd like PDFBox to detect that the array has ended when it sees "endstream", ignore the rogue "endstream", and recognize that the object has ended when it sees "endobj".  I'm going to go one step further and also accept the same object even if "endstream" or "endobj" is missing.  In addition to the above object, I also tested it with these objects:
> % endobj, without the endstream
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endobj
> % endstream, without the endobj
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endstream
> % properly ended array, dictionary and object (aka conforming PDF)
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360 ]
> >>
> endobj
> Although this change will only affect PDFs which do not conform to the spec, I want to put the patch up for review before committing it to SVN since it is a modification to BaseParser.java.  If I do not hear any objections/concerns in the next few days, I'll go ahead and commit it.
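
The lenient behavior described above (treat "endstream" or "endobj" as the end of an unclosed array) can be illustrated with a small standalone sketch. This is not the code from PDFBOX-798.patch or BaseParser.java; the class LenientArraySketch and its methods tokenize and readArray are hypothetical names invented for this example, and the whitespace-based tokenizer is a simplifying assumption that only handles objects shaped like the ones quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT the actual PDFBox BaseParser logic.
class LenientArraySketch {

    /** Splits an object body on whitespace, keeping "[" and "]" as separate tokens. */
    static List<String> tokenize(String body) {
        String spaced = body.replace("[", " [ ").replace("]", " ] ");
        List<String> tokens = new ArrayList<>();
        for (String t : spaced.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Returns the elements of the first array in the body. The array ends at "]"
     * (conforming case), at a stray "endstream" or "endobj" (lenient case), or at
     * the end of the input if none of those appear.
     */
    static List<String> readArray(String body) {
        List<String> tokens = tokenize(body);
        List<String> items = new ArrayList<>();
        int start = tokens.indexOf("[");
        if (start < 0) {
            return items; // no array in this object
        }
        for (int i = start + 1; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("]")) {
                break; // properly closed array
            }
            if (t.equals("endstream") || t.equals("endobj")) {
                break; // out-of-spec: keyword seen before the closing bracket
            }
            items.add(t);
        }
        return items;
    }

    public static void main(String[] args) {
        String malformed = "<</Type /Page\n/Parent 6 0 R\n"
                + "/MediaBox [ 0 0 610.560 783.360\nendstream\nendobj";
        System.out.println(readArray(malformed));
    }
}
```

Under these assumptions, all four test objects from the description yield the same four MediaBox values, whether the array is closed by "]", by "endstream", or by "endobj". A real parser works on a byte stream and must also handle nested arrays, strings, and comments, which this sketch deliberately ignores.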

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.