Posted to dev@pdfbox.apache.org by "Adam Nichols (JIRA)" <ji...@apache.org> on 2010/08/24 00:12:22 UTC

[jira] Created: (PDFBOX-798) Better handle out of spec PDFs

Better handle out of spec PDFs
------------------------------

                 Key: PDFBOX-798
                 URL: https://issues.apache.org/jira/browse/PDFBOX-798
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
         Environment: 32-bit Windows Vista, Java 1.5, HEAD tag of PDFBox
            Reporter: Adam Nichols
            Assignee: Adam Nichols
             Fix For: 1.3.0


I came across another out-of-spec issue which causes PDFBox to crash.  Here's the object:
5 0 obj
<</Type /Page
/Parent 6 0 R
/MediaBox [ 0 0 610.560 783.360
endstream
endobj

There are numerous issues here.  The MediaBox array has no closing right square bracket, there's no ">>" to end the dictionary, and there's an "endstream" stuck in there for no apparent reason.  I found this in the wild, but I don't know whether it comes from a bug in the creating program, data corruption, or something else.  I do know that Adobe Reader parses it without crashing.  Since this is not a conforming PDF, the result is undefined, so crashing (which is what PDFBox eventually does when it tries to process the next object in the file) is a perfectly acceptable thing to do.

However, I'd like PDFBox to detect that the array is complete when it sees endstream, then ignore the rogue endstream, and then recognize that the object has ended when it sees "endobj".  I'm going to go one step further and also accept the same object even if endstream or endobj is missing.  In addition to the above object, I also tested these objects:

% endobj, without the endstream
5 0 obj
<</Type /Page
/Parent 6 0 R
/MediaBox [ 0 0 610.560 783.360
endobj

% endstream, without the endobj
5 0 obj
<</Type /Page
/Parent 6 0 R
/MediaBox [ 0 0 610.560 783.360
endstream

% properly ended array, dictionary and object (aka conforming PDF)
5 0 obj
<</Type /Page
/Parent 6 0 R
/MediaBox [ 0 0 610.560 783.360 ]
>>
endobj
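
As a rough illustration of the behaviour described above (not the attached patch), the sketch below reads the numbers of an array and treats "]", "endstream", "endobj", or end of input as the end of the array.  The class and method names are hypothetical, it works on an in-memory string rather than on PDFBox's input stream, and it only handles numeric entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not PDFBox code: tolerate an array with no closing ']'.
class LenientArraySketch {

    /** Reads numbers starting just after '[' until ']', endstream, endobj or end of input. */
    static List<Double> parseNumberArray(String src, int start) {
        List<Double> values = new ArrayList<Double>();
        int pos = start;
        while (pos < src.length()) {
            // Skip whitespace between array entries.
            while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
                pos++;
            }
            if (pos >= src.length()) {
                break; // end of input also ends the array
            }
            if (src.charAt(pos) == ']'
                    || src.startsWith("endstream", pos)
                    || src.startsWith("endobj", pos)) {
                break; // proper end, or a keyword that cannot belong to the array
            }
            // Collect one token up to the next whitespace or ']'.
            int end = pos;
            while (end < src.length()
                    && !Character.isWhitespace(src.charAt(end))
                    && src.charAt(end) != ']') {
                end++;
            }
            values.add(Double.parseDouble(src.substring(pos, end)));
            pos = end;
        }
        return values;
    }

    public static void main(String[] args) {
        // The broken MediaBox from the report, starting just after '['.
        String broken = "[ 0 0 610.560 783.360\nendstream\nendobj";
        System.out.println(parseNumberArray(broken, 1));
    }
}
```

All four test objects above yield the same four MediaBox values with this approach; a real parser would also have to handle non-numeric array entries.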


Although this change will only affect PDFs which do not conform to the spec, I want to put the patch up for review before committing it to SVN since it is a modification to BaseParser.java.  If I do not hear any objections/concerns in the next few days, I'll go ahead and commit it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PDFBOX-798) Better handle out of spec PDFs

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols updated PDFBOX-798:
--------------------------------

    Issue Type: Improvement  (was: Bug)

[jira] Commented: (PDFBOX-798) Better handle out of spec PDFs

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901739#action_12901739 ] 

Andreas Lehmkühler commented on PDFBOX-798:
-------------------------------------------

What about using BaseParser.readUntilEndStream()? It looks similar to your changes.

[jira] Resolved: (PDFBOX-798) Better handle out of spec PDFs

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols resolved PDFBOX-798.
---------------------------------

    Resolution: Fixed

Committed in revision 989843

[jira] Updated: (PDFBOX-798) Better handle out of spec PDFs

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols updated PDFBOX-798:
--------------------------------

    Attachment: PDFBOX-798.patch

If anyone has a better idea than a series of nested IFs, please let me know.  I just didn't want to read the whole line in case it's not end[obj|stream].
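
One possible alternative to nested IFs, sketched under assumptions: read just enough bytes to compare against each keyword, then push them back so that nothing is consumed.  This uses the standard java.io.PushbackInputStream; the class and method names are hypothetical and this is not the code in the attached patch.  The pushback buffer has to be at least as long as the longest keyword.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not PDFBox code: peek for a keyword without consuming input.
class KeywordPeekSketch {

    // Must be at least as long as the longest keyword ("endstream" is 9 bytes).
    static final int PUSHBACK_SIZE = 16;

    /** Returns true if the upcoming bytes equal the keyword; the stream position is unchanged. */
    static boolean startsWith(PushbackInputStream in, String keyword) throws IOException {
        byte[] expected = keyword.getBytes(StandardCharsets.ISO_8859_1);
        byte[] actual = new byte[expected.length];
        int count = 0;
        while (count < actual.length) {
            int n = in.read(actual, count, actual.length - count);
            if (n < 0) {
                break; // end of input before the keyword was complete
            }
            count += n;
        }
        if (count > 0) {
            in.unread(actual, 0, count); // push everything back, match or not
        }
        return count == expected.length && Arrays.equals(expected, actual);
    }

    /** True if the stream is positioned at "endstream" or "endobj". */
    static boolean atEndMarker(PushbackInputStream in) throws IOException {
        return startsWith(in, "endstream") || startsWith(in, "endobj");
    }

    /** Convenience overload for in-memory text, used for the demo below. */
    static boolean atEndMarker(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        try {
            return atEndMarker(
                    new PushbackInputStream(new ByteArrayInputStream(bytes), PUSHBACK_SIZE));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(atEndMarker("endobj\n"));
        System.out.println(atEndMarker("/Rotate 90"));
    }
}
```

Because the bytes are always pushed back, the caller can still read the rest of the line normally when the text turns out to be something other than endobj or endstream.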

[jira] Commented: (PDFBOX-798) Better handle out of spec PDFs

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901967#action_12901967 ] 

Adam Nichols commented on PDFBOX-798:
-------------------------------------

Thanks for pointing out readUntilEndStream(); I didn't know about it.  The main problem with readUntilEndStream() is that it stops only when it gets to the end of the file or finds endstream or endobj.  I want to stop in those cases, but also if a '/' or '>' is found.  A '/' means we have more dictionary items to read.  Once we hit a '>', we know we're at the end of the dictionary, and there may be a stream after the dictionary which we would otherwise inadvertently discard.  However, I'll take some lessons from readUntilEndStream() and use the constants instead of the letters.
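
To make the stopping conditions concrete, here is a small sketch (hypothetical names, operating on a string instead of the parser's stream, and not the committed code) that skips forward from a damaged array and reports which of the conditions above ended the scan.

```java
// Hypothetical sketch, not PDFBox code: find out why a recovery scan should stop.
class RecoveryScanSketch {

    enum Stop { NAME, DICT_END, ENDSTREAM, ENDOBJ, EOF }

    /** Scans forward from the given position and returns the first stop condition found. */
    static Stop scan(String src, int from) {
        for (int pos = from; pos < src.length(); pos++) {
            char c = src.charAt(pos);
            if (c == '/') {
                return Stop.NAME;      // more dictionary entries follow
            }
            if (c == '>') {
                return Stop.DICT_END;  // dictionary ends; a stream may follow
            }
            if (src.startsWith("endstream", pos)) {
                return Stop.ENDSTREAM;
            }
            if (src.startsWith("endobj", pos)) {
                return Stop.ENDOBJ;
            }
        }
        return Stop.EOF;
    }

    public static void main(String[] args) {
        System.out.println(scan(" 610.560 783.360\nendstream\nendobj", 0));
        System.out.println(scan(" 610.560 783.360 /Rotate 90 >>", 0));
    }
}
```

Stopping at '/' or '>' leaves the rest of the dictionary, and any stream after it, for the normal parsing path to handle.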
