Posted to dev@pdfbox.apache.org by "Timo Boehme (Created) (JIRA)" <ji...@apache.org> on 2011/11/10 18:32:51 UTC

[jira] [Created] (PDFBOX-1164) Inline image parsing error causes RuntimeException + FIX

Inline image parsing error causes RuntimeException + FIX
--------------------------------------------------------

                 Key: PDFBOX-1164
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1164
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.7.0
            Reporter: Timo Boehme


An inline image starts with the BI operator, followed by its parameters and the ID operator. The binary image data follows, terminated by the EI operator. The problem is detecting that EI operator. The current code in PDFStreamParser only requires EI to be surrounded by whitespace. However, I have a document in which the sequence EI, preceded by 0x09 and followed by 0x20, occurs inside the image data. PDFBox therefore wrongly assumes that the image data has ended, and the binary data that follows is interpreted as an operator. Parsing later fails with a RuntimeException (thrown from PDFStreamParser#getTokenIterator; this should be changed to throw IOException, and I will file another issue for that).
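
To illustrate the false positive described above, here is a minimal, self-contained sketch of a whitespace-only EI check. It is not the actual PDFStreamParser code; the class and method names (NaiveEiScan, findFirstEi) are hypothetical, and the whitespace set is the one defined by the PDF specification.

```java
// Hypothetical sketch, not PDFBox code: end-of-image detection that only
// requires "EI" to be surrounded by whitespace.
final class NaiveEiScan {

    // White-space characters as defined by the PDF specification.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Returns the index of the 'E' in the first "WS E I WS" sequence,
     * or -1 if there is none. Nothing distinguishes a real EI operator
     * from the same four bytes occurring inside binary image data.
     */
    static int findFirstEi(byte[] data) {
        for (int i = 1; i + 2 < data.length; i++) {
            if (data[i] == 'E' && data[i + 1] == 'I'
                    && isPdfWhitespace(data[i - 1] & 0xFF)
                    && isPdfWhitespace(data[i + 2] & 0xFF)) {
                return i;
            }
        }
        return -1;
    }
}
```

Given image data that contains 0x09 'E' 'I' 0x20 before the real terminator, findFirstEi returns the earlier position, so everything after it would be handed to the operator parser as if it were content stream text.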

Earlier versions worked around this problem with a heuristic that checked the expected byte count of the image; however, it was disabled because the image data may also be compressed.

To fix the problem I have added a test on the X bytes (with X=5) that follow the 'WS EI WS' sequence. For EI to be treated as an operator, all of these bytes must be printable ASCII characters, because the operator can only be followed by further PDF operators. If 5 bytes turn out to be too many, because a comment containing non-ASCII characters could follow, X could be reduced to 3, which should be enough in most cases.
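
The lookahead test can be sketched roughly as follows. This is an illustration only and not the attached diff: the names (EiLookaheadSketch, findEi, looksLikeText) are hypothetical. Two points are my assumptions rather than part of the description above: tab, LF and CR are accepted alongside printable ASCII, since an operator is commonly followed by a line break, and reaching the end of the data before X bytes have been read counts as acceptable.

```java
// Hypothetical sketch of the lookahead heuristic, not the attached patch.
final class EiLookaheadSketch {

    // White-space characters as defined by the PDF specification.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Assumption: printable ASCII plus tab, LF and CR count as "text".
    static boolean looksLikeText(int b) {
        return (b >= 0x20 && b <= 0x7E)
            || b == 0x09 || b == 0x0A || b == 0x0D;
    }

    /**
     * Returns the index of the 'E' in the first "WS E I WS" sequence whose
     * next `lookahead` bytes (or fewer, at the end of the data) all look
     * like text; returns -1 if no such sequence exists.
     */
    static int findEi(byte[] data, int lookahead) {
        for (int i = 1; i + 2 < data.length; i++) {
            if (data[i] != 'E' || data[i + 1] != 'I'
                    || !isPdfWhitespace(data[i - 1] & 0xFF)
                    || !isPdfWhitespace(data[i + 2] & 0xFF)) {
                continue;
            }
            int start = i + 3;                              // first byte after "EI WS"
            int end = Math.min(start + lookahead, data.length);
            boolean textOnly = true;
            for (int j = start; j < end; j++) {
                if (!looksLikeText(data[j] & 0xFF)) {
                    textOnly = false;                       // binary data follows
                    break;
                }
            }
            if (textOnly) {
                return i;
            }
        }
        return -1;
    }
}
```

With lookahead set to 5, a candidate EI that is immediately followed by a byte such as 0xFF is skipped and the scan continues to the next candidate. The check remains a heuristic: binary data that happens to contain 'WS EI WS' followed by five text-like bytes would still be misdetected.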

A diff of the fix is attached to this issue.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PDFBOX-1164) Inline image parsing error causes RuntimeException + FIX

Posted by "Timo Boehme (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-1164:
--------------------------------

    Attachment: PDFStreamParser.diff

Fix for this issue.
                
