Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2014/07/26 17:01:45 UTC

[jira] [Updated] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed

     [ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2163:
------------------------------------

    Attachment: PDFBOX-2163-029016.pdf

The attached file has this:
{code}
EI<NL>DB'Z[<TAB>8F 
{code}
so the part after EI was classified as "not binary" and this EI was wrongly taken as the end of the inline image. I have improved the code once again: the lookahead that is tested for "not binary" (which I have set to 10 bytes now) must contain, after the EI, space characters followed by a run of 1-3 non-space characters. This is probably still not the end of it; the next step would be to require that the non-space character sequence be a valid PDF operator. The change was committed in rev 1613645 for the trunk and rev 1613646 for the 1.8 branch.
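
For readers who want to see the rule spelled out, here is a minimal sketch of the check described above. It is not the code from rev 1613645 or rev 1613646: the class name, the method name, the byte-array interface, the choice of whitespace bytes, and the requirement that EI be directly followed by whitespace are all assumptions made for illustration.

```java
// Hypothetical illustration only; not the code from rev 1613645/1613646.
class InlineImageEndCheck {
    // Number of bytes inspected after "EI" (10, as stated in the message).
    static final int LOOKAHEAD = 10;

    // Whitespace subset used in this sketch: TAB, LF, CR, space.
    static boolean isWhitespace(int b) {
        return b == 0x09 || b == 0x0A || b == 0x0D || b == 0x20;
    }

    // data: content stream bytes; afterEi: index of the first byte after "EI".
    // Returns true if the bytes after "EI" look like stream text, not image data.
    static boolean looksLikeRealEnd(byte[] data, int afterEi) {
        int end = Math.min(data.length, afterEi + LOOKAHEAD);

        // Step 1: any control byte or byte above 0x7F means binary data follows.
        for (int k = afterEi; k < end; k++) {
            int b = data[k] & 0xFF;
            if (b > 0x7F || (b < 0x20 && !isWhitespace(b))) {
                return false;
            }
        }

        // Step 2: skip the whitespace that should follow "EI".
        int i = afterEi;
        while (i < end && isWhitespace(data[i] & 0xFF)) {
            i++;
        }
        if (i == end) {
            return true; // nothing but whitespace (or end of data) in the window
        }
        if (i == afterEi) {
            return false; // assumption: "EI" must be followed by whitespace
        }

        // Step 3: the next run of non-space characters must be at most 3 bytes
        // long (it is at least 1 byte long, because step 2 stopped on it).
        int start = i;
        while (i < end && !isWhitespace(data[i] & 0xFF)) {
            i++;
        }
        return i - start <= 3;
    }
}
```

With the bytes from the attached file, EI<NL>DB'Z[<TAB>8F, the run DB'Z[ is five characters long, so this sketch rejects the EI. For a genuine end such as "EI Q" the run Q is one character long and the EI is accepted. A run that is cut off by the end of the 10-byte window is accepted if the visible part is at most 3 bytes, which is a simplification.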

> inline image with EI in the middle incorrectly parsed
> -----------------------------------------------------
>
>                 Key: PDFBOX-2163
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2163
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.6, 1.8.7, 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>              Labels: inline
>             Fix For: 1.8.7, 2.0.0
>
>         Attachments: PDFBOX-2163-029016.pdf
>
>
> This PDF
> http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf
> causes an exception because the end of an inline image is detected incorrectly. The stream looks like this:
> {code}
> BI
>   /W 452
>   /H 169
>   /BPC 8
>   /CS /RGB
>   /D [0.0 1.0 0.0 1.0 0.0 1.0]
>   /F [/A85 /Fl]
> ID
> ......................................................
> ....................................................EI
> ......................................................
> ...
> ....
> EI Q
> {code}
> Inline images are handled in PDFStreamParser. This is tricky: we check whether binary data follows an EI to detect an EI in the middle of the image, but here what follows is not binary data, it is ASCII85-encoded text. We also can't require an LF before the EI, because I remember a PDF at work, created by a well-known company, that doesn't have one.



--
This message was sent by Atlassian JIRA
(v6.2#6252)