Posted to dev@pdfbox.apache.org by "martijn.list" <ma...@gmail.com> on 2010/11/24 21:51:05 UTC

Image data sometimes contains EI. This however should not end the image data.

I have spent some time investigating why PDFBox fails to parse the PDF
from  https://issues.apache.org/jira/browse/PDFBOX-789.

After a long debugging session I finally found what's wrong with the PDF
(well at least with one part).

The PDF contains multiple inline images. In one of the images, the
image data (the data after the ID token) contains the bytes EI, which
is also the end token. This first EI, however, is part of the image and
should not end the image data.

The following snippet shows that there is an EI which is part of the
image data:

BI
/CS/RGB
/W 795
/H 1
/BPC 8
/F/Fl
/DP<</Predictor 15
/Columns 795
/Colors 3>>
ID x<9c>í<95>ÁmÃ0^LEI ...<-- the 'wrong' EI
EI Q <-- the correct EI
q 795 0 0 1 1863 3028.67 cm
BI

The correct EI is preceded by a newline (0x0A) and followed by a space
(0x20), i.e. <0x0A>EI<0x20>. The wrong EI is preceded by a form feed
(0x0C) and followed by a newline, i.e. <0x0C>EI<0x0A>.

PDFStreamParser already notes that the PDF spec is not really clear:

"PDF spec is kinda unclear about this.  Should a whitespace always
appear before EI?"

Is there something we can do to make this more robust? Should the EI
always end with a space?

Kind regards,

Martijn Brinkers

Re: Image data sometimes contains EI. This however should not end the image data.

Posted by Jukka Zitting <jz...@adobe.com>.

Hi,

On 24/11/10 21:51, martijn.list wrote:
> Is there something we can do to make this more robust?
> Should the EI always end with a space?

I reached out within Adobe to ask about this, and was pointed to 
http://forums.adobe.com/community/design_development/pdf_language_and_specifications 
as a good place to ask questions like this.

The response on this particular issue was that such cases should never occur
in the first place. If they do, it should be possible to detect a "real"
EI by checking if the next byte is whitespace.

Not sure how that advice applies here, as it looks like both EIs in this 
case are followed by whitespace, one by a newline and the other by a space.
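
To illustrate why the next-byte check alone does not help here, the
following is a minimal Java sketch. It is not PDFBox code: the class and
method names are made up for illustration, and the byte array is only a
stand-in for the real image data (form feed before the first EI, newline
before the second).

```java
final class NextByteCheckDemo {
    /** Whitespace bytes as listed in ISO 32000-1, Table 1. */
    static boolean isWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /** Index of the first "EI" whose following byte is whitespace, or -1. */
    static int firstEiFollowedByWhitespace(byte[] data) {
        for (int i = 0; i + 2 < data.length; i++) {
            if (data[i] == 'E' && data[i + 1] == 'I'
                    && isWhitespace(data[i + 2] & 0xFF)) {
                return i;
            }
        }
        return -1;
    }

    /** Hypothetical stand-in: x <FF> EI <LF> ... <LF> EI <SP> Q */
    static byte[] sample() {
        return new byte[] {'x', 0x0C, 'E', 'I', 0x0A, '.', '.', '.',
                           0x0A, 'E', 'I', 0x20, 'Q'};
    }
}
```

On the stand-in bytes this returns offset 2, the EI preceded by the form
feed, rather than offset 9 where the intended EI starts.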

BR,

Jukka Zitting

Re: Image data sometimes contains EI. This however should not end the image data.

Posted by Ad...@swmc.com.

By that logic, what would you do if the last few bytes were
<0x0D><0x0A>EI<WS>? Would you consider the delimiter to be
<0x0D><0x0A>EI<WS>, or <0x0A>EI<WS> (with the last byte of the image data
being <0x0D>)? I remember reading somewhere else in the spec that it
limited newlines to being <0x0D> or <0x0A> but not <0x0D><0x0A>, because
it'd be impossible to tell whether the <0x0A> is the first byte of data or
part of the delimiter. I think that was about the beginning of stream data,
if I recall correctly. It's basically the same issue here, just in
reverse, since we're at the end.

Since the spec is unclear, Adobe thinks it will never happen, and I like
taking the simplest approach, I say we stick to looking for <0x0D|0x0A>EI<WS>.
Good catch on me forgetting about the Mac newlines though. I'm so
accustomed to Windows and Unix newlines that I forgot the PDF spec
recognizes those as well.

---- 
Thanks,
Adam



Re: Image data sometimes contains EI. This however should not end the image data.

Posted by "martijn.list" <ma...@gmail.com>.

> I'm still a little concerned about a <0x0A>EI<0x20> appearing within
> the Image data though, but it looks like the spec simply doesn't
> allow for that data to be in an Image. I'm not sure how much better
> we can do...

Yes, the specs do not explain how to handle cases where the image data
contains the EI keyword. What's strange is that they do not mention this
anywhere (at least I did not find it). 8.9.7 says:

"The bytes between the ID and EI operators shall be treated the same as
a stream object's data (see 7.3.8, "Stream Objects"), even though they
do not follow the standard stream syntax."

Streams, however, normally carry the length of the stream, so in
principle it should not be a problem when the data contains the endstream
keyword. The BI dictionary, however, does not contain the length of the
image data section.

7.3.8 mentions that "There should be an end-of-line marker after the
data...". EOL marker is defined as:

4.20
end-of-line marker (EOL marker)
one or two character sequence marking the end of a line of text,
consisting of a CARRIAGE RETURN character (0Dh) or a LINE FEED character
(0Ah) or a CARRIAGE RETURN followed immediately by a LINE FEED

So I think your suggestion is correct. However all EOL markers should be
included.

The following EI sequences should mark the end of the image data (ID):

<0x0D>EI<WS>
<0x0A>EI<WS>
<0x0D><0x0A>EI<WS>

where <WS> is a whitespace.
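
The three delimiters above could be searched for roughly as follows.
This is only a sketch under stated assumptions, not the PDFStreamParser
implementation: the class and method names are hypothetical, and treating
<0x0D><0x0A> as a single EOL marker (so the <0x0D> is not image data) is
just one of the two possible readings.

```java
final class InlineImageEndFinder {
    /** Whitespace bytes as listed in ISO 32000-1, Table 1. */
    static boolean isWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Scans content, where from is the offset of the first image data byte.
     * Returns the exclusive end offset of the image data, i.e. the index of
     * the first byte of the EOL marker in front of the terminating EI, or
     * -1 if no EOL + EI + whitespace sequence is found.
     */
    static int findImageDataEnd(byte[] content, int from) {
        // i + 2 < length: a whitespace byte must follow the EI.
        for (int i = from + 1; i + 2 < content.length; i++) {
            boolean eiThenWs = content[i] == 'E' && content[i + 1] == 'I'
                    && isWhitespace(content[i + 2] & 0xFF);
            if (!eiThenWs) {
                continue;
            }
            int prev = content[i - 1] & 0xFF;
            if (prev == 0x0D) {
                return i - 1;                 // <0x0D>EI<WS>
            }
            if (prev == 0x0A) {
                // Assumption: CR LF counts as one two-byte EOL marker.
                boolean crlf = i - 2 >= from
                        && (content[i - 2] & 0xFF) == 0x0D;
                return crlf ? i - 2 : i - 1;  // <0x0D><0x0A>EI<WS> or <0x0A>EI<WS>
            }
            // Any other preceding byte (e.g. 0x0C): treat EI as image data.
        }
        return -1;
    }
}
```

With this reading, an EI preceded by a form feed is skipped and the scan
continues until an EI that is preceded by an EOL marker.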


If the image data itself contains one of these sequences, the image data
cannot be extracted correctly. The PDF spec should imho be clearer about
how to handle these exceptional cases.


Kind regards,

Martijn




Re: Image data sometimes contains EI. This however should not end the image data.

Posted by Ad...@swmc.com.

I see what you mean about the spec being vague. In the examples in the
PDF spec, the "EI" is always on its own line (implying a newline). Given this,
and the data you found, I'd like to propose that we look for "<0x0A>EI"
followed by whitespace. I'm choosing whitespace simply because that's what
the PDF spec says should come after ID in section 8.9.7; although it
doesn't explicitly say what should follow EI, it'd make sense to be
consistent with the ID operator. Whitespace is defined in the PDF spec
in Table 1 (Section 7.2.2, Character Set) as 0x00, 0x09, 0x0A, 0x0C, 0x0D or
0x20.

It seems like this would take care of the PDF in question and be a 
reasonable way to interpret the spec.
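
As a rough illustration of that rule, here is a small Java sketch. The
class and method names are made up for this example and are not existing
PDFBox API; the whitespace set is the one from Table 1 listed above.

```java
final class LfEiWhitespaceRule {
    // Lookup table for the six whitespace bytes: NUL, TAB, LF, FF, CR, SPACE.
    private static final boolean[] WHITESPACE = new boolean[256];
    static {
        for (int b : new int[] {0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20}) {
            WHITESPACE[b] = true;
        }
    }

    static boolean isWhitespace(int b) {
        return b >= 0 && b < 256 && WHITESPACE[b];
    }

    /**
     * True if buf[pos] and buf[pos + 1] are "EI", the byte before them is
     * a line feed (0x0A) and the byte after them is whitespace.
     */
    static boolean isEndOfImage(byte[] buf, int pos) {
        return pos >= 1 && pos + 2 < buf.length
            && buf[pos - 1] == 0x0A
            && buf[pos] == 'E' && buf[pos + 1] == 'I'
            && isWhitespace(buf[pos + 2] & 0xFF);
    }
}
```

Applied to the two sequences from the report, <0x0C>EI<0x0A> is rejected
because the preceding byte is a form feed, while <0x0A>EI<0x20> is accepted.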

I'm still a little concerned about a <0x0A>EI<0x20> appearing within the 
Image data though, but it looks like the spec simply doesn't allow for 
that data to be in an Image.  I'm not sure how much better we can do...

For reference, I'm using ISO 32000-1:2008 as "the PDF spec".

---- 
Thanks,
Adam


