Posted to server-dev@james.apache.org by Oleg Kalnichevski <ol...@apache.org> on 2008/07/17 20:48:44 UTC

Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > Stefano Bagnara wrote:
> >> Stefano Bagnara wrote:
> >>> I noticed that at some point in the past the EOLConvertingInputStream
> >>> was removed from the chain.
> >>>
> >>> I think this creates issues when we parse an input file having only \n
> >>> and write it to output.
> >>>
> >>> - It seems that most of the code parses by checking only for \n (what
> >>> happens when there are only \r instead? What should we do?)
> >>>
> > 
> > As far as I know a single CR is not used as a valid line delimiter 
> > anywhere. Please correct me if I am wrong.
> 
> AFAIK old Mac OS (before X) uses CR as its line delimiter,
> in the same way that Unix systems use LF.
> 
> >>> - If the message has only LF newlines, it seems mime4j ends up
> >>> outputting headers with CRLF and the body with LF.
> > 
> > Why is it a problem? Headers serve a specific role. They convey metadata 
> > about a content body. The transport aspects of metadata are irrelevant, 
> > whereas one _usually_ does not want a content body to go through a
> > process of unnecessary transformation.
> 
> I don't understand how that "specific role" relates to the RFC: I'm
> talking about RFC compliance and real-world cases as two different things.
> First we have to understand what it means to be RFC compliant,
> what valid MIME content is, and what valid "permissive parsing" is
> from the RFC point of view (as an example, if we had not read the RFC we
> would now have a non-compliant MIME parser, because the outer boundaries
> would not take precedence over the nested boundaries).
> 
> You may know that there are specific MIME contents (e.g. delivery
> notifications) that have "header-style" lines in the content, so why
> are headers different from the body? Why should we convert headers to
> CRLF? Either we care about compliant output, or I don't understand why
> we should put CRLF in headers.
> 

> >>> - If the input message has CR-terminated lines, they are not recognized
> >>> by mime4j.
> >>>
> >>> IMHO either we accept LF, CR, and CRLF as CRLF or we only accept CRLF.
> >>>
> > 
> > I respectfully disagree.
> 
> That's good: disagreement allows discussion and helps us understand
> why something is good or bad.
> The important thing is that the mime4j community shares a goal; otherwise
> each of us will commit code that diverges from the goals of the others.
> We cannot simply change the behaviour of mime4j because one user needs
> it, without discussion or analysis.
> The RFC is our first resource, then we have real-world use cases to deal
> with, and user requirements are on a third layer and have to comply with
> the previous two.
> 
> Maybe the right solution is making the behaviour configurable, I don't 
> know this, but I think that it's clear we need to discuss the issue 
> because otherwise we simply move away from the RFCs.
> 
> >>> If we do that we have to take care of encoded nested messages: they 
> >>> could have again LF, CR and CRLF like the top stream.
> >>>
> >>>
> >>> What is the right approach? Should we add an EOLConvertingInputStream
> >>> (CONVERT_BOTH) to every level of parsing or should we fail to parse
> >>> messages with bad newlines?
> >>>
> >>> I don't like the current behaviour where we accept some malformed
> >>> data (a lone LF is treated as CRLF by our parser), we change some
> >>> of it (the line endings between headers are converted to CRLF) and we
> >>> still output malformed data.
> >>>
> >>> Opinions?
> >>
> >> I tried this patch and it seems to work fine (even if it breaks one of
> >> our core tests, which does not expect a CR in a header to be treated
> >> as a newline):
> >>
> > 
> > Not only does this change completely revert the performance gains and
> > make the whole refactoring exercise completely pointless due to an
> > utterly inefficient implementation of EOLConvertingInputStream, it is
> > also conceptually wrong (in my humble opinion), as it causes mime4j to
> > corrupt 8bit encoded 'application/octet-stream' content. This basically
> > renders mime4j incompatible with common browsers and HttpClient.
> 
> The performance of the EOLConvertingInputStream is not important at all
> if removing it leaves us with an unusable library. So let's talk about what
> we expect from the library, then we'll discuss how to make it performant. I
> believe we have the technical skills to make a performant EOLConverting stream.
> 
> About the 8bit encoded 'application/octet-stream', I think we just need
> to find the right RFC telling us what we have to do: the RFCs I have read
> about MIME and its applications always say that CR and LF must not appear
> alone and that the appropriate transfer encoding has to be used in
> order to avoid isolated LF and CR: it is not a matter of personal
> preference, it is a matter of RFC compliance. Let's find the docs first.
> 
> What I can find as definition of "8bit" (RFC-2045 Section 2.8) is:
> -------------------
> "8bit data" refers to data that is all represented as relatively
> short lines with 998 octets or less between CRLF line separation
> sequences [RFC-821]), but octets with decimal values greater than 127
> may be used.  As with "7bit data" CR and LF octets only occur as part
> of CRLF line separation sequences and no NULs are allowed.
> -------------------
> 

Stefano,

You are very welcome to impose whatever strict interpretation of the
relevant RFCs your heart desires. Just please leave an option
to override it so that the mime4j parser can be used to parse
real-world content.

Oleg

> So this would say that 8bit encoded 'application/octet-stream' content
> still has lines of at most 998 octets and does not include isolated CR and LF.
> 
> We have to understand whether the real world has abused the 8bit
> specification or whether there is some MIME extension we are not
> considering: this is important, otherwise we will be the next abuser of
> the RFC. The Apache JAMES PMC has agreed (in the past, multiple times) that
> we have to be strict about the MIME written by mime4j and permissive with input.
> 
> > If you commit this change could you please provide an option to exclude 
> > EOLConvertingInputStream filter?
> 
> I'm not going to commit anything without agreement on what we want to
> do. If *I* am the only one who cares about the RFC we can even ignore
> this thread entirely, but my duty as a PMC member is to raise issues like
> this and to let the community decide.
> 
> > Thank you
> 
> Thank you, too!
> 
> > Oleg
> 
> Stefano
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
> For additional commands, e-mail: server-dev-help@james.apache.org
> 



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.

Oleg Kalnichevski wrote:
> On Fri, 2008-07-18 at 09:56 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> Not only does this change completely revert the performance gains and
>>>>> make the whole refactoring exercise completely pointless due to an
>>>>> utterly inefficient implementation of EOLConvertingInputStream, it is
>>>>> also conceptually wrong (in my humble opinion), as it causes mime4j to
>>>>> corrupt 8bit encoded 'application/octet-stream' content. This basically
>>>>> renders mime4j incompatible with common browsers and HttpClient.
>>>> The performance of the EOLConvertingInputStream is not important at all
>>>> if removing it leaves us with an unusable library. So let's talk about what
>>>> we expect from the library, then we'll discuss how to make it performant. I
>>>> believe we have the technical skills to make a performant EOLConverting stream.
>>>>
>>>> About the 8bit encoded 'application/octet-stream', I think we just need
>>>> to find the right RFC telling us what we have to do: the RFCs I have read
>>>> about MIME and its applications always say that CR and LF must not appear
>>>> alone and that the appropriate transfer encoding has to be used in
>>>> order to avoid isolated LF and CR: it is not a matter of personal
>>>> preference, it is a matter of RFC compliance. Let's find the docs first.
>>>>
>>>> What I can find as definition of "8bit" (RFC-2045 Section 2.8) is:
>>>> -------------------
>>>> "8bit data" refers to data that is all represented as relatively
>>>> short lines with 998 octets or less between CRLF line separation
>>>> sequences [RFC-821]), but octets with decimal values greater than 127
>>>> may be used.  As with "7bit data" CR and LF octets only occur as part
>>>> of CRLF line separation sequences and no NULs are allowed.
>>>> -------------------
>>>>
>>> Stefano,
>>>
>>> You are very welcome to impose whatever strict interpretation of the
>>> relevant RFCs your heart desires. Just please leave an option
>>> to override it so that the mime4j parser can be used to parse
>>> real-world content.
>> Oleg, don't get me wrong. I simply want to make sure we all understand
>> what the RFCs say and understand the specific cases where we are ignoring
>> them and WHY.
>>
>> In the case of the outer boundary we introduced backward compatibility
>> issues in the name of performance, mainly because of a lack of knowledge of
>> the RFCs. I'm not an expert either, but I think it is important to at
>> least take them into consideration once we find the right docs.
>>
>> I'm not saying that we MUST be 100% compliant and strict, but I want to
>> make sure we know when we are doing something non-compliant and that we
>> agree that it is good.
>>
>> One of the main goals is interoperability, so every time we do something
>> different from what the RFCs tell us we have to make sure we are not breaking
>> interoperability with other RFC-compliant tools.
>>
>> I'm far from being a MIME expert, so I find it difficult to keep up with 
>> this discussion if I have to convince people of something. I just want 
>> to share my little knowledge about the (mainly SMTP related) RFCs.
>>
>> Stefano
>>
> 
> Stefano,
> 
> The core of this issue is not about standards compliance. I am fine with
> mime4j being strict in its interpretation of relevant RFCs by default.
> However, the idea of _indiscriminate_ conversion of line delimiters
> regardless of their occurrence in the data stream seems _very_, _very_
> __conceptually__ wrong to me.
> 
> I can't help feeling that Ayatollah-style orthodoxy about line
> delimiter handling just does not really help anyone. Fortunately for
> JAMES, MTAs and MUAs are too complex to be written by complete muppets.
> We do not have that privilege in the HTTP world where one has no other
> choice but to interoperate with tons of HTTP agents and CGI scripts
> written with a complete disregard of standards. So, in the
> HttpComponents project we have a very simple policy: be lenient about
> parsing, be strict about formatting. That seems to work well for _us_.
> 
> Oleg

"be lenient about parsing, be strict about formatting" is exactly what 
the JAMES PMC agreed in the guidelines.

Conversion should not be done at all, but we want to be lenient, so we do
some conversion to support some non-compliant agents. I also agree that
conversion may not be appropriate in every case, and that's why we are
discussing it. It is not worth continuing to discuss this issue at this
high level; instead I would like to keep our focus on real solutions.
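
As a minimal sketch of what "lenient about parsing, strict about
formatting" could mean for line delimiters in textual content such as
header lines: the reader below accepts CRLF, a bare LF or a bare CR as a
line end, and the writer always emits CRLF. The class and method names are
hypothetical and are not part of mime4j; ISO-8859-1 is assumed only because
it maps every octet to exactly one char.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not mime4j API.
class LenientStrictSketch {

    // Lenient parsing: CRLF, bare LF and bare CR all end a line.
    static List<String> parseLines(byte[] raw) {
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        List<String> lines = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n') {
                lines.add(text.substring(start, i));
                // Treat CR immediately followed by LF as one delimiter.
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                i++;
                start = i;
            } else {
                i++;
            }
        }
        // Keep a trailing line that has no delimiter at all.
        if (start < text.length()) {
            lines.add(text.substring(start));
        }
        return lines;
    }

    // Strict formatting: every line is terminated with CRLF.
    static byte[] formatLines(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append("\r\n");
        }
        return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

Passing the result of parseLines to formatLines turns mixed delimiters
into CRLF only. This is only meaningful for line-oriented text; whether it
should ever be applied to a body is exactly the open question here.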

In fact, after reading a lot of RFCs today, I think that what you get
from HTTP is perfectly standard behaviour (I'm not sure whether they omit
the "Content-Transfer-Encoding: binary" header or whether some RFC defines
it as the default for HTTP), but I found RFC 1867 saying that it is common
to use the binary transfer encoding in multipart/form-data MIME parts in
HTTP. So what you want is probably what the RFCs ask us
to implement, but first we understand things and then we do things ;-)
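
One way this could translate into code, as a hedged sketch only: decide per
body part whether line-ending normalisation is allowed at all, based on the
declared Content-Transfer-Encoding, and never touch "binary" content. The
class, the method and the treatMissingAsBinary flag are hypothetical (they
are not mime4j API). The flag stands for the open question above: RFC 2045
assumes 7bit when the header is absent, while an HTTP-oriented caller may
prefer to treat a missing header as binary.

```java
import java.util.Locale;

// Hypothetical illustration class; not mime4j API.
class EolPolicySketch {

    // Returns true if a parser may normalise line endings in a body part
    // with the given Content-Transfer-Encoding value (null if absent).
    static boolean mayNormalizeLineEndings(String contentTransferEncoding,
                                           boolean treatMissingAsBinary) {
        if (contentTransferEncoding == null) {
            // No header: the caller decides which default applies.
            return !treatMissingAsBinary;
        }
        // Encoding names are case-insensitive tokens.
        String cte = contentTransferEncoding.trim().toLowerCase(Locale.ROOT);
        // Binary data has no line structure, so CR and LF are plain octets.
        return !cte.equals("binary");
    }
}
```

With a switch like this the strict behaviour can stay the default for mail,
while a caller parsing multipart/form-data from HTTP can opt out.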

And be sure that the same issues you find with HTTP clients also exist
with MUAs and MTAs. Muppets are all around, and we care about the RFCs so
much exactly because we don't want other people to call us muppets ;-)

Please read the other messages I posted in this thread today, because I
think they are more concrete and constructive than this leaf of the thread.

Stefano


Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.

On Fri, 2008-07-18 at 09:56 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> >> Oleg Kalnichevski wrote:
> >>> Not only does this change completely revert the performance gains and
> >>> make the whole refactoring exercise completely pointless due to an
> >>> utterly inefficient implementation of EOLConvertingInputStream, it is
> >>> also conceptually wrong (in my humble opinion), as it causes mime4j to
> >>> corrupt 8bit encoded 'application/octet-stream' content. This basically
> >>> renders mime4j incompatible with common browsers and HttpClient.
> >> The performance of the EOLConvertingInputStream is not important at all
> >> if removing it leaves us with an unusable library. So let's talk about what
> >> we expect from the library, then we'll discuss how to make it performant. I
> >> believe we have the technical skills to make a performant EOLConverting stream.
> >>
> >> About the 8bit encoded 'application/octet-stream', I think we just need
> >> to find the right RFC telling us what we have to do: the RFCs I have read
> >> about MIME and its applications always say that CR and LF must not appear
> >> alone and that the appropriate transfer encoding has to be used in
> >> order to avoid isolated LF and CR: it is not a matter of personal
> >> preference, it is a matter of RFC compliance. Let's find the docs first.
> >>
> >> What I can find as definition of "8bit" (RFC-2045 Section 2.8) is:
> >> -------------------
> >> "8bit data" refers to data that is all represented as relatively
> >> short lines with 998 octets or less between CRLF line separation
> >> sequences [RFC-821]), but octets with decimal values greater than 127
> >> may be used.  As with "7bit data" CR and LF octets only occur as part
> >> of CRLF line separation sequences and no NULs are allowed.
> >> -------------------
> >>
> > 
> > Stefano,
> > 
> > You are very welcome to impose whatever strict interpretation of the
> > relevant RFCs your heart desires. Just please leave an option
> > to override it so that the mime4j parser can be used to parse
> > real-world content.
> 
> Oleg, don't get me wrong. I simply want to make sure we all understand
> what the RFCs say and understand the specific cases where we are ignoring
> them and WHY.
> 
> In the case of the outer boundary we introduced backward compatibility
> issues in the name of performance, mainly because of a lack of knowledge of
> the RFCs. I'm not an expert either, but I think it is important to at
> least take them into consideration once we find the right docs.
> 
> I'm not saying that we MUST be 100% compliant and strict, but I want to
> make sure we know when we are doing something non-compliant and that we
> agree that it is good.
> 
> One of the main goals is interoperability, so every time we do something
> different from what the RFCs tell us we have to make sure we are not breaking
> interoperability with other RFC-compliant tools.
> 
> I'm far from being a MIME expert, so I find it difficult to keep up with 
> this discussion if I have to convince people of something. I just want 
> to share my little knowledge about the (mainly SMTP related) RFCs.
> 
> Stefano
> 

Stefano,

The core of this issue is not about standards compliance. I am fine with
mime4j being strict in its interpretation of relevant RFCs by default.
However, the idea of _indiscriminate_ conversion of line delimiters
regardless of their occurrence in the data stream seems _very_, _very_
__conceptually__ wrong to me.
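
To make the concern concrete, here is a hedged sketch. It is not the actual
EOLConvertingInputStream code; convertAllToCrlf is a hypothetical stand-in
for any filter that rewrites every CR, LF or CRLF as CRLF without looking
at what kind of content the octets belong to.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical demo class; not part of mime4j.
class EolCorruptionDemo {

    // Rewrites every bare CR, bare LF or CRLF pair as CRLF,
    // regardless of what the bytes represent.
    static byte[] convertAllToCrlf(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 16);
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            if (b == '\r') {
                out.write('\r');
                out.write('\n');
                // Swallow the LF of an existing CRLF pair.
                if (i + 1 < in.length && in[i + 1] == '\n') {
                    i++;
                }
            } else if (b == '\n') {
                out.write('\r');
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Four arbitrary binary octets that happen to include 0x0A and 0x0D.
        byte[] binary = {0x00, 0x0A, (byte) 0xFF, 0x0D};
        byte[] converted = convertAllToCrlf(binary);
        System.out.println(binary.length + " -> " + converted.length);
        System.out.println(Arrays.equals(binary, converted));
    }
}
```

Running main prints "4 -> 6" and "false": the payload grows by two octets
and no longer matches what was sent, which is harmless for text but fatal
for an uploaded binary file.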

I can't help feeling that Ayatollah-style orthodoxy about line
delimiter handling just does not really help anyone. Fortunately for
JAMES, MTAs and MUAs are too complex to be written by complete muppets.
We do not have that privilege in the HTTP world where one has no other
choice but to interoperate with tons of HTTP agents and CGI scripts
written with a complete disregard of standards. So, in the
HttpComponents project we have a very simple policy: be lenient about
parsing, be strict about formatting. That seems to work well for _us_.

Oleg




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.

Oleg Kalnichevski wrote:
> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> Not only does this change completely revert the performance gains and
>>> make the whole refactoring exercise completely pointless due to an
>>> utterly inefficient implementation of EOLConvertingInputStream, it is
>>> also conceptually wrong (in my humble opinion), as it causes mime4j to
>>> corrupt 8bit encoded 'application/octet-stream' content. This basically
>>> renders mime4j incompatible with common browsers and HttpClient.
>> The performance of the EOLConvertingInputStream is not important at all
>> if removing it leaves us with an unusable library. So let's talk about what
>> we expect from the library, then we'll discuss how to make it performant. I
>> believe we have the technical skills to make a performant EOLConverting stream.
>>
>> About the 8bit encoded 'application/octet-stream', I think we just need
>> to find the right RFC telling us what we have to do: the RFCs I have read
>> about MIME and its applications always say that CR and LF must not appear
>> alone and that the appropriate transfer encoding has to be used in
>> order to avoid isolated LF and CR: it is not a matter of personal
>> preference, it is a matter of RFC compliance. Let's find the docs first.
>>
>> What I can find as definition of "8bit" (RFC-2045 Section 2.8) is:
>> -------------------
>> "8bit data" refers to data that is all represented as relatively
>> short lines with 998 octets or less between CRLF line separation
>> sequences [RFC-821]), but octets with decimal values greater than 127
>> may be used.  As with "7bit data" CR and LF octets only occur as part
>> of CRLF line separation sequences and no NULs are allowed.
>> -------------------
>>
> 
> Stefano,
> 
> You are very welcome to impose whatever strict interpretation of the
> relevant RFCs your heart desires. Just please leave an option
> to override it so that the mime4j parser can be used to parse
> real-world content.

Oleg, don't get me wrong. I simply want to make sure we all understand
what the RFCs say and understand the specific cases where we are ignoring
them and WHY.

In the case of the outer boundary we introduced backward compatibility
issues in the name of performance, mainly because of a lack of knowledge of
the RFCs. I'm not an expert either, but I think it is important to at
least take them into consideration once we find the right docs.

I'm not saying that we MUST be 100% compliant and strict, but I want to
make sure we know when we are doing something non-compliant and that we
agree that it is good.

One of the main goals is interoperability, so every time we do something
different from what the RFCs tell us we have to make sure we are not breaking
interoperability with other RFC-compliant tools.

I'm far from being a MIME expert, so I find it difficult to keep up with 
this discussion if I have to convince people of something. I just want 
to share my little knowledge about the (mainly SMTP related) RFCs.

Stefano
