
Posted to server-dev@james.apache.org by Oleg Kalnichevski <ol...@apache.org> on 2008/07/18 15:19:54 UTC

Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
> >> Oleg Kalnichevski wrote:
> >>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> >>>> Oleg Kalnichevski wrote:
> >>>>> Stefano Bagnara wrote:
> > 

...

> As I said, the strict mode would only be useful to users of mime4j
> wanting to use mime4j as a validator to check RFC compliance. You know,
> mime4j was born for SMTP, but now you need it for HTTP and someone else
> may want to build a validator. So let's not keep our eyes closed once again.
> 

OK, I fail to see any practical benefit of that aside from a nice warm
feeling about being 100% compliant, but I admit I am biased.

> > Anyways, let's talk code now. How about this?
> > 
> > (1)
> > 
> > interface LineDelimiterStrategy {
> > 
> >  boolean isNewLine(char ch1, char ch2) // both can be -1
> > 	throws MimeException;
> > 
> > }
> > 
> > One can provide MimeTokenStream with an implementation of this interface
> > at construction time. MimeTokenStream in its turn passes a
> > reference to that object to all parser components that need to deal with
> > line delimiters.
> 
> I'm not sure I understand what the 2 params passed to isNewLine are, and
> what code will invoke this service.
> 

Two consecutive characters read from the data stream, or -1 if either
of those characters is not available.
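
A Java char cannot hold -1, so a compilable version of the interface
sketched above needs int parameters, as InputStream.read() uses. Below is
a minimal sketch under that assumption. MimeException is declared locally
as a stand-in so the example compiles on its own, and the two
implementations (LenientLineDelimiterStrategy, StrictLineDelimiterStrategy)
are hypothetical names for illustration, not existing mime4j classes.

```java
// Stand-in for mime4j's MimeException so the sketch compiles on its own.
class MimeException extends Exception {
    MimeException(String message) {
        super(message);
    }
}

// The proposed interface, with int parameters so that -1 ("no character") fits.
interface LineDelimiterStrategy {
    boolean isNewLine(int ch1, int ch2) throws MimeException;
}

// Hypothetical lenient strategy: a line ends at LF, whether or not CR precedes it.
class LenientLineDelimiterStrategy implements LineDelimiterStrategy {
    public boolean isNewLine(int ch1, int ch2) {
        return ch2 == '\n';
    }
}

// Hypothetical strict strategy: only CRLF ends a line; a bare LF is an error.
class StrictLineDelimiterStrategy implements LineDelimiterStrategy {
    public boolean isNewLine(int ch1, int ch2) throws MimeException {
        if (ch2 != '\n') {
            return false;
        }
        if (ch1 != '\r') {
            throw new MimeException("LF not preceded by CR");
        }
        return true;
    }
}
```

The lenient variant would suit callers that must accept both LF and CRLF,
while the strict variant is the kind of check a validator might want.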


> > (2) The issue of CR / LF handling in content bodies should be taken care
> > of when formatting output, _not_ when parsing input.
> > 
> > Would that work for you?
> 
> I'm not sure this is enough.
> In output we format what we parsed: if we parsed the input as multiple
> lines then we output multiple lines, otherwise we output a single line.
> So it is during parsing that we have to decide whether an isolated LF is
> a newline delimiter or not.

But mime4j does not parse _content bodies_ as multiple lines, does it?

At this point I think I have to give up. Whatever you end up doing
_please_ do not wrap the raw data stream with EOLConvertingInputStream.

Cheers

Oleg 


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.

Oleg Kalnichevski wrote:
> Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> On Fri, 2008-07-18 at 16:19 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
>>>>>> Oleg Kalnichevski wrote:
>>>>>>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>>>>>>>> Oleg Kalnichevski wrote:
>>>>>>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>>>>>>>> Oleg Kalnichevski wrote:
>>>>>>>>>>> Stefano Bagnara wrote:
>>>>> ...
>>>>>
>>>>>> As I said the strict mode would only be useful to users of mime4j 
>>>>>> wanting to use mime4j as a validator to check RFC compliance. You 
>>>>>> know, mime4j born for SMTP, but now you need it for HTTP and 
>>>>>> someone else may want to do a validator. So let's not keep our 
>>>>>> eyes closed once again.
>>>>>>
>>>>> OK, I fail to see any practical benefit of that aside from a nice warm
>>>>> feeling about being 100% compliant, but I admit I am biased.
>>>>>
>>>>>>> Anyways, let's talk code now. How about this?
>>>>>>>
>>>>>>> (1)
>>>>>>>
>>>>>>> interface LineDelimiterStrategy {
>>>>>>>
>>>>>>>  boolean isNewLine(char ch1, char ch2) // both can be -1
>>>>>>>     throws MimeException;
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> One can provide MimeTokenStream with an implementation of this 
>>>>>>> interface
>>>>>>> at the construction time. MimeTokenStream it its turn passes a
>>>>>>> reference to that class to all parser components that need to 
>>>>>>> deal with
>>>>>>> line delimiters.
>>>>>> I'm not sure I understand what are the 2 params passed to 
>>>>>> isNewLine and what code will invoke this service.
>>>>>>
>>>>> 2 consecutive characters read from the data stream or -1 if any of 
>>>>> those
>>>>> characters is not available. 
>>>> so "a\r\nb" would result in the calls:
>>>> isNewLine(-1,'a');
>>>> isNewLine('a','\r');
>>>> isNewLine('\r','\n');
>>>> isNewLine('\n','b');
>>>> isNewLine('b',-1);
>>>> is this correct? What would be the result for the 5 above from the 
>>>> implementation that will be fine in HTTP?
>>>>
>>>
>>> Anything that allows:
>>>
>>> line delimiter = (LF|CRLF)
>>
>> I understood this, but I'm not following you on how your do this with 
>> the Interface you was proposing.
>> Given your rule you have true on the 3rd and the 4th call? Wouldn't 
>> this result in 2 newlines?
>>
> 
> I do not think so, only a sequence with ch2 = '\n' would be considered a 
> valid line delimiter. I realized, though, the problem with this 
> interface is that it implied a one byte read I had thought we wanted to 
> get rid of.

I understand it now, thank you!

>>>>>>> (2) The issue of CR / LF handling in content bodies should be 
>>>>>>> taken of
>>>>>>> when formatting output, _not_ when parsing input.
>>>>>>>
>>>>>>> Would that work for you?
>>>>>> I'm not sure this is enough.
>>>>>> In output we format what we parser: if we parsed the input as 
>>>>>> multiple lines then we output multiple lines, otherwise we output 
>>>>>> a single line. So it is during parsing that we have to decide 
>>>>>> whether an isolated LF is a newline delimiter or not.
>>>>> But mime4j does not parse _content bodies_ as multiple lines, does it?
>>>> TextBody.getReader()
>>>>
>>>
>>> But that does not necessarily imply parsing into multiple lines, does
>>> it? Anyways, I perfectly am fine with TexyBody automatically converting
>>> line delimiters. IMHO this is the right place to do the conversion, but
>>> not the MimeTokenStream
>>
>> You are right, the Reader does not imply line parsing, but anyway 
>> somewhere we have to deal with lines.
>> Mime4J basic classes (the whole LineReaderInputStream hierarchy) have 
>> indeed a readLine method. This just made me realize that the internal 
>> buffer is filled with lines and that sending a very long binary make 
>> mime4j die with OOM.
> 
> No, it would not. Binary content is not read line by line. The #readLine 
> method is only used when parsing metadata (header fields), where we do 
> need to put a cap on the max line length, as discussed before.

My fault: I had code casting to LineReaderInputStream and using readLine
to get the content, but the method indeed returns only an InputStream,
so there is no way to trigger the OOM without using a cast.

As for the line length limit, we really need it: a random sequence of
non-LF chars currently makes our code throw an OOM.
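
A minimal sketch of such a cap, assuming the limit counts bytes and
includes the line terminator. BoundedLineReader and its readLine method
are hypothetical and not the LineReaderInputStream API; the sketch reads
one byte at a time only for brevity, whereas a real implementation would
scan a buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class BoundedLineReader {
    // Reads bytes up to and including the next LF and returns them.
    // Returns null at end of stream when nothing was read.
    // Throws if a line would exceed maxLineLength bytes (terminator included),
    // so input without any LF cannot grow the buffer without bound.
    static byte[] readLine(InputStream in, int maxLineLength) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (buf.size() >= maxLineLength) {
                throw new IOException("Line exceeds " + maxLineLength + " bytes");
            }
            buf.write(b);
            if (b == '\n') {
                return buf.toByteArray();
            }
        }
        // End of stream: hand back a final unterminated line, if any.
        return buf.size() > 0 ? buf.toByteArray() : null;
    }
}
```

With this shape, a header parser would fail fast on a pathological line
instead of buffering it whole.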

Stefano


Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.

Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
>> On Fri, 2008-07-18 at 16:19 +0200, Stefano Bagnara wrote:
>>> Oleg Kalnichevski wrote:
>>>> On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
>>>>> Oleg Kalnichevski wrote:
>>>>>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>>>>>>> Oleg Kalnichevski wrote:
>>>>>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>>>>>>> Oleg Kalnichevski wrote:
>>>>>>>>>> Stefano Bagnara wrote:
>>>> ...
>>>>
>>>>> As I said the strict mode would only be useful to users of mime4j 
>>>>> wanting to use mime4j as a validator to check RFC compliance. You 
>>>>> know, mime4j born for SMTP, but now you need it for HTTP and 
>>>>> someone else may want to do a validator. So let's not keep our eyes 
>>>>> closed once again.
>>>>>
>>>> OK, I fail to see any practical benefit of that aside from a nice warm
>>>> feeling about being 100% compliant, but I admit I am biased.
>>>>
>>>>>> Anyways, let's talk code now. How about this?
>>>>>>
>>>>>> (1)
>>>>>>
>>>>>> interface LineDelimiterStrategy {
>>>>>>
>>>>>>  boolean isNewLine(char ch1, char ch2) // both can be -1
>>>>>>     throws MimeException;
>>>>>>
>>>>>> }
>>>>>>
>>>>>> One can provide MimeTokenStream with an implementation of this 
>>>>>> interface
>>>>>> at the construction time. MimeTokenStream it its turn passes a
>>>>>> reference to that class to all parser components that need to deal 
>>>>>> with
>>>>>> line delimiters.
>>>>> I'm not sure I understand what are the 2 params passed to isNewLine 
>>>>> and what code will invoke this service.
>>>>>
>>>> 2 consecutive characters read from the data stream or -1 if any of 
>>>> those
>>>> characters is not available. 
>>> so "a\r\nb" would result in the calls:
>>> isNewLine(-1,'a');
>>> isNewLine('a','\r');
>>> isNewLine('\r','\n');
>>> isNewLine('\n','b');
>>> isNewLine('b',-1);
>>> is this correct? What would be the result for the 5 above from the 
>>> implementation that will be fine in HTTP?
>>>
>>
>> Anything that allows:
>>
>> line delimiter = (LF|CRLF)
> 
> I understood this, but I'm not following you on how your do this with 
> the Interface you was proposing.
> Given your rule you have true on the 3rd and the 4th call? Wouldn't this 
> result in 2 newlines?
> 

I do not think so: only a sequence with ch2 = '\n' would be considered a
valid line delimiter. I realize, though, that the problem with this
interface is that it implies one-byte reads, which I thought we wanted
to get rid of.
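
To make the answer concrete, here is a minimal sketch that feeds the five
pairs for "a\r\nb" through the lenient rule (ch2 == '\n'). It assumes int
parameters so that -1 can be passed; PairwiseScanDemo and delimiterPairs
are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PairwiseScanDemo {
    // Lenient rule: a pair ends a line only when its second character is LF.
    static boolean isNewLine(int ch1, int ch2) {
        return ch2 == '\n';
    }

    // Feeds every pair of consecutive characters to isNewLine, using -1 where
    // no character is available, and returns the pairs reported as delimiters
    // (as decimal character codes).
    static List<String> delimiterPairs(String data) {
        List<String> hits = new ArrayList<String>();
        int prev = -1;
        for (int i = 0; i <= data.length(); i++) {
            int next = i < data.length() ? data.charAt(i) : -1;
            if (isNewLine(prev, next)) {
                hits.add(prev + "," + next);
            }
            prev = next;
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [13,10], i.e. only the ('\r','\n') pair
        System.out.println(delimiterPairs("a\r\nb"));
    }
}
```

Only the ('\r','\n') pair is reported, so CRLF yields one line break
rather than two; the ('\n','b') pair is rejected because ch2 is not LF.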


>>>>>> (2) The issue of CR / LF handling in content bodies should be 
>>>>>> taken of
>>>>>> when formatting output, _not_ when parsing input.
>>>>>>
>>>>>> Would that work for you?
>>>>> I'm not sure this is enough.
>>>>> In output we format what we parser: if we parsed the input as 
>>>>> multiple lines then we output multiple lines, otherwise we output a 
>>>>> single line. So it is during parsing that we have to decide whether 
>>>>> an isolated LF is a newline delimiter or not.
>>>> But mime4j does not parse _content bodies_ as multiple lines, does it?
>>> TextBody.getReader()
>>>
>>
>> But that does not necessarily imply parsing into multiple lines, does
>> it? Anyways, I perfectly am fine with TexyBody automatically converting
>> line delimiters. IMHO this is the right place to do the conversion, but
>> not the MimeTokenStream
> 
> You are right, the Reader does not imply line parsing, but anyway 
> somewhere we have to deal with lines.
> Mime4J basic classes (the whole LineReaderInputStream hierarchy) have 
> indeed a readLine method. This just made me realize that the internal 
> buffer is filled with lines and that sending a very long binary make 
> mime4j die with OOM.

No, it would not. Binary content is not read line by line. The #readLine 
method is only used when parsing metadata (header fields), where we do 
need to put a cap on the max line length, as discussed before.

Oleg


> We can fix this OOM during standard parsing by
> having an hard limit on the size (and throwing exception otherwise) but 
> we have to do this differently during the streaming of "binary" encoded 
> parts (line reading makes no sense there).
> 
> Furthermore, at the very minimum we have a RootInputStream only counting 
> lines if they are CRLF terminated. It seems weird that we count lines 
> only if their are CRLF terminated but we recognize them also if they are 
> LF ending (this is one more issue to be taken in consideration, not the 
> one we was talking about).
> 
> Stefano
> 
> 



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.

Oleg Kalnichevski wrote:
> On Fri, 2008-07-18 at 16:19 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>>>>>> Oleg Kalnichevski wrote:
>>>>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>>>>>> Oleg Kalnichevski wrote:
>>>>>>>>> Stefano Bagnara wrote:
>>> ...
>>>
>>>> As I said the strict mode would only be useful to users of mime4j 
>>>> wanting to use mime4j as a validator to check RFC compliance. You know, 
>>>> mime4j born for SMTP, but now you need it for HTTP and someone else may 
>>>> want to do a validator. So let's not keep our eyes closed once again.
>>>>
>>> OK, I fail to see any practical benefit of that aside from a nice warm
>>> feeling about being 100% compliant, but I admit I am biased.
>>>
>>>>> Anyways, let's talk code now. How about this?
>>>>>
>>>>> (1)
>>>>>
>>>>> interface LineDelimiterStrategy {
>>>>>
>>>>>  boolean isNewLine(char ch1, char ch2) // both can be -1
>>>>> 	throws MimeException;
>>>>>
>>>>> }
>>>>>
>>>>> One can provide MimeTokenStream with an implementation of this interface
>>>>> at the construction time. MimeTokenStream it its turn passes a
>>>>> reference to that class to all parser components that need to deal with
>>>>> line delimiters.
>>>> I'm not sure I understand what are the 2 params passed to isNewLine and 
>>>> what code will invoke this service.
>>>>
>>> 2 consecutive characters read from the data stream or -1 if any of those
>>> characters is not available. 
>> so "a\r\nb" would result in the calls:
>> isNewLine(-1,'a');
>> isNewLine('a','\r');
>> isNewLine('\r','\n');
>> isNewLine('\n','b');
>> isNewLine('b',-1);
>> is this correct? What would be the result for the 5 above from the 
>> implementation that will be fine in HTTP?
>>
> 
> Anything that allows:
> 
> line delimiter = (LF|CRLF)

I understood this, but I'm not following how you would do this with
the interface you were proposing.
Given your rule, do you return true on the 3rd and the 4th call? Wouldn't
this result in 2 newlines?

>>>>> (2) The issue of CR / LF handling in content bodies should be taken of
>>>>> when formatting output, _not_ when parsing input.
>>>>>
>>>>> Would that work for you?
>>>> I'm not sure this is enough.
>>>> In output we format what we parser: if we parsed the input as multiple 
>>>> lines then we output multiple lines, otherwise we output a single line. 
>>>> So it is during parsing that we have to decide whether an isolated LF is 
>>>> a newline delimiter or not.
>>> But mime4j does not parse _content bodies_ as multiple lines, does it?
>> TextBody.getReader()
>>
> 
> But that does not necessarily imply parsing into multiple lines, does
> it? Anyways, I perfectly am fine with TexyBody automatically converting
> line delimiters. IMHO this is the right place to do the conversion, but
> not the MimeTokenStream

You are right, the Reader does not imply line parsing, but somewhere we
still have to deal with lines.
The Mime4J basic classes (the whole LineReaderInputStream hierarchy) do
indeed have a readLine method. This just made me realize that the internal
buffer is filled with lines and that sending a very long binary makes
mime4j die with OOM. We can fix this OOM during standard parsing by
having a hard limit on the size (and throwing an exception otherwise), but
we have to do this differently during the streaming of "binary" encoded
parts (line reading makes no sense there).

Furthermore, at the very minimum we have a RootInputStream that only counts
lines if they are CRLF terminated. It seems weird that we count lines
only if they are CRLF terminated but also recognize them if they are
LF terminated (this is one more issue to be taken into consideration, not
the one we were talking about).
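
The mismatch can be illustrated with a small sketch that counts line ends
both ways over the same bytes. It is based only on the description in this
message, not on the RootInputStream source; LineCountDemo and its two
methods are hypothetical names.

```java
class LineCountDemo {
    // Counts line ends the way they are recognized during parsing:
    // every LF ends a line, whether or not a CR precedes it.
    static int countLenient(byte[] data) {
        int lines = 0;
        for (byte b : data) {
            if (b == '\n') {
                lines++;
            }
        }
        return lines;
    }

    // Counts only CRLF pairs, the counting behaviour described above.
    static int countCrlfOnly(byte[] data) {
        int lines = 0;
        for (int i = 1; i < data.length; i++) {
            if (data[i - 1] == '\r' && data[i] == '\n') {
                lines++;
            }
        }
        return lines;
    }
}
```

For input that mixes both terminators, the two counts diverge, which is
the inconsistency raised here.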

Stefano


Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.

On Fri, 2008-07-18 at 16:19 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
> >> Oleg Kalnichevski wrote:
> >>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
> >>>> Oleg Kalnichevski wrote:
> >>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> >>>>>> Oleg Kalnichevski wrote:
> >>>>>>> Stefano Bagnara wrote:
> > 
> > ...
> > 
> >> As I said the strict mode would only be useful to users of mime4j 
> >> wanting to use mime4j as a validator to check RFC compliance. You know, 
> >> mime4j born for SMTP, but now you need it for HTTP and someone else may 
> >> want to do a validator. So let's not keep our eyes closed once again.
> >>
> > 
> > OK, I fail to see any practical benefit of that aside from a nice warm
> > feeling about being 100% compliant, but I admit I am biased.
> > 
> >>> Anyways, let's talk code now. How about this?
> >>>
> >>> (1)
> >>>
> >>> interface LineDelimiterStrategy {
> >>>
> >>>  boolean isNewLine(char ch1, char ch2) // both can be -1
> >>> 	throws MimeException;
> >>>
> >>> }
> >>>
> >>> One can provide MimeTokenStream with an implementation of this interface
> >>> at the construction time. MimeTokenStream it its turn passes a
> >>> reference to that class to all parser components that need to deal with
> >>> line delimiters.
> >> I'm not sure I understand what are the 2 params passed to isNewLine and 
> >> what code will invoke this service.
> >>
> > 
> > 2 consecutive characters read from the data stream or -1 if any of those
> > characters is not available. 
> 
> so "a\r\nb" would result in the calls:
> isNewLine(-1,'a');
> isNewLine('a','\r');
> isNewLine('\r','\n');
> isNewLine('\n','b');
> isNewLine('b',-1);
> is this correct? What would be the result for the 5 above from the 
> implementation that will be fine in HTTP?
> 

Anything that allows:

line delimiter = (LF|CRLF)


> >>> (2) The issue of CR / LF handling in content bodies should be taken of
> >>> when formatting output, _not_ when parsing input.
> >>>
> >>> Would that work for you?
> >> I'm not sure this is enough.
> >> In output we format what we parser: if we parsed the input as multiple 
> >> lines then we output multiple lines, otherwise we output a single line. 
> >> So it is during parsing that we have to decide whether an isolated LF is 
> >> a newline delimiter or not.
> > 
> > But mime4j does not parse _content bodies_ as multiple lines, does it?
> 
> TextBody.getReader()
> 

But that does not necessarily imply parsing into multiple lines, does
it? Anyway, I am perfectly fine with TextBody automatically converting
line delimiters. IMHO this is the right place to do the conversion, not
the MimeTokenStream.
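
A sketch of what conversion at the text layer could look like: the raw
stream stays untouched, and bare LFs are expanded to CRLF only on already
decoded text. CrlfNormalizer and toCrlf are hypothetical names; nothing
here is taken from mime4j.

```java
class CrlfNormalizer {
    // Returns the text with every bare LF expanded to CRLF.
    // Existing CRLF pairs and isolated CRs are left as they are.
    static String toCrlf(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        char prev = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n' && prev != '\r') {
                out.append('\r');
            }
            out.append(c);
            prev = c;
        }
        return out.toString();
    }
}
```

Because the conversion runs on decoded characters, binary parts never pass
through it.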

> > At this point I think I have to give up. Whatever you end up doing
> > _please_ do not wrap the raw data stream with EOLConvertingInputStream.
> 
> Sure, I already excluded this: I now understand the "C-T-E: binary" issue.
> BTW I hope you will keep monitoring this issue so you can confirm 
> whatever solution we propose will be fine with your library?
> 

Sure.

Oleg


> Thank you,
> Stefano
> 
> 
> 



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.

Oleg Kalnichevski wrote:
> On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>>>> Oleg Kalnichevski wrote:
>>>>>>> Stefano Bagnara wrote:
> 
> ...
> 
>> As I said the strict mode would only be useful to users of mime4j 
>> wanting to use mime4j as a validator to check RFC compliance. You know, 
>> mime4j born for SMTP, but now you need it for HTTP and someone else may 
>> want to do a validator. So let's not keep our eyes closed once again.
>>
> 
> OK, I fail to see any practical benefit of that aside from a nice warm
> feeling about being 100% compliant, but I admit I am biased.
> 
>>> Anyways, let's talk code now. How about this?
>>>
>>> (1)
>>>
>>> interface LineDelimiterStrategy {
>>>
>>>  boolean isNewLine(char ch1, char ch2) // both can be -1
>>> 	throws MimeException;
>>>
>>> }
>>>
>>> One can provide MimeTokenStream with an implementation of this interface
>>> at the construction time. MimeTokenStream it its turn passes a
>>> reference to that class to all parser components that need to deal with
>>> line delimiters.
>> I'm not sure I understand what are the 2 params passed to isNewLine and 
>> what code will invoke this service.
>>
> 
> 2 consecutive characters read from the data stream or -1 if any of those
> characters is not available. 

So "a\r\nb" would result in the calls:
isNewLine(-1,'a');
isNewLine('a','\r');
isNewLine('\r','\n');
isNewLine('\n','b');
isNewLine('b',-1);
Is this correct? What would an implementation that is fine for HTTP
return for the 5 calls above?

>>> (2) The issue of CR / LF handling in content bodies should be taken of
>>> when formatting output, _not_ when parsing input.
>>>
>>> Would that work for you?
>> I'm not sure this is enough.
>> In output we format what we parser: if we parsed the input as multiple 
>> lines then we output multiple lines, otherwise we output a single line. 
>> So it is during parsing that we have to decide whether an isolated LF is 
>> a newline delimiter or not.
> 
> But mime4j does not parse _content bodies_ as multiple lines, does it?

TextBody.getReader()

> At this point I think I have to give up. Whatever you end up doing
> _please_ do not wrap the raw data stream with EOLConvertingInputStream.

Sure, I already excluded this: I now understand the "C-T-E: binary" issue.
BTW I hope you will keep monitoring this issue so you can confirm 
whatever solution we propose will be fine with your library?

Thank you,
Stefano

