Posted to server-dev@james.apache.org by Stefano Bagnara <ap...@bago.org> on 2008/07/17 16:38:31 UTC

[mime4j] newlines and parsing of nested (encoded) rfc822 messages

I noticed that at some point in the past the EOLConvertingInputStream
was removed from the chain.

I think this creates issues when we parse an input file that has only \n
line endings and write it back to output.

- It seems that most of the parsing code only checks for \n (what
happens when there are only \r instead? What should we do?)

- If the message has only LF line endings, mime4j seems to end up
outputting headers with CRLF and the body with LF.

- If the input message has CR-terminated lines, mime4j does not treat
them as line endings.

IMHO either we accept LF, CR, and CRLF as CRLF or we only accept CRLF.

If we do that we have to take care of encoded nested messages: they
could again contain LF, CR and CRLF, just like the top-level stream.
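
A minimal sketch of what "accept LF, CR and CRLF as CRLF" could look
like. EolNormalizer and toCrlf are hypothetical names used only for
illustration; this is not the EOLConvertingInputStream code and not a
mime4j API. It works on a whole byte array for brevity, while a real
implementation would be a streaming filter.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of mime4j: rewrites every line
// terminator (CRLF, lone CR, lone LF) as CRLF.
class EolNormalizer {
    static byte[] toCrlf(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 16);
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            if (b == '\r') {
                // CR followed by LF is a single terminator: consume the LF too.
                if (i + 1 < in.length && in[i + 1] == '\n') {
                    i++;
                }
                out.write('\r');
                out.write('\n');
            } else if (b == '\n') {
                // Lone LF becomes CRLF.
                out.write('\r');
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

Such a conversion must not be applied to parts whose transfer encoding
is "binary", and a nested rfc822 message would need the same treatment
after its own transfer encoding has been decoded.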


What is the right approach? Should we add an EOLConvertingInputStream
(CONVERT_BOTH) to every level of parsing, or should we fail to parse
messages with bad newlines?

I don't like the current behaviour, where we accept some malformed data
(a lone LF is treated as CRLF by our parser), we change some of it
(the line endings between headers are converted to CRLF) and we still
output malformed data.

Opinions?

Stefano


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org


Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Robert Burrell Donkin wrote:
> On Mon, Jul 21, 2008 at 10:11 AM, Stefano Bagnara <ap...@bago.org> wrote:
>> Robert Burrell Donkin wrote:
>>> On Sun, Jul 20, 2008 at 9:08 PM, Stefano Bagnara <ap...@bago.org> wrote:
>>>> Robert Burrell Donkin wrote:
>>>>> On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <ap...@bago.org>
>>>>> wrote:
>>>>>> Oleg Kalnichevski wrote:
>>>>>>> Robert Burrell Donkin wrote:
>>>>>>>> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>>>>>>>>> Robert Burrell Donkin wrote:
>>>>>>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org>
>>>>>>>>>> wrote:
>>>>>>>>>>> Robert Burrell Donkin wrote:
>>>>> <snip>
>>>>>
>>>>>>>>> 2) ((TextBody) b).getReader(). This give me a reader, so this
>>>>>>>>> support
>>>>>>>>> the "line" concept: I do expect this one to treat "non canonical"
>>>>>>>>> newlines like the header/structure parser: if headers are allowed to
>>>>>>>>> terminate with an isolated LF then also lines in text content should
>>>>>>>>> do
>>>>>>>>> the same (because probably the whole mime message has LF instead of
>>>>>>>>> CRLF). [RFC seems to suggest that the fact is that the MIME message
>>>>>>>>> is
>>>>>>>>> encoded using LF instead of CRLF and that this specific encoding
>>>>>>>>> breaks
>>>>>>>>> binary parts, but we want to be smarter wrt this issue].
>>>>>>>> TextBody is part of the DOM. This can and should be addressed there
>>>>>>>> (rather than in the parser). I think that doing this should satisfy
>>>>>>>> both needs without compromising the performance of the parser.
>>>>>>>>
>>>>>>> If this is indeed something we can all agree on, I can try to solve
>>>>>>> the
>>>>>>> first problem (strict/lenient line delimiter handling) using a
>>>>>>> pluggable
>>>>>>> strategy of some kind.
>>>>>>>
>>>>>>> Oleg
>>>>>> My limited knowledge of mime4j details doesn't let me reply "+1". So I
>>>>>> simply tell what I expect from mime4j as an user:
>>>>> it's important to understand that mime4j targets different kinds of
>>>>> user. the pull parser is a low level application agnostic interface
>>>>> aimed at experts who need performance. the DOM and SAX components are
>>>>> higher level interfaces for less experience users who are willing to
>>>>> compromise flexibility and performance. each user will have different
>>>>> expectations.
>>>>>
>>>>>> Lenient line delimiter parsing:
>>>>>> - consider isolated LF and CR in the mime stream as newlines as long as
>>>>>> a
>>>>>> newline concept exists in that specific place (everywhere but binary
>>>>>> body
>>>>>> parts having ContentTransferEncoding = "binary").
>>>>> the low level interface should allow the user to determine whether
>>>>> they want to canonicalise. the higher level interface should probably
>>>>> canonicalise.
>>>> I have an alternative proposal, see the bottom of this message.
>>>>
>>>>>> - This means that a CR in a base64 stream is a newline, a CR in a
>>>>>> text/plain
>>>>>> is a newline, a "CR<boundary> CR" sequence is a valid multipart
>>>>>> boundary,
>>>>>> "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF"
>>>>>> sequences
>>>>>> are valid separators between header and body because they are
>>>>>> considered
>>>>>> as
>>>>>> equivalent to "CRLFCRLF".
>>>>> i'm not sure i agree (i need to think about this a little more)
>>>> Ok, let me know your doubts as you get them.
>>>>
>>>>>> - THis also means that writing in output this stuff will result in a
>>>>>> mime
>>>>>> stream with NO isolated CRs or LFs (unless they are in a "binary"
>>>>>> encoded
>>>>>> body).
>>>>> i'm happy for the high level DOM API to perform conversions on the
>>>>> streams.
>>>>>
>>>>>> Strict line delimiter parsing (I don't care if we have this now, I just
>>>>>> think we should have this in mind while factoring mime4j because it
>>>>>> should
>>>>>> be possible to implement this with no major changes).
>>>>> this is a non-goal as far as i'm concerned. performant validating
>>>>> parsers tend to be more difficult to write. if a validating engine is
>>>>> needed then i'd prefer to approach the design without preconditions. i
>>>>> need a fast robust parser that is able to cope with practical MIME
>>>>> documents whether they are valid or not.
>>>> Ok, no one seems to care about strict parsing, so let's forget about this
>>>> for now, but please let me understand this:
>>>> I see mime4j already have a strict parsing concept about throwing
>>>> exceptions
>>>> vs monitor calls when it encounter malformed/unexpected content: what is
>>>> the
>>>> rationale for needing the current strict parsing while not needing the
>>>> CRLF
>>>> delimiter strict parsing?
>>>>
>>>>>> - LFs and CRs are not newlines, they are not considered newlines and
>>>>>> results
>>>>>> in errors raised by the parser (invalid header, invalid content, and so
>>>>>> on)
>>>>>> that will result in a parsing failure or (if the raised errors are
>>>>>> ignored)
>>>>>> in invalid DOM (I'm not sure how we currently handle this case for
>>>>>> non-expected 8bit content in an header, but it should be the same).
>>>>>> - writing in output this content should result in a well-formed
>>>>>> content,
>>>>>> so:
>>>>>>  - if an LF in the header is somehow "encodable" as a valid sequence it
>>>>>> should be parsed as LF and then encoded while outputting. If instead an
>>>>>> LF
>>>>>> in the header is not encodable then we should fail parsing or remove it
>>>>>> (or
>>>>>> convert it to "?" or anything similar) if we want to be lenient.
>>>>> i'm happy for this to be added to the high level DOM
>>>>>
>>>>>> I'm not saying that I want mime4j to support all of this before a
>>>>>> release, I
>>>>>> just want to understand if this is what you also expect and if this can
>>>>>> be
>>>>>> considered a common goal.
>>>>> i'm happy to address your concerns by adding conversion code into the
>>>>> higher level API layers but if mime4j seriously needs to compromise
>>>>> the low level API then i'm not sure i can use this library for my mail
>>>>> work either. in this case, i'd be happy to introduce a proposal for a
>>>>> performant low level pull parser for MIME to the commons instead.
>>>> I'm working on a solution having readLine methods not returning the
>>>> newline
>>>> chars so that the user of readLine does not need to care about line
>>>> delimiter.
>>>> This way we can tune the line delimiter inside the
>>>> BufferedLineReaderInputStream and not everywhere else.
>>> users of the low level API may well care about preservation of line
>>> endings
>> What exactly is part of the low level API?
> 
> the pull parser
> 
>> I'm not sure I understand how I preserve line endings in headers in the
>> current implementation.
> 
> the role of the parser is just to detect and parse the headers

Is AbstractEntity part of the pull parser (i.e. the low level API) or
not?
My proposal does not change what you see from outside that class. It
only changes the contract between that class and the
LineReaderInputStream: the line ending stripping has been moved into the
stream's readLine method, while previously it was done in the
AbstractEntity method just after readLine was called.
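
A minimal sketch of the readLine contract described above, where the
reader itself strips the terminator so callers never see it.
LineReaderSketch is a hypothetical class written for illustration; it
does not reproduce the LineReaderInputStream signatures or the patch.
It accepts only CRLF and LF as delimiters, and reads byte by byte for
clarity rather than speed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch, not the mime4j LineReaderInputStream API.
class LineReaderSketch {
    // Returns the next line WITHOUT its terminator (CRLF or LF),
    // or null when the stream is already at its end.
    static byte[] readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                byte[] bytes = line.toByteArray();
                int len = bytes.length;
                // Drop the CR of a CRLF pair; a lone LF has nothing to drop.
                if (len > 0 && bytes[len - 1] == '\r') {
                    len--;
                }
                return Arrays.copyOf(bytes, len);
            }
            line.write(b);
        }
        // Last line without a terminator, or end of stream.
        return sawAny ? line.toByteArray() : null;
    }
}
```

With this shape the caller only tests for an empty line to find the end
of the headers, and the choice of accepted delimiters lives in one
place.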

> AFAIK the DOM based API has never correctly preserved line endings in
> headers. i've though about this on occasion and the conclusion i've
> always reached is that this would be challenging to implement in a
> performant fashion. i think i'd approach design from a different
> direction: insist that the mail was stored on file, then use a memory
> mapped file and nio to avoid double buffering (rather than use the
> pull parser)

For DOM access that sounds like a good plan. Of course the low level or
SAX APIs are much better for filtering streams.

>>>> "Client code" for LineReaderInputStream should use readLine ONLY when
>>>> line
>>>> recognition is needed (as it already happen).
>>>>
>>>> I have already coded a solution doing this (and using only CRLF and LF as
>>>> line delimiters, like the current behaviour).
>>>>
>>>> I'm running a few tests, I'll probably create a JIRA and a proposal
>>>> tomorrow.
>>> this proposal seems likely to reduce the correctness and usefulness of
>>> the low level parser in order to address an issue in the high level
>>> API
>> I'm not sure how can we deal with CR-LF in a consistent way if we don't do
>> this at a low level, but maybe I'm missing something.
> 
> line endings can be handled consistently by canonicalising in the
> higher level APIs. the DOM API is not performant and IMO it would be
> perfectly acceptable to canonicalise line endings at this level.

In my previous message I listed the classes that have to deal with line
endings.
I identified 6 of them: can you help me classify each of the 6 classes
by the level it belongs to (DOM / SAX / low level pull parser)?

> if canonicalisation is forced into the low level API then this would
> make mime4j unsuitable for more general use including some of the mail
> usages i'm interested in. if this is the consensus then i'm very happy
> to simple create a new library that is suitable for more general usage
> (including advanced mail) either here or somewhere else.

My change does not canonicalize anything: I simply changed the contract
of the readLine method. I'm not able to identify a use of the mime4j
library whose behaviour changes after applying my proposed patch.

This does not mean that the patch itself is good, but I don't understand
your argument: my code does not canonicalize anything, it does exactly
what happens now. In fact the expected values of the tests (excluding
the tests that call readLine directly) did not change.

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Mon, Jul 21, 2008 at 10:11 AM, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin wrote:
>>
>> On Sun, Jul 20, 2008 at 9:08 PM, Stefano Bagnara <ap...@bago.org> wrote:
>>>
>>> Robert Burrell Donkin wrote:
>>>>
>>>> On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <ap...@bago.org>
>>>> wrote:
>>>>>
>>>>> Oleg Kalnichevski wrote:
>>>>>>
>>>>>> Robert Burrell Donkin wrote:
>>>>>>>
>>>>>>> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>>>>>>>>
>>>>>>>> Robert Burrell Donkin wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Robert Burrell Donkin wrote:
>>>>
>>>> <snip>
>>>>
>>>>>>>> 2) ((TextBody) b).getReader(). This give me a reader, so this
>>>>>>>> support
>>>>>>>> the "line" concept: I do expect this one to treat "non canonical"
>>>>>>>> newlines like the header/structure parser: if headers are allowed to
>>>>>>>> terminate with an isolated LF then also lines in text content should
>>>>>>>> do
>>>>>>>> the same (because probably the whole mime message has LF instead of
>>>>>>>> CRLF). [RFC seems to suggest that the fact is that the MIME message
>>>>>>>> is
>>>>>>>> encoded using LF instead of CRLF and that this specific encoding
>>>>>>>> breaks
>>>>>>>> binary parts, but we want to be smarter wrt this issue].
>>>>>>>
>>>>>>> TextBody is part of the DOM. This can and should be addressed there
>>>>>>> (rather than in the parser). I think that doing this should satisfy
>>>>>>> both needs without compromising the performance of the parser.
>>>>>>>
>>>>>> If this is indeed something we can all agree on, I can try to solve
>>>>>> the
>>>>>> first problem (strict/lenient line delimiter handling) using a
>>>>>> pluggable
>>>>>> strategy of some kind.
>>>>>>
>>>>>> Oleg
>>>>>
>>>>> My limited knowledge of mime4j details doesn't let me reply "+1". So I
>>>>> simply tell what I expect from mime4j as an user:
>>>>
>>>> it's important to understand that mime4j targets different kinds of
>>>> user. the pull parser is a low level application agnostic interface
>>>> aimed at experts who need performance. the DOM and SAX components are
>>>> higher level interfaces for less experience users who are willing to
>>>> compromise flexibility and performance. each user will have different
>>>> expectations.
>>>>
>>>>> Lenient line delimiter parsing:
>>>>> - consider isolated LF and CR in the mime stream as newlines as long as
>>>>> a
>>>>> newline concept exists in that specific place (everywhere but binary
>>>>> body
>>>>> parts having ContentTransferEncoding = "binary").
>>>>
>>>> the low level interface should allow the user to determine whether
>>>> they want to canonicalise. the higher level interface should probably
>>>> canonicalise.
>>>
>>> I have an alternative proposal, see the bottom of this message.
>>>
>>>>> - This means that a CR in a base64 stream is a newline, a CR in a
>>>>> text/plain
>>>>> is a newline, a "CR<boundary> CR" sequence is a valid multipart
>>>>> boundary,
>>>>> "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF"
>>>>> sequences
>>>>> are valid separators between header and body because they are
>>>>> considered
>>>>> as
>>>>> equivalent to "CRLFCRLF".
>>>>
>>>> i'm not sure i agree (i need to think about this a little more)
>>>
>>> Ok, let me know your doubts as you get them.
>>>
>>>>> - THis also means that writing in output this stuff will result in a
>>>>> mime
>>>>> stream with NO isolated CRs or LFs (unless they are in a "binary"
>>>>> encoded
>>>>> body).
>>>>
>>>> i'm happy for the high level DOM API to perform conversions on the
>>>> streams.
>>>>
>>>>> Strict line delimiter parsing (I don't care if we have this now, I just
>>>>> think we should have this in mind while factoring mime4j because it
>>>>> should
>>>>> be possible to implement this with no major changes).
>>>>
>>>> this is a non-goal as far as i'm concerned. performant validating
>>>> parsers tend to be more difficult to write. if a validating engine is
>>>> needed then i'd prefer to approach the design without preconditions. i
>>>> need a fast robust parser that is able to cope with practical MIME
>>>> documents whether they are valid or not.
>>>
>>> Ok, no one seems to care about strict parsing, so let's forget about this
>>> for now, but please let me understand this:
>>> I see mime4j already have a strict parsing concept about throwing
>>> exceptions
>>> vs monitor calls when it encounter malformed/unexpected content: what is
>>> the
>>> rationale for needing the current strict parsing while not needing the
>>> CRLF
>>> delimiter strict parsing?
>>>
>>>>> - LFs and CRs are not newlines, they are not considered newlines and
>>>>> results
>>>>> in errors raised by the parser (invalid header, invalid content, and so
>>>>> on)
>>>>> that will result in a parsing failure or (if the raised errors are
>>>>> ignored)
>>>>> in invalid DOM (I'm not sure how we currently handle this case for
>>>>> non-expected 8bit content in an header, but it should be the same).
>>>>> - writing in output this content should result in a well-formed
>>>>> content,
>>>>> so:
>>>>>  - if an LF in the header is somehow "encodable" as a valid sequence it
>>>>> should be parsed as LF and then encoded while outputting. If instead an
>>>>> LF
>>>>> in the header is not encodable then we should fail parsing or remove it
>>>>> (or
>>>>> convert it to "?" or anything similar) if we want to be lenient.
>>>>
>>>> i'm happy for this to be added to the high level DOM
>>>>
>>>>> I'm not saying that I want mime4j to support all of this before a
>>>>> release, I
>>>>> just want to understand if this is what you also expect and if this can
>>>>> be
>>>>> considered a common goal.
>>>>
>>>> i'm happy to address your concerns by adding conversion code into the
>>>> higher level API layers but if mime4j seriously needs to compromise
>>>> the low level API then i'm not sure i can use this library for my mail
>>>> work either. in this case, i'd be happy to introduce a proposal for a
>>>> performant low level pull parser for MIME to the commons instead.
>>>
>>> I'm working on a solution having readLine methods not returning the
>>> newline
>>> chars so that the user of readLine does not need to care about line
>>> delimiter.
>>> This way we can tune the line delimiter inside the
>>> BufferedLineReaderInputStream and not everywhere else.
>>
>> users of the low level API may well care about preservation of line
>> endings
>
> What exactly is part of the low level API?

the pull parser

> I'm not sure I understand how I preserve line endings in headers in the
> current implementation.

the role of the parser is just to detect and parse the headers

AFAIK the DOM based API has never correctly preserved line endings in
headers. i've thought about this on occasion and the conclusion i've
always reached is that this would be challenging to implement in a
performant fashion. i think i'd approach the design from a different
direction: insist that the mail is stored in a file, then use a memory
mapped file and nio to avoid double buffering (rather than use the
pull parser)
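
a rough sketch of the memory mapped idea, using only standard java.nio
calls. MappedMailSketch and bodyOffset are hypothetical names, nothing
like this exists in mime4j, and it looks only for a canonical CRLFCRLF
separator. because the bytes are read straight from the mapping, the
original line endings are never copied or rewritten.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not part of mime4j.
class MappedMailSketch {
    // Returns the offset of the first body byte, i.e. the byte just after
    // the first CRLFCRLF, or -1 if no such separator exists.
    // Assumes the file is smaller than 2 GB (one mapping).
    static int bodyOffset(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Absolute gets: no intermediate copy of the message is made.
            for (int i = 0; i + 3 < buf.limit(); i++) {
                if (buf.get(i) == '\r' && buf.get(i + 1) == '\n'
                        && buf.get(i + 2) == '\r' && buf.get(i + 3) == '\n') {
                    return i + 4;
                }
            }
            return -1;
        }
    }
}
```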

>>> "Client code" for LineReaderInputStream should use readLine ONLY when
>>> line
>>> recognition is needed (as it already happen).
>>>
>>> I have already coded a solution doing this (and using only CRLF and LF as
>>> line delimiters, like the current behaviour).
>>>
>>> I'm running a few tests, I'll probably create a JIRA and a proposal
>>> tomorrow.
>>
>> this proposal seems likely to reduce the correctness and usefulness of
>> the low level parser in order to address an issue in the high level
>> API
>
> I'm not sure how can we deal with CR-LF in a consistent way if we don't do
> this at a low level, but maybe I'm missing something.

line endings can be handled consistently by canonicalising in the
higher level APIs. the DOM API is not performant and IMO it would be
perfectly acceptable to canonicalise line endings at this level.

if canonicalisation is forced into the low level API then this would
make mime4j unsuitable for more general use including some of the mail
usages i'm interested in. if this is the consensus then i'm very happy
to simply create a new library that is suitable for more general usage
(including advanced mail) either here or somewhere else.

- robert



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Robert Burrell Donkin wrote:
> On Sun, Jul 20, 2008 at 9:08 PM, Stefano Bagnara <ap...@bago.org> wrote:
>> Robert Burrell Donkin wrote:
>>> On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <ap...@bago.org> wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> Robert Burrell Donkin wrote:
>>>>>> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>>>>>>> Robert Burrell Donkin wrote:
>>>>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org>
>>>>>>>> wrote:
>>>>>>>>> Robert Burrell Donkin wrote:
>>> <snip>
>>>
>>>>>>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
>>>>>>> the "line" concept: I do expect this one to treat "non canonical"
>>>>>>> newlines like the header/structure parser: if headers are allowed to
>>>>>>> terminate with an isolated LF then also lines in text content should
>>>>>>> do
>>>>>>> the same (because probably the whole mime message has LF instead of
>>>>>>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>>>>>>> encoded using LF instead of CRLF and that this specific encoding
>>>>>>> breaks
>>>>>>> binary parts, but we want to be smarter wrt this issue].
>>>>>> TextBody is part of the DOM. This can and should be addressed there
>>>>>> (rather than in the parser). I think that doing this should satisfy
>>>>>> both needs without compromising the performance of the parser.
>>>>>>
>>>>> If this is indeed something we can all agree on, I can try to solve the
>>>>> first problem (strict/lenient line delimiter handling) using a pluggable
>>>>> strategy of some kind.
>>>>>
>>>>> Oleg
>>>> My limited knowledge of mime4j details doesn't let me reply "+1". So I
>>>> simply tell what I expect from mime4j as an user:
>>> it's important to understand that mime4j targets different kinds of
>>> user. the pull parser is a low level application agnostic interface
>>> aimed at experts who need performance. the DOM and SAX components are
>>> higher level interfaces for less experience users who are willing to
>>> compromise flexibility and performance. each user will have different
>>> expectations.
>>>
>>>> Lenient line delimiter parsing:
>>>> - consider isolated LF and CR in the mime stream as newlines as long as a
>>>> newline concept exists in that specific place (everywhere but binary body
>>>> parts having ContentTransferEncoding = "binary").
>>> the low level interface should allow the user to determine whether
>>> they want to canonicalise. the higher level interface should probably
>>> canonicalise.
>> I have an alternative proposal, see the bottom of this message.
>>
>>>> - This means that a CR in a base64 stream is a newline, a CR in a
>>>> text/plain
>>>> is a newline, a "CR<boundary> CR" sequence is a valid multipart boundary,
>>>> "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF"
>>>> sequences
>>>> are valid separators between header and body because they are considered
>>>> as
>>>> equivalent to "CRLFCRLF".
>>> i'm not sure i agree (i need to think about this a little more)
>> Ok, let me know your doubts as you get them.
>>
>>>> - THis also means that writing in output this stuff will result in a mime
>>>> stream with NO isolated CRs or LFs (unless they are in a "binary" encoded
>>>> body).
>>> i'm happy for the high level DOM API to perform conversions on the
>>> streams.
>>>
>>>> Strict line delimiter parsing (I don't care if we have this now, I just
>>>> think we should have this in mind while factoring mime4j because it
>>>> should
>>>> be possible to implement this with no major changes).
>>> this is a non-goal as far as i'm concerned. performant validating
>>> parsers tend to be more difficult to write. if a validating engine is
>>> needed then i'd prefer to approach the design without preconditions. i
>>> need a fast robust parser that is able to cope with practical MIME
>>> documents whether they are valid or not.
>> Ok, no one seems to care about strict parsing, so let's forget about this
>> for now, but please let me understand this:
>> I see mime4j already have a strict parsing concept about throwing exceptions
>> vs monitor calls when it encounter malformed/unexpected content: what is the
>> rationale for needing the current strict parsing while not needing the CRLF
>> delimiter strict parsing?
>>
>>>> - LFs and CRs are not newlines, they are not considered newlines and
>>>> results
>>>> in errors raised by the parser (invalid header, invalid content, and so
>>>> on)
>>>> that will result in a parsing failure or (if the raised errors are
>>>> ignored)
>>>> in invalid DOM (I'm not sure how we currently handle this case for
>>>> non-expected 8bit content in an header, but it should be the same).
>>>> - writing in output this content should result in a well-formed content,
>>>> so:
>>>>  - if an LF in the header is somehow "encodable" as a valid sequence it
>>>> should be parsed as LF and then encoded while outputting. If instead an
>>>> LF
>>>> in the header is not encodable then we should fail parsing or remove it
>>>> (or
>>>> convert it to "?" or anything similar) if we want to be lenient.
>>> i'm happy for this to be added to the high level DOM
>>>
>>>> I'm not saying that I want mime4j to support all of this before a
>>>> release, I
>>>> just want to understand if this is what you also expect and if this can
>>>> be
>>>> considered a common goal.
>>> i'm happy to address your concerns by adding conversion code into the
>>> higher level API layers but if mime4j seriously needs to compromise
>>> the low level API then i'm not sure i can use this library for my mail
>>> work either. in this case, i'd be happy to introduce a proposal for a
>>> performant low level pull parser for MIME to the commons instead.
>> I'm working on a solution having readLine methods not returning the newline
>> chars so that the user of readLine does not need to care about line
>> delimiter.
>> This way we can tune the line delimiter inside the
>> BufferedLineReaderInputStream and not everywhere else.
> 
> users of the low level API may well care about preservation of line endings

What exactly is part of the low level API?
I'm not sure I understand how I would preserve line endings in headers
with the current implementation.

>> "Client code" for LineReaderInputStream should use readLine ONLY when line
>> recognition is needed (as it already happen).
>>
>> I have already coded a solution doing this (and using only CRLF and LF as
>> line delimiters, like the current behaviour).
>>
>> I'm running a few tests, I'll probably create a JIRA and a proposal
>> tomorrow.
> 
> this proposal seems likely to reduce the correctness and usefulness of
> the low level parser in order to address an issue in the high level
> API

I'm not sure how we can deal with CR/LF in a consistent way if we don't
do this at a low level, but maybe I'm missing something.
What do you propose? Should every component in mime4j have the ability
to deal with malformed line endings? Should we make each of them
configurable?

ATM I count these places where we deal with newlines:
1) RootInputStream: counts lines only when CRLF is found.
2) BufferedLineReaderInputStream: readLine returns byte sequences ending
with LF.
3) AbstractEntity: during header parsing a trailing LF or CRLF is
stripped from each line.
4) QuotedPrintableInputStream: deals with isolated \r and \n (I'm not
sure HOW it treats them) and logs a warning each time one is found
isolated.
5) MimeUtil.getHeaderParams: treats isolated CR and isolated LF as
newlines and ignores them (so that a lone CR is not returned in a
header value).
6) MimeBoundaryInputStream.calculateBoundaryLen: seems to always strip a
trailing \n, and to strip a \r only if it comes before a \n or if the
boundary length is one more char (BUG? it doesn't recalculate len after
the first if).
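
To make the inconsistency concrete, here is a small sketch contrasting
the CRLF-only line counting of item 1 with a lenient count that treats
CRLF, lone CR and lone LF as one terminator each. LineCountSketch and
both method names are hypothetical and written for illustration; they
are not the RootInputStream code.

```java
// Hypothetical sketch, not part of mime4j.
class LineCountSketch {
    // Counts a line end only for a CR immediately followed by LF.
    static int countCrlfOnly(byte[] data) {
        int count = 0;
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == '\r' && data[i + 1] == '\n') {
                count++;
                i++; // skip the LF of the pair
            }
        }
        return count;
    }

    // Counts CRLF, lone CR and lone LF as one line end each.
    static int countLenient(byte[] data) {
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                count++;
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    i++; // CRLF is a single terminator
                }
            } else if (data[i] == '\n') {
                count++;
            }
        }
        return count;
    }
}
```

For the input "a\nb\rc\r\n" the first method reports 1 line end and the
second reports 3, so two components following different rules would
disagree about line numbers for the same message.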

I don't know if in the javacc/jjtree code we have more CR/LF logic.

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Sun, Jul 20, 2008 at 9:08 PM, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin wrote:
>>
>> On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <ap...@bago.org> wrote:
>>>
>>> Oleg Kalnichevski wrote:
>>>>
>>>> Robert Burrell Donkin wrote:
>>>>>
>>>>> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>>>>>>
>>>>>> Robert Burrell Donkin wrote:
>>>>>>>
>>>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Robert Burrell Donkin wrote:
>>
>> <snip>
>>
>>>>>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
>>>>>> the "line" concept: I do expect this one to treat "non canonical"
>>>>>> newlines like the header/structure parser: if headers are allowed to
>>>>>> terminate with an isolated LF then also lines in text content should
>>>>>> do
>>>>>> the same (because probably the whole mime message has LF instead of
>>>>>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>>>>>> encoded using LF instead of CRLF and that this specific encoding
>>>>>> breaks
>>>>>> binary parts, but we want to be smarter wrt this issue].
>>>>>
>>>>> TextBody is part of the DOM. This can and should be addressed there
>>>>> (rather than in the parser). I think that doing this should satisfy
>>>>> both needs without compromising the performance of the parser.
>>>>>
>>>> If this is indeed something we can all agree on, I can try to solve the
>>>> first problem (strict/lenient line delimiter handling) using a pluggable
>>>> strategy of some kind.
>>>>
>>>> Oleg
>>>
>>> My limited knowledge of mime4j details doesn't let me reply "+1". So I
>>> simply tell what I expect from mime4j as an user:
>>
>> it's important to understand that mime4j targets different kinds of
>> user. the pull parser is a low level application agnostic interface
>> aimed at experts who need performance. the DOM and SAX components are
>> higher level interfaces for less experience users who are willing to
>> compromise flexibility and performance. each user will have different
>> expectations.
>>
>>> Lenient line delimiter parsing:
>>> - consider isolated LF and CR in the mime stream as newlines as long as a
>>> newline concept exists in that specific place (everywhere but binary body
>>> parts having ContentTransferEncoding = "binary").
>>
>> the low level interface should allow the user to determine whether
>> they want to canonicalise. the higher level interface should probably
>> canonicalise.
>
> I have an alternative proposal, see the bottom of this message.
>
>>> - This means that a CR in a base64 stream is a newline, a CR in a
>>> text/plain
>>> is a newline, a "CR<boundary> CR" sequence is a valid multipart boundary,
>>> "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF"
>>> sequences
>>> are valid separators between header and body because they are considered
>>> as
>>> equivalent to "CRLFCRLF".
>>
>> i'm not sure i agree (i need to think about this a little more)
>
> Ok, let me know your doubts as you get them.
>
>>> - THis also means that writing in output this stuff will result in a mime
>>> stream with NO isolated CRs or LFs (unless they are in a "binary" encoded
>>> body).
>>
>> i'm happy for the high level DOM API to perform conversions on the
>> streams.
>>
>>> Strict line delimiter parsing (I don't care if we have this now, I just
>>> think we should have this in mind while factoring mime4j because it
>>> should
>>> be possible to implement this with no major changes).
>>
>> this is a non-goal as far as i'm concerned. performant validating
>> parsers tend to be more difficult to write. if a validating engine is
>> needed then i'd prefer to approach the design without preconditions. i
>> need a fast robust parser that is able to cope with practical MIME
>> documents whether they are valid or not.
>
> Ok, no one seems to care about strict parsing, so let's forget about this
> for now, but please let me understand this:
> I see mime4j already have a strict parsing concept about throwing exceptions
> vs monitor calls when it encounter malformed/unexpected content: what is the
> rationale for needing the current strict parsing while not needing the CRLF
> delimiter strict parsing?
>
>>> - LFs and CRs are not newlines, they are not considered newlines and
>>> results
>>> in errors raised by the parser (invalid header, invalid content, and so
>>> on)
>>> that will result in a parsing failure or (if the raised errors are
>>> ignored)
>>> in invalid DOM (I'm not sure how we currently handle this case for
>>> non-expected 8bit content in an header, but it should be the same).
>>> - writing in output this content should result in a well-formed content,
>>> so:
>>>  - if an LF in the header is somehow "encodable" as a valid sequence it
>>> should be parsed as LF and then encoded while outputting. If instead an
>>> LF
>>> in the header is not encodable then we should fail parsing or remove it
>>> (or
>>> convert it to "?" or anything similar) if we want to be lenient.
>>
>> i'm happy for this to be added to the high level DOM
>>
>>> I'm not saying that I want mime4j to support all of this before a
>>> release, I
>>> just want to understand if this is what you also expect and if this can
>>> be
>>> considered a common goal.
>>
>> i'm happy to address your concerns by adding conversion code into the
>> higher level API layers but if mime4j seriously needs to compromise
>> the low level API then i'm not sure i can use this library for my mail
>> work either. in this case, i'd be happy to introduce a proposal for a
>> performant low level pull parser for MIME to the commons instead.
>
> I'm working on a solution having readLine methods not returning the newline
> chars so that the user of readLine does not need to care about line
> delimiter.
> This way we can tune the line delimiter inside the
> BufferedLineReaderInputStream and not everywhere else.

users of the low level API may well care about preservation of line endings

> "Client code" for LineReaderInputStream should use readLine ONLY when line
> recognition is needed (as it already happen).
>
> I have already coded a solution doing this (and using only CRLF and LF as
> line delimiters, like the current behaviour).
>
> I'm running a few tests, I'll probably create a JIRA and a proposal
> tomorrow.

this proposal seems likely to reduce the correctness and usefulness of
the low level parser in order to address an issue in the high level
API

- robert



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Robert Burrell Donkin ha scritto:
> On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <ap...@bago.org> wrote:
>> Oleg Kalnichevski ha scritto:
>>> Robert Burrell Donkin wrote:
>>>> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>>>>> Robert Burrell Donkin ha scritto:
>>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org>
>>>>>> wrote:
>>>>>>> Robert Burrell Donkin ha scritto:
> 
> <snip>
> 
>>>>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
>>>>> the "line" concept: I do expect this one to treat "non canonical"
>>>>> newlines like the header/structure parser: if headers are allowed to
>>>>> terminate with an isolated LF then also lines in text content should do
>>>>> the same (because probably the whole mime message has LF instead of
>>>>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>>>>> encoded using LF instead of CRLF and that this specific encoding breaks
>>>>> binary parts, but we want to be smarter wrt this issue].
>>>> TextBody is part of the DOM. This can and should be addressed there
>>>> (rather than in the parser). I think that doing this should satisfy
>>>> both needs without compromising the performance of the parser.
>>>>
>>> If this is indeed something we can all agree on, I can try to solve the
>>> first problem (strict/lenient line delimiter handling) using a pluggable
>>> strategy of some kind.
>>>
>>> Oleg
>> My limited knowledge of mime4j details doesn't let me reply "+1". So I
>> simply tell what I expect from mime4j as a user:
> 
> it's important to understand that mime4j targets different kinds of
> user. the pull parser is a low level application agnostic interface
> aimed at experts who need performance. the DOM and SAX components are
> higher level interfaces for less experienced users who are willing to
> compromise flexibility and performance. each user will have different
> expectations.
> 
>> Lenient line delimiter parsing:
>> - consider isolated LF and CR in the mime stream as newlines as long as a
>> newline concept exists in that specific place (everywhere but binary body
>> parts having ContentTransferEncoding = "binary").
> 
> the low level interface should allow the user to determine whether
> they want to canonicalise. the higher level interface should probably
> canonicalise.

I have an alternative proposal, see the bottom of this message.

>> - This means that a CR in a base64 stream is a newline, a CR in a text/plain
>> is a newline, a "CR<boundary> CR" sequence is a valid multipart boundary,
>> "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF" sequences
>> are valid separators between header and body because they are considered as
>> equivalent to "CRLFCRLF".
> 
> i'm not sure i agree (i need to think about this a little more)

Ok, let me know your doubts as you get them.

>> - This also means that writing in output this stuff will result in a mime
>> stream with NO isolated CRs or LFs (unless they are in a "binary" encoded
>> body).
> 
> i'm happy for the high level DOM API to perform conversions on the streams.
> 
>> Strict line delimiter parsing (I don't care if we have this now, I just
>> think we should have this in mind while factoring mime4j because it should
>> be possible to implement this with no major changes).
> 
> this is a non-goal as far as i'm concerned. performant validating
> parsers tend to be more difficult to write. if a validating engine is
> needed then i'd prefer to approach the design without preconditions. i
> need a fast robust parser that is able to cope with practical MIME
> documents whether they are valid or not.

Ok, no one seems to care about strict parsing, so let's forget about
this for now, but please help me understand this:
I see mime4j already has a strict parsing concept (throwing
exceptions vs. monitor calls) for when it encounters malformed/unexpected
content: what is the rationale for needing the current strict parsing
while not needing strict CRLF delimiter parsing?

>> - LFs and CRs are not newlines, they are not considered newlines and results
>> in errors raised by the parser (invalid header, invalid content, and so on)
>> that will result in a parsing failure or (if the raised errors are ignored)
>> in invalid DOM (I'm not sure how we currently handle this case for
>> non-expected 8bit content in an header, but it should be the same).
>> - writing in output this content should result in a well-formed content, so:
>>  - if an LF in the header is somehow "encodable" as a valid sequence it
>> should be parsed as LF and then encoded while outputting. If instead an LF
>> in the header is not encodable then we should fail parsing or remove it (or
>> convert it to "?" or anything similar) if we want to be lenient.
> 
> i'm happy for this to be added to the high level DOM
> 
>> I'm not saying that I want mime4j to support all of this before a release, I
>> just want to understand if this is what you also expect and if this can be
>> considered a common goal.
> 
> i'm happy to address your concerns by adding conversion code into the
> higher level API layers but if mime4j seriously needs to compromise
> the low level API then i'm not sure i can use this library for my mail
> work either. in this case, i'd be happy to introduce a proposal for a
> performant low level pull parser for MIME to the commons instead.

I'm working on a solution in which the readLine methods do not return
the newline chars, so that the user of readLine does not need to care
about the line delimiter.
This way we can tune the line delimiter handling inside the
BufferedLineReaderInputStream and nowhere else.

"Client code" for LineReaderInputStream should use readLine ONLY when
line recognition is needed (as already happens).

I have already coded a solution doing this (and using only CRLF and LF 
as line delimiters, like the current behaviour).
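
To make the idea concrete, here is a minimal sketch of a readLine that
strips the terminator. It is only an illustration under assumptions: the
class LineReaderSketch and its method are hypothetical names, not the
actual mime4j BufferedLineReaderInputStream code, and it reads one byte at
a time without any attention to performance. CRLF and bare LF end a line;
a bare CR is kept as ordinary data, matching the current behaviour
described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper for illustration; not part of mime4j. */
final class LineReaderSketch {
    private LineReaderSketch() {}

    /**
     * Reads one line and returns its bytes WITHOUT the line terminator.
     * CRLF and bare LF terminate a line; a bare CR stays in the data.
     * Returns null when the stream is already at its end.
     */
    static byte[] readLine(InputStream in) {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean pendingCr = false; // a CR was seen but not yet classified
        boolean sawAny = false;    // at least one byte was consumed
        try {
            int b;
            while ((b = in.read()) != -1) {
                sawAny = true;
                if (b == '\n') {
                    // LF ends the line; a pending CR belonged to a CRLF pair
                    return line.toByteArray();
                }
                if (pendingCr) {
                    line.write('\r'); // the previous CR was bare: keep it
                }
                pendingCr = (b == '\r');
                if (!pendingCr) {
                    line.write(b);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (pendingCr) {
            line.write('\r'); // trailing bare CR at end of stream
        }
        return sawAny ? line.toByteArray() : null;
    }
}
```

With this shape, the only place that knows which byte sequences count as
a delimiter is the readLine body, so switching to a different policy
(for example also accepting a bare CR) would be a local change.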

I'm running a few tests, I'll probably create a JIRA and a proposal 
tomorrow.

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <ap...@bago.org> wrote:
> Oleg Kalnichevski ha scritto:
>>
>> Robert Burrell Donkin wrote:
>>>
>>> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>>>>
>>>> Robert Burrell Donkin ha scritto:
>>>>>
>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org>
>>>>> wrote:
>>>>>>
>>>>>> Robert Burrell Donkin ha scritto:

<snip>

>>>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
>>>> the "line" concept: I do expect this one to treat "non canonical"
>>>> newlines like the header/structure parser: if headers are allowed to
>>>> terminate with an isolated LF then also lines in text content should do
>>>> the same (because probably the whole mime message has LF instead of
>>>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>>>> encoded using LF instead of CRLF and that this specific encoding breaks
>>>> binary parts, but we want to be smarter wrt this issue].
>>>
>>> TextBody is part of the DOM. This can and should be addressed there
>>> (rather than in the parser). I think that doing this should satisfy
>>> both needs without compromising the performance of the parser.
>>>
>>
>> If this is indeed something we can all agree on, I can try to solve the
>> first problem (strict/lenient line delimiter handling) using a pluggable
>> strategy of some kind.
>>
>> Oleg
>
> My limited knowledge of mime4j details doesn't let me reply "+1". So I
> simply tell what I expect from mime4j as a user:

it's important to understand that mime4j targets different kinds of
user. the pull parser is a low level application agnostic interface
aimed at experts who need performance. the DOM and SAX components are
higher level interfaces for less experienced users who are willing to
compromise flexibility and performance. each user will have different
expectations.

> Lenient line delimiter parsing:
> - consider isolated LF and CR in the mime stream as newlines as long as a
> newline concept exists in that specific place (everywhere but binary body
> parts having ContentTransferEncoding = "binary").

the low level interface should allow the user to determine whether
they want to canonicalise. the higher level interface should probably
canonicalise.

> - This means that a CR in a base64 stream is a newline, a CR in a text/plain
> is a newline, a "CR<boundary> CR" sequence is a valid multipart boundary,
> "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF" sequences
> are valid separators between header and body because they are considered as
> equivalent to "CRLFCRLF".

i'm not sure i agree (i need to think about this a little more)
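
For illustration only, the lenient rule quoted above (any of LF, CR or
CRLF counts as one line break) could be sketched as a small
canonicalising function. The class LenientEolSketch and the method toCrlf
are hypothetical names and not mime4j API; the sketch works on a whole
byte array for brevity and must not be applied to bodies whose
Content-Transfer-Encoding is "binary".

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper for illustration; not part of mime4j. */
final class LenientEolSketch {
    private LenientEolSketch() {}

    /** Rewrites every CRLF, bare CR and bare LF as a single CRLF. */
    static byte[] toCrlf(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length + 16);
        for (int i = 0; i < input.length; i++) {
            byte b = input[i];
            if (b == '\r') {
                // CR optionally followed by LF is ONE line break
                if (i + 1 < input.length && input[i + 1] == '\n') {
                    i++;
                }
                out.write('\r');
                out.write('\n');
            } else if (b == '\n') {
                // bare LF (an LF after CR was consumed above)
                out.write('\r');
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

Under this rule the sequences "CRLFCR", "CRCRLF", "LFCRLF", "LFCR",
"CRCR" and "LFLF" all canonicalise to "CRLFCRLF", which is why they would
all be accepted as the separator between header and body.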

> - This also means that writing in output this stuff will result in a mime
> stream with NO isolated CRs or LFs (unless they are in a "binary" encoded
> body).

i'm happy for the high level DOM API to perform conversions on the streams.

> Strict line delimiter parsing (I don't care if we have this now, I just
> think we should have this in mind while factoring mime4j because it should
> be possible to implement this with no major changes).

this is a non-goal as far as i'm concerned. performant validating
parsers tend to be more difficult to write. if a validating engine is
needed then i'd prefer to approach the design without preconditions. i
need a fast robust parser that is able to cope with practical MIME
documents whether they are valid or not.

> - LFs and CRs are not newlines, they are not considered newlines and results
> in errors raised by the parser (invalid header, invalid content, and so on)
> that will result in a parsing failure or (if the raised errors are ignored)
> in invalid DOM (I'm not sure how we currently handle this case for
> non-expected 8bit content in an header, but it should be the same).
> - writing in output this content should result in a well-formed content, so:
>  - if an LF in the header is somehow "encodable" as a valid sequence it
> should be parsed as LF and then encoded while outputting. If instead an LF
> in the header is not encodable then we should fail parsing or remove it (or
> convert it to "?" or anything similar) if we want to be lenient.

i'm happy for this to be added to the high level DOM
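
As a rough illustration of the strict mode described in the quoted text,
the check below reports the first CR or LF that is not part of a CRLF
pair, so that a caller could raise an error or notify a monitor. The
class StrictEolSketch and the method firstBareEol are hypothetical names
and not mime4j API; how the error is surfaced is left open.

```java
/** Hypothetical helper for illustration; not part of mime4j. */
final class StrictEolSketch {
    private StrictEolSketch() {}

    /**
     * Returns the offset of the first bare CR or bare LF in data,
     * or -1 when every CR and LF is part of a CRLF pair.
     */
    static int firstBareEol(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    i++; // well-formed CRLF: skip the LF as well
                    continue;
                }
                return i; // CR not followed by LF
            }
            if (data[i] == '\n') {
                return i; // LF not preceded by CR
            }
        }
        return -1;
    }
}
```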

> I'm not saying that I want mime4j to support all of this before a release, I
> just want to understand if this is what you also expect and if this can be
> considered a common goal.

i'm happy to address your concerns by adding conversion code into the
higher level API layers but if mime4j seriously needs to compromise
the low level API then i'm not sure i can use this library for my mail
work either. in this case, i'd be happy to introduce a proposal for a
performant low level pull parser for MIME to the commons instead.

- robert



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski ha scritto:
> Stefano Bagnara wrote:
>> Oleg Kalnichevski ha scritto:
>>> On Sat, 2008-07-19 at 18:29 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski ha scritto:
>>> ...
>>>
> 
> ...
> 
>>>
>>> <rant disclaimer="please ignore">
>>>
>>> HttpComponents project chose to depend on mime4j instead of developing a
>>> similar solution because we thought it was the right thing to do. We
>>> thought we should rather contribute to an existing project instead of
>>> pursuing competing efforts for which we have neither resources nor the
>>> right expertise.
>>> As a result we had to delay the next release of HttpClient by almost two
>>> months waiting for a mime4j release. I do not see a point in waiting any
>>> longer. I see no other way but dropping dependency on mime4j, at least
>>> temporarily.
>>> I did my very best to resolve an old problem no one seemed eager to work
>>> on for a year and a half. I will happily continue to contribute to this
>>> project to my best abilities, but in this particular case I see no
>>> justification to investing any more time in trying to satisfy someone
>>> else's wish list.
>>> 
>>> </rant>
>>
>> Please note that I never expected you to satisfy any wishlist. I
>> simply opened the project and tested some code and I found issues. I
>> simply reported issues. I also tried to help with the repackaging
>> because it is something that I always found interesting (I also wrote
>> a tool that tries to classify packages automatically based on
>> dependencies and some metrics).
>>
>> I'm not asking you (and I never did) to solve any of the mime4j
>> issues (the last sentence of mine that you quoted is "I'm not saying
>> that I want mime4j to support all of this before a release, I just
>> want to understand if this is what you also expect and if this can
>> be considered a common goal."): I'm not sure I understand your rant
>> and why you think I'm (or someone else is) asking you to do
>> something. In fact we even voted to make you a committer on this
>> project so that you could work on the code without waiting for our
>> limited time to review/apply patches.
>>
>> Most of the issues I opened against Mime4j are simply there because
>> they need attention: there is no need to solve them in order to make a
>> release.
>>
>> Some other issues do require attention (e.g. the quoted-printable
>> content no longer being decoded), but this is not something we are
>> asking you to solve. I'm not sure when I'll find the time but I plan
>> to try to understand when this was broken and how to fix it.
>>
>> Infinite loops and other OOM issues are there: I think I'm not the
>> cause of them, and I understand that it is frustrating for you to find
>> critical issues in a library you introduced in your component, but
>> please think twice about this.
>>
>> Most of these issues are regressions against old mime4j versions and
>> Niklas did a good job in giving trunk a go and testing it in his
>> environment.
>>
>> I hope you understand I'm being constructive in this discussion and
>> I'm simply trying to understand the common goal so that we don't break
>> each other's work.
>>
> 
> (1) Replacing a simple algorithm with a much more complex one is usually 
> bound to produce regressions.

Sure.

> (2) when the refactoring was completed _all_ existing tests passed. If 
> there had been more tests I would have ensured they all passed as well. 
> OOM issues were present in the old implementation. Handling of binary 
> content was broken before me. It is somewhat unfair to blame me for 
> breaking functionality that was not covered by test cases.

No one blamed you for these regressions or for the missing
tests. We are here to collaborate and make things better. No one blamed
you for the OOM or the infinite loops.
I simply reported them to JIRA as I found them. I think it is better to
have open JIRAs than to ignore issues.
Are you blaming me because I opened JIRA issues?

> (3) I was quietly working on fixing reported issues and regressions 
> until we got stuck in this 'discussion' about line delimiters. It 
> dragged on for days without producing any practical outcome and was 
> mostly about pointless stuff.

No one ever asked you to implement any specific line delimiter handling.
In fact I'm willing to code the solution myself once it is clear what
the correct approach is for a library like this and what kind of options
we need to make it simple but as flexible as needed.
If you are willing to participate in this discussion you are very
welcome, but if this discussion scares you, simply ignore it and limit
yourself to describing the behaviour you need in HttpComponents. I don't
really expect you to accommodate any request from me (or us) or to
satisfy our wishlist. I never did.

> I will happily continue fixing known regressions, but I feel there is 
> nothing else I could contribute to this issue. The best thing to do now 
> is for me to step aside and let you implement whatever solution you feel 
> appropriate.

This sounds acceptable. You have now described the HTTP needs (but to
understand them correctly we had to have this discussion; otherwise it
was not clear to me, and probably to others, that you need LF parsed as
a newline and CR ignored, and that no other option is valid), so we can
continue this discussion (with or without you) and keep your requirement
in consideration.

Please just tell me if you are working on any open JIRA so that we
don't work on the same issue.

>> That's fine. Just let me understand: you wouldn't like to have mime4j 
>> parsing also CR as newline, right?
> 
> I personally would prefer to be able to configure mime4j to do that, 
> because this would be consistent with the behavior of HTTP components, 
> but can live with CR treated as newlines.
> 
>> What do you expect from mime4j when a
>> CR is found around the mime stream?
> 
> Ignore it

Thank you,
Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
Stefano Bagnara wrote:
> Oleg Kalnichevski ha scritto:
>> On Sat, 2008-07-19 at 18:29 +0200, Stefano Bagnara wrote:
>>> Oleg Kalnichevski ha scritto:
>> ...
>>

...

>>
>> <rant disclaimer="please ignore">
>>
>> HttpComponents project chose to depend on mime4j instead of developing a
>> similar solution because we thought it was the right thing to do. We
>> thought we should rather contribute to an existing project instead of
>> pursuing competing efforts for which we have neither resources nor the
>> right expertise.
>> As a result we had to delay the next release of HttpClient by almost two
>> months waiting for a mime4j release. I do not see a point in waiting any
>> longer. I see no other way but dropping dependency on mime4j, at least
>> temporarily.
>> I did my very best to resolve an old problem no one seemed eager to work
>> on for a year and a half. I will happily continue to contribute to this
>> project to my best abilities, but in this particular case I see no
>> justification to investing any more time in trying to satisfy someone
>> else's wish list.
>> 
>> </rant>
> 
> Please note that I never expected you to satisfy any wishlist. I simply
> opened the project and tested some code and I found issues. I simply
> reported issues. I also tried to help with the repackaging because it is
> something that I always found interesting (I also wrote a tool that
> tries to classify packages automatically based on dependencies and some
> metrics).
>
> I'm not asking you (and I never did) to solve any of the mime4j issues
> (the last sentence of mine that you quoted is "I'm not saying that I
> want mime4j to support all of this before a release, I just want to
> understand if this is what you also expect and if this can be
> considered a common goal."): I'm not sure I understand your rant and
> why you think I'm (or someone else is) asking you to do something. In
> fact we even voted to make you a committer on this project so that you
> could work on the code without waiting for our limited time to
> review/apply patches.
>
> Most of the issues I opened against Mime4j are simply there because they
> need attention: there is no need to solve them in order to make a release.
>
> Some other issues do require attention (e.g. the quoted-printable
> content no longer being decoded), but this is not something we are
> asking you to solve. I'm not sure when I'll find the time but I plan to
> try to understand when this was broken and how to fix it.
>
> Infinite loops and other OOM issues are there: I think I'm not the cause
> of them, and I understand that it is frustrating for you to find
> critical issues in a library you introduced in your component, but
> please think twice about this.
>
> Most of these issues are regressions against old mime4j versions and
> Niklas did a good job in giving trunk a go and testing it in his
> environment.
>
> I hope you understand I'm being constructive in this discussion and I'm
> simply trying to understand the common goal so that we don't break each
> other's work.
> 

(1) Replacing a simple algorithm with a much more complex one is usually 
bound to produce regressions.

(2) When the refactoring was completed _all_ existing tests passed. If
there had been more tests I would have ensured they all passed as well. 
OOM issues were present in the old implementation. Handling of binary 
content was broken before me. It is somewhat unfair to blame me for 
breaking functionality that was not covered by test cases.

(3) I was quietly working on fixing reported issues and regressions 
until we got stuck in this 'discussion' about line delimiters. It 
dragged on for days without producing any practical outcome and was 
mostly about pointless stuff.

I will happily continue fixing known regressions, but I feel there is 
nothing else I could contribute to this issue. The best thing to do now 
is for me to step aside and let you implement whatever solution you feel 
appropriate.

> 
> That's fine. Just let me understand: you wouldn't like to have mime4j 
> parsing also CR as newline, right?

I personally would prefer to be able to configure mime4j to do that, 
because this would be consistent with the behavior of HTTP components, 
but can live with CR treated as newlines.

> What do you expect from mime4j when a
> CR is found around the mime stream?
> 

Ignore it

Oleg



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski ha scritto:
> On Sat, 2008-07-19 at 18:29 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski ha scritto:
> ...
> 
>>>> Yes
>>>>> IMHO, it's clear we'll never be able to
>>>>> alter malformed mime content while preserving the malformations, so we
>>>>> have to think that in output we always have to create a canonical mime
>>>>> message. This is currently not the case, but this is the minor of my
>>>>> concern (because it is easier to fix, I think).
>>>> So I think there's rough consensus that writing the DOM should
>>>> canonicalise. Yes, I agree that this can be accomodated by altering
>>>> the DOM writer.
>>>>> So the issue is also during parsing:
>>>>>
>>>>> 1) we now have special treatment for isolated LF, we do not have
>>>>> something similar for CR (AFAIK both are special end of line delimiters
>>>>> used in some specific platform and not compliant to the canonical mime
>>>>> format, so I think we *should* support both special chars (in a lenient
>>>>> parsing).
>>>> If this logic can be acommodated easily then it sounds like we
>>>> probably should unless there are good reasons not to
>>>>
>>>>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
>>>>> the "line" concept: I do expect this one to treat "non canonical"
>>>>> newlines like the header/structure parser: if headers are allowed to
>>>>> terminate with an isolated LF then also lines in text content should do
>>>>> the same (because probably the whole mime message has LF instead of
>>>>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>>>>> encoded using LF instead of CRLF and that this specific encoding breaks
>>>>> binary parts, but we want to be smarter wrt this issue].
>>>> TextBody is part of the DOM. This can and should be addressed there
>>>> (rather than in the parser). I think that doing this should satisfy
>>>> both needs without compromising the performance of the parser.
>>>>
>>> If this is indeed something we can all agree on, I can try to solve the 
>>> first problem (strict/lenient line delimiter handling) using a pluggable 
>>> strategy of some kind.
>>>
>>> Oleg
>> My limited knowledge of mime4j details doesn't let me reply "+1". So I 
>> simply tell what I expect from mime4j as a user:
>>
>> Lenient line delimiter parsing:
>> - consider isolated LF and CR in the mime stream as newlines as long as 
>> a newline concept exists in that specific place (everywhere but binary 
>> body parts having ContentTransferEncoding = "binary").
>> - This means that a CR in a base64 stream is a newline, a CR in a 
>> text/plain is a newline, a "CR<boundary> CR" sequence is a valid 
>> multipart boundary, "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", 
>> "CRCR" or "LFLF" sequences are valid separators between header and body 
>> because they are considered as equivalent to "CRLFCRLF".
>> - This also means that writing in output this stuff will result in a
>> mime stream with NO isolated CRs or LFs (unless they are in a "binary" 
>> encoded body).
>>
>> Strict line delimiter parsing (I don't care if we have this now, I just 
>> think we should have this in mind while factoring mime4j because it 
>> should be possible to implement this with no major changes).
>> - LFs and CRs are not newlines, they are not considered newlines and 
>> results in errors raised by the parser (invalid header, invalid content, 
>> and so on) that will result in a parsing failure or (if the raised 
>> errors are ignored) in invalid DOM (I'm not sure how we currently handle 
>> this case for non-expected 8bit content in an header, but it should be 
>> the same).
>> - writing in output this content should result in a well-formed content, so:
>>    - if an LF in the header is somehow "encodable" as a valid sequence 
>> it should be parsed as LF and then encoded while outputting. If instead 
>> an LF in the header is not encodable then we should fail parsing or 
>> remove it (or convert it to "?" or anything similar) if we want to be 
>> lenient.
>>
>> I'm not saying that I want mime4j to support all of this before a 
>> release, I just want to understand if this is what you also expect and 
>> if this can be considered a common goal.
>>
>> Stefano
> 
> <rant disclaimer="please ignore">
> 
> HttpComponents project chose to depend on mime4j instead of developing a
> similar solution because we thought it was the right thing to do. We
> thought we should rather contribute to an existing project instead of
> pursuing competing efforts for which we have neither resources nor the
> right expertise. 
> 
> As a result we had to delay the next release of HttpClient by almost two
> months waiting for a mime4j release. I do not see a point in waiting any
> longer. I see no other way but dropping dependency on mime4j, at least
> temporarily. 
> 
> I did my very best to resolve an old problem no one seemed eager to work
> on for a year and a half. I will happily continue to contribute to this
> project to my best abilities, but in this particular case I see no
> justification to investing any more time in trying to satisfy someone
> else's wish list.
> 
> </rant>

Please note that I never expected you to satisfy any wishlist. I simply
opened the project and tested some code and I found issues. I simply
reported issues. I also tried to help with the repackaging because it is
something that I always found interesting (I also wrote a tool that
tries to classify packages automatically based on dependencies and some
metrics).

I'm not asking you (and I never did) to solve any of the mime4j issues
(the last sentence of mine that you quoted is "I'm not saying that I
want mime4j to support all of this before a release, I just want to
understand if this is what you also expect and if this can be
considered a common goal."): I'm not sure I understand your rant and
why you think I'm (or someone else is) asking you to do something. In
fact we even voted to make you a committer on this project so that you
could work on the code without waiting for our limited time to
review/apply patches.

Most of the issues I opened against Mime4j are simply there because they 
need attention: there is no need to solve them in order to make a release.

Some other issues do require attention (e.g. the quoted-printable
content no longer being decoded), but this is not something we are
asking you to solve. I'm not sure when I'll find the time but I plan to
try to understand when this was broken and how to fix it.

Infinite loops and other OOM issues are there: I don't think I'm the 
cause of them. I understand that it is frustrating for you to find 
critical issues in a library you introduced into your component, but 
please think twice about this.

Most of these issues are regressions against older mime4j versions, and 
Niklas did a good job in giving trunk a go and testing it in his 
environment.

I hope you understand that I'm trying to be constructive in this 
discussion: I'm simply trying to understand the common goal so that we 
don't break each other's work.

I'm even willing to code the solution myself once we agree on the 
expected behaviour. It is simply not worth anyone's time to commit code 
that satisfies one specific need while breaking previous behaviours 
(e.g. for the specific use in an SMTP environment, current trunk is much 
faster but has many more issues than the last release. I don't know 
whether this also applies to other protocols: e.g. do you need 
quoted-printable decoding in HTTP?).

> Stefano,
> 
> (1) I _personally_ see the strict handling of line delimiters as
> _completely_ and _utterly_ pointless

I hope you understood that I'm not saying we should delay a mime4j 
release for any of the issues I'm discussing here, and also that I don't 
think strict handling is a required feature, only that it would be a 
desirable option for a library like this.

> (2) I encountered only two types of line delimiters in HTTP messages in
> the wild: <CRLF> and <LF> (often mixed in the same message). I cannot
> recall seeing messages where <CR> was used as a line delimiter. I can
> live with any solution that enables me to configure mime4j to parse
> messages where both <CRLF> and <LF> can be used as line delimiters.

That's fine. Just let me understand: you wouldn't want mime4j to also 
parse CR as a newline, right? What do you expect from mime4j when a CR 
is found somewhere in the MIME stream?

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sat, 2008-07-19 at 18:29 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski ha scritto:
...

> >> Yes
> >>> IMHO, it's clear we'll never be able to
> >>> alter malformed mime content while preserving the malformations, so we
> >>> have to think that in output we always have to create a canonical mime
> >>> message. This is currently not the case, but this is the minor of my
> >>> concern (because it is easier to fix, I think).
> >>
> >> So I think there's rough consensus that writing the DOM should
> >> canonicalise. Yes, I agree that this can be accomodated by altering
> >> the DOM writer.
> >>> So the issue is also during parsing:
> >>>
> >>> 1) we now have special treatment for isolated LF, we do not have
> >>> something similar for CR (AFAIK both are special end of line delimiters
> >>> used in some specific platform and not compliant to the canonical mime
> >>> format, so I think we *should* support both special chars (in a lenient
> >>> parsing).
> >> If this logic can be acommodated easily then it sounds like we
> >> probably should unless there are good reasons not to
> >>
> >>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
> >>> the "line" concept: I do expect this one to treat "non canonical"
> >>> newlines like the header/structure parser: if headers are allowed to
> >>> terminate with an isolated LF then also lines in text content should do
> >>> the same (because probably the whole mime message has LF instead of
> >>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
> >>> encoded using LF instead of CRLF and that this specific encoding breaks
> >>> binary parts, but we want to be smarter wrt this issue].
> >>
> >> TextBody is part of the DOM. This can and should be addressed there
> >> (rather than in the parser). I think that doing this should satisfy
> >> both needs without compromising the performance of the parser.
> >>
> > 
> > If this is indeed something we can all agree on, I can try to solve the 
> > first problem (strict/lenient line delimiter handling) using a pluggable 
> > strategy of some kind.
> > 
> > Oleg
> 
> My limited knowledge of mime4j details doesn't let me reply "+1". So I 
> simply tell what I expect from mime4j as an user:
> 
> Lenient line delimiter parsing:
> - consider isolated LF and CR in the mime stream as newlines as long as 
> a newline concept exists in that specific place (everywhere but binary 
> body parts having ContentTransferEncoding = "binary").
> - This means that a CR in a base64 stream is a newline, a CR in a 
> text/plain is a newline, a "CR<boundary> CR" sequence is a valid 
> multipart boundary, "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", 
> "CRCR" or "LFLF" sequences are valid separators between header and body 
> because they are considered as equivalent to "CRLFCRLF".
> - THis also means that writing in output this stuff will result in a 
> mime stream with NO isolated CRs or LFs (unless they are in a "binary" 
> encoded body).
> 
> Strict line delimiter parsing (I don't care if we have this now, I just 
> think we should have this in mind while factoring mime4j because it 
> should be possible to implement this with no major changes).
> - LFs and CRs are not newlines, they are not considered newlines and 
> results in errors raised by the parser (invalid header, invalid content, 
> and so on) that will result in a parsing failure or (if the raised 
> errors are ignored) in invalid DOM (I'm not sure how we currently handle 
> this case for non-expected 8bit content in an header, but it should be 
> the same).
> - writing in output this content should result in a well-formed content, so:
>    - if an LF in the header is somehow "encodable" as a valid sequence 
> it should be parsed as LF and then encoded while outputting. If instead 
> an LF in the header is not encodable then we should fail parsing or 
> remove it (or convert it to "?" or anything similar) if we want to be 
> lenient.
> 
> I'm not saying that I want mime4j to support all of this before a 
> release, I just want to understand if this is what you also expect and 
> if this can be considered a common goal.
> 
> Stefano

<rant disclaimer="please ignore">

HttpComponents project chose to depend on mime4j instead of developing a
similar solution because we thought it was the right thing to do. We
thought we should rather contribute to an existing project instead of
pursuing competing efforts for which we have neither resources nor the
right expertise. 

As a result we had to delay the next release of HttpClient by almost two
months waiting for a mime4j release. I do not see a point in waiting any
longer. I see no other way but dropping dependency on mime4j, at least
temporarily. 

I did my very best to resolve an old problem no one seemed eager to work
on for a year and a half. I will happily continue to contribute to this
project to the best of my abilities, but in this particular case I see no
justification for investing any more time in trying to satisfy someone
else's wish list.

</rant>

Stefano,

(1) I _personally_ see the strict handling of line delimiters as
_completely_ and _utterly_ pointless

(2) I encountered only two types of line delimiters in HTTP messages in
the wild: <CRLF> and <LF> (often mixed in the same message). I cannot
recall seeing messages where <CR> was used as a line delimiter. I can
live with any solution that enables me to configure mime4j to parse
messages where both <CRLF> and <LF> can be used as line delimiters.

Oleg





Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski ha scritto:
> Robert Burrell Donkin wrote:
>> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>>> Robert Burrell Donkin ha scritto:
>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org> 
>>>> wrote:
>>>>> Robert Burrell Donkin ha scritto:
>>>>>> On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <ap...@bago.org>
>>>>>> wrote:
>>>>>>> Stefano Bagnara ha scritto:
>>>>>> <snip>
>>>>>>
>>>>>> can we rewind a little
>>>>>>
>>>>>>>> - If the message have only newlines it seems mime4j ends up 
>>>>>>>> outputting
>>>>>>>> headers with CRLF and body with LF.
>>>>>> am i right in assuming that this is about using Mime4J for
>>>>>> roundtripping via org.apache.james.mime4j.message.Message?
>>>>> It involve both reading and writing.
>>>>>
>>>>> In our specific case I record that we accept an LF as separator in
>>>>> headers,
>>>>> but we take a CR as a char part of the header (while it is invalid).
>>>>>
>>>>> E.g: I would say that in the case of an isolated CR in headers we 
>>>>> have 3
>>>>> options:
>>>>> 1) consider it a newline
>>>>>  1a) output it as-is when roundtripping
>>>>>  1b) convert it to CRLF when roundtripping
>>>>> 2) fail parsing (malformed message)
>>>>> 3) use it as part of the header value.
>>>>>
>>>>> Now we do #3 and I think this is the worst solution.
>>>>> I don't know if mime4j should support all of the 4 solutions above 
>>>>> for a
>>>>> CR
>>>>> (4 configurations seems too much to me) but I think we should 
>>>>> discuss the
>>>>> merit of each solution and decide what are the one we want to support.
>>>> i understand this argument. however, i still think we need to step
>>>> back a little and gain some perspective.
>>>>
>>>> round tripping involves two distinct components.  the parser parses
>>>> the message into a DOM (Message) which is then written out.
>>>>
>>>> AIUI it is this complete cycle that results in the line ending
>>>> inconsistency noted between the input and the output.  is my
>>>> understanding correct?
>>> I think we should discuss about parsing separated from outputting
>>> something we have in memory.
>>
>> Yes
>>> IMHO, it's clear we'll never be able to
>>> alter malformed mime content while preserving the malformations, so we
>>> have to think that in output we always have to create a canonical mime
>>> message. This is currently not the case, but this is the minor of my
>>> concern (because it is easier to fix, I think).
>>
>> So I think there's rough consensus that writing the DOM should
>> canonicalise. Yes, I agree that this can be accomodated by altering
>> the DOM writer.
>>> So the issue is also during parsing:
>>>
>>> 1) we now have special treatment for isolated LF, we do not have
>>> something similar for CR (AFAIK both are special end of line delimiters
>>> used in some specific platform and not compliant to the canonical mime
>>> format, so I think we *should* support both special chars (in a lenient
>>> parsing).
>> If this logic can be acommodated easily then it sounds like we
>> probably should unless there are good reasons not to
>>
>>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
>>> the "line" concept: I do expect this one to treat "non canonical"
>>> newlines like the header/structure parser: if headers are allowed to
>>> terminate with an isolated LF then also lines in text content should do
>>> the same (because probably the whole mime message has LF instead of
>>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>>> encoded using LF instead of CRLF and that this specific encoding breaks
>>> binary parts, but we want to be smarter wrt this issue].
>>
>> TextBody is part of the DOM. This can and should be addressed there
>> (rather than in the parser). I think that doing this should satisfy
>> both needs without compromising the performance of the parser.
>>
> 
> If this is indeed something we can all agree on, I can try to solve the 
> first problem (strict/lenient line delimiter handling) using a pluggable 
> strategy of some kind.
> 
> Oleg

My limited knowledge of mime4j details doesn't let me reply "+1". So I 
will simply say what I expect from mime4j as a user:

Lenient line delimiter parsing:
- consider isolated LF and CR in the MIME stream as newlines wherever a 
newline concept exists in that specific place (everywhere except body 
parts having Content-Transfer-Encoding = "binary").
- This means that a CR in a base64 stream is a newline, a CR in a 
text/plain part is a newline, a "CR<boundary> CR" sequence is a valid 
multipart boundary, and "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or 
"LFLF" sequences are valid separators between header and body because 
they are considered equivalent to "CRLFCRLF".
- This also means that writing this content to output will result in a 
MIME stream with NO isolated CRs or LFs (unless they are in a "binary" 
encoded body).
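
As an illustration of the lenient behaviour listed above, here is a 
minimal sketch of the canonicalisation step. It assumes the non-binary 
region is already available as a byte array; the class LenientEol and 
its canonicalize method are hypothetical names, not mime4j API, and 
boundary detection is out of scope.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper (not part of mime4j): rewrites CR, LF and CRLF as CRLF. */
final class LenientEol {
    private LenientEol() {}

    static byte[] canonicalize(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 16);
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            if (b == '\r') {
                // CR immediately followed by LF is a single delimiter: skip the LF.
                if (i + 1 < in.length && in[i + 1] == '\n') {
                    i++;
                }
                out.write('\r');
                out.write('\n');
            } else if (b == '\n') {
                // Isolated LF: emit the canonical CRLF instead.
                out.write('\r');
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

With this sketch, every header/body separator variant in the list 
("CRLFCR", "LFCR", "CRCR", "LFLF" and so on) comes out as "CRLFCRLF". 
It must not be applied to parts declared as "binary".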

Strict line delimiter parsing (I don't care whether we have this now; I 
just think we should keep it in mind while factoring mime4j, because it 
should be possible to implement it with no major changes):
- Isolated LFs and CRs are not considered newlines; they result in 
errors raised by the parser (invalid header, invalid content, and so on), 
which lead either to a parsing failure or (if the raised errors are 
ignored) to an invalid DOM (I'm not sure how we currently handle this 
case for unexpected 8-bit content in a header, but it should be the 
same).
- writing this content to output should result in well-formed content, so:
   - if an LF in the header is somehow "encodable" as a valid sequence, 
it should be parsed as LF and then encoded on output. If instead an LF 
in the header is not encodable, then we should fail parsing, or remove 
it (or convert it to "?" or something similar) if we want to be 
lenient.
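
The detection part of the strict mode could be sketched as follows. 
This is only an illustration under the assumption that the input is a 
byte array; StrictEol and indexOfBareDelimiter are hypothetical names, 
and how the caller reports the error (exception, monitor callback) is 
left open.

```java
/** Hypothetical helper (not part of mime4j): finds delimiters that are not CRLF. */
final class StrictEol {
    private StrictEol() {}

    /** Returns the offset of the first isolated CR or LF, or -1 if every delimiter is CRLF. */
    static int indexOfBareDelimiter(byte[] in) {
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '\r') {
                if (i + 1 < in.length && in[i + 1] == '\n') {
                    i++; // well-formed CRLF, skip the LF
                } else {
                    return i; // CR not followed by LF
                }
            } else if (in[i] == '\n') {
                return i; // LF not preceded by CR
            }
        }
        return -1;
    }
}
```

A strict parser could raise its "invalid header" or "invalid content" 
error at the returned offset.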

I'm not saying that I want mime4j to support all of this before a 
release, I just want to understand if this is what you also expect and 
if this can be considered a common goal.

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
Robert Burrell Donkin wrote:
> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>> Robert Burrell Donkin ha scritto:
>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org> wrote:
>>>> Robert Burrell Donkin ha scritto:
>>>>> On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <ap...@bago.org>
>>>>> wrote:
>>>>>> Stefano Bagnara ha scritto:
>>>>> <snip>
>>>>>
>>>>> can we rewind a little
>>>>>
>>>>>>> - If the message have only newlines it seems mime4j ends up outputting
>>>>>>> headers with CRLF and body with LF.
>>>>> am i right in assuming that this is about using Mime4J for
>>>>> roundtripping via org.apache.james.mime4j.message.Message?
>>>> It involve both reading and writing.
>>>>
>>>> In our specific case I record that we accept an LF as separator in
>>>> headers,
>>>> but we take a CR as a char part of the header (while it is invalid).
>>>>
>>>> E.g: I would say that in the case of an isolated CR in headers we have 3
>>>> options:
>>>> 1) consider it a newline
>>>>  1a) output it as-is when roundtripping
>>>>  1b) convert it to CRLF when roundtripping
>>>> 2) fail parsing (malformed message)
>>>> 3) use it as part of the header value.
>>>>
>>>> Now we do #3 and I think this is the worst solution.
>>>> I don't know if mime4j should support all of the 4 solutions above for a
>>>> CR
>>>> (4 configurations seems too much to me) but I think we should discuss the
>>>> merit of each solution and decide what are the one we want to support.
>>> i understand this argument. however, i still think we need to step
>>> back a little and gain some perspective.
>>>
>>> round tripping involves two distinct components.  the parser parses
>>> the message into a DOM (Message) which is then written out.
>>>
>>> AIUI it is this complete cycle that results in the line ending
>>> inconsistency noted between the input and the output.  is my
>>> understanding correct?
>> I think we should discuss about parsing separated from outputting
>> something we have in memory.
> 
> Yes
>> IMHO, it's clear we'll never be able to
>> alter malformed mime content while preserving the malformations, so we
>> have to think that in output we always have to create a canonical mime
>> message. This is currently not the case, but this is the minor of my
>> concern (because it is easier to fix, I think).
> 
> So I think there's rough consensus that writing the DOM should
> canonicalise. Yes, I agree that this can be accomodated by altering
> the DOM writer.
>> So the issue is also during parsing:
>>
>> 1) we now have special treatment for isolated LF, we do not have
>> something similar for CR (AFAIK both are special end of line delimiters
>> used in some specific platform and not compliant to the canonical mime
>> format, so I think we *should* support both special chars (in a lenient
>> parsing).
> If this logic can be acommodated easily then it sounds like we
> probably should unless there are good reasons not to
> 
>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
>> the "line" concept: I do expect this one to treat "non canonical"
>> newlines like the header/structure parser: if headers are allowed to
>> terminate with an isolated LF then also lines in text content should do
>> the same (because probably the whole mime message has LF instead of
>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>> encoded using LF instead of CRLF and that this specific encoding breaks
>> binary parts, but we want to be smarter wrt this issue].
> 
> TextBody is part of the DOM. This can and should be addressed there
> (rather than in the parser). I think that doing this should satisfy
> both needs without compromising the performance of the parser.
> 

If this is indeed something we can all agree on, I can try to solve the 
first problem (strict/lenient line delimiter handling) using a pluggable 
strategy of some kind.
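
One possible shape for such a pluggable strategy is sketched below. It 
is not the design that was eventually committed; the interface name 
LineDelimiterStrategy, its method and the three constants are all 
assumptions made for illustration.

```java
/** Hypothetical strategy interface; name and shape are assumptions, not mime4j API. */
interface LineDelimiterStrategy {
    /** Returns the length of the line delimiter starting at pos, or 0 if there is none. */
    int delimiterLength(byte[] buf, int pos);

    /** Strict: only CRLF ends a line. */
    LineDelimiterStrategy CRLF_ONLY =
            (buf, pos) -> LineDelimiterStrategy.isCrLf(buf, pos) ? 2 : 0;

    /** CRLF and isolated LF end a line; an isolated CR is ordinary data. */
    LineDelimiterStrategy CRLF_OR_LF =
            (buf, pos) -> LineDelimiterStrategy.isCrLf(buf, pos) ? 2
                    : (buf[pos] == '\n' ? 1 : 0);

    /** Fully lenient: CRLF, isolated LF and isolated CR all end a line. */
    LineDelimiterStrategy ANY =
            (buf, pos) -> LineDelimiterStrategy.isCrLf(buf, pos) ? 2
                    : (buf[pos] == '\n' || buf[pos] == '\r' ? 1 : 0);

    static boolean isCrLf(byte[] buf, int pos) {
        return buf[pos] == '\r' && pos + 1 < buf.length && buf[pos + 1] == '\n';
    }
}

/** Small driver showing how a scanner would consult the strategy. */
final class DelimiterCounter {
    private DelimiterCounter() {}

    static int count(byte[] buf, LineDelimiterStrategy strategy) {
        int count = 0;
        int i = 0;
        while (i < buf.length) {
            int n = strategy.delimiterLength(buf, i);
            if (n > 0) {
                count++;
                i += n;
            } else {
                i++;
            }
        }
        return count;
    }
}
```

For the input "a CRLF b LF c CR d" the three strategies see 1, 2 and 3 
line delimiters respectively, so the choice is a configuration matter 
and the scanning loop stays the same.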

Oleg


> Robert
> 
>> Stefano




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Robert Burrell Donkin ha scritto:
> On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
>> 2) ((TextBody) b).getReader(). This give me a reader, so this support
>> the "line" concept: I do expect this one to treat "non canonical"
>> newlines like the header/structure parser: if headers are allowed to
>> terminate with an isolated LF then also lines in text content should do
>> the same (because probably the whole mime message has LF instead of
>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>> encoded using LF instead of CRLF and that this specific encoding breaks
>> binary parts, but we want to be smarter wrt this issue].
> 
> TextBody is part of the DOM. This can and should be addressed there
> (rather than in the parser). I think that doing this should satisfy
> both needs without compromising the performance of the parser.

I don't think that LF/CR replacement is a performance issue. The current 
implementation of the filter stream most probably has performance 
problems, but that does not mean that replacing newlines is inherently 
expensive (we already have to scan for LF/CR/CRLF anyway).

That said, if we think it is better not to do this during parsing, 
there is no need to talk about performance (I admit I don't know mime4j 
internals, nor the format used to temporarily store parts on disk, so I 
don't know whether it is better to alter the content while writing it or 
while reading it back).

What about RootInputStream line counting? How should we update it if we 
support isolated LF/CR in lenient parsing? How should it behave with 
respect to "binary encoded" parts?

Stefano
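
For the line-counting question above, a lenient counter could look like 
the following sketch. It does not reproduce RootInputStream; the class 
LineCountingStream is a hypothetical stand-in built on the JDK's 
FilterInputStream, it ignores skip() and mark/reset, and it does not 
answer what to do inside "binary" parts.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical stream (not mime4j's RootInputStream) counting CR, LF and CRLF as one break each. */
final class LineCountingStream extends FilterInputStream {
    private int lineBreaks;
    private boolean lastWasCr;

    LineCountingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            track(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) {
            track(buf[off + i]);
        }
        return n;
    }

    private void track(int b) {
        if (b == '\r') {
            lineBreaks++;
            lastWasCr = true;
        } else {
            // An LF right after a CR belongs to the CRLF that was already counted.
            if (b == '\n' && !lastWasCr) {
                lineBreaks++;
            }
            lastWasCr = false;
        }
    }

    int getLineBreaks() {
        return lineBreaks;
    }

    /** Convenience wrapper: drains the data through a small buffer and returns the count. */
    static int countLineBreaks(byte[] data) {
        try (LineCountingStream s = new LineCountingStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[3];
            while (s.read(buf) != -1) {
                // discard the bytes, only the count matters here
            }
            return s.getLineBreaks();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the "previous byte was CR" flag is kept across reads, a CRLF 
split over two buffers is still counted once.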



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On 7/18/08, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin ha scritto:
>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org> wrote:
>>> Robert Burrell Donkin ha scritto:
>>>> On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <ap...@bago.org>
>>>> wrote:
>>>>> Stefano Bagnara ha scritto:
>>>> <snip>
>>>>
>>>> can we rewind a little
>>>>
>>>>>> - If the message have only newlines it seems mime4j ends up outputting
>>>>>> headers with CRLF and body with LF.
>>>> am i right in assuming that this is about using Mime4J for
>>>> roundtripping via org.apache.james.mime4j.message.Message?
>>> It involve both reading and writing.
>>>
>>> In our specific case I record that we accept an LF as separator in
>>> headers,
>>> but we take a CR as a char part of the header (while it is invalid).
>>>
>>> E.g: I would say that in the case of an isolated CR in headers we have 3
>>> options:
>>> 1) consider it a newline
>>>  1a) output it as-is when roundtripping
>>>  1b) convert it to CRLF when roundtripping
>>> 2) fail parsing (malformed message)
>>> 3) use it as part of the header value.
>>>
>>> Now we do #3 and I think this is the worst solution.
>>> I don't know if mime4j should support all of the 4 solutions above for a
>>> CR
>>> (4 configurations seems too much to me) but I think we should discuss the
>>> merit of each solution and decide what are the one we want to support.
>>
>> i understand this argument. however, i still think we need to step
>> back a little and gain some perspective.
>>
>> round tripping involves two distinct components.  the parser parses
>> the message into a DOM (Message) which is then written out.
>>
>> AIUI it is this complete cycle that results in the line ending
>> inconsistency noted between the input and the output.  is my
>> understanding correct?
>
> I think we should discuss about parsing separated from outputting
> something we have in memory.

Yes
> IMHO, it's clear we'll never be able to
> alter malformed mime content while preserving the malformations, so we
> have to think that in output we always have to create a canonical mime
> message. This is currently not the case, but this is the minor of my
> concern (because it is easier to fix, I think).

So I think there's rough consensus that writing the DOM should
canonicalise. Yes, I agree that this can be accommodated by altering
the DOM writer.
> So the issue is also during parsing:
>
> 1) we now have special treatment for isolated LF, we do not have
> something similar for CR (AFAIK both are special end of line delimiters
> used in some specific platform and not compliant to the canonical mime
> format, so I think we *should* support both special chars (in a lenient
> parsing).
If this logic can be accommodated easily then it sounds like we
probably should, unless there are good reasons not to.

> 2) ((TextBody) b).getReader(). This give me a reader, so this support
> the "line" concept: I do expect this one to treat "non canonical"
> newlines like the header/structure parser: if headers are allowed to
> terminate with an isolated LF then also lines in text content should do
> the same (because probably the whole mime message has LF instead of
> CRLF). [RFC seems to suggest that the fact is that the MIME message is
> encoded using LF instead of CRLF and that this specific encoding breaks
> binary parts, but we want to be smarter wrt this issue].

TextBody is part of the DOM. This can and should be addressed there
(rather than in the parser). I think that doing this should satisfy
both needs without compromising the performance of the parser.

Robert

>
> Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Robert Burrell Donkin ha scritto:
> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org> wrote:
>> Robert Burrell Donkin ha scritto:
>>> On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <ap...@bago.org> wrote:
>>>> Stefano Bagnara ha scritto:
>>> <snip>
>>>
>>> can we rewind a little
>>>
>>>>> - If the message have only newlines it seems mime4j ends up outputting
>>>>> headers with CRLF and body with LF.
>>> am i right in assuming that this is about using Mime4J for
>>> roundtripping via org.apache.james.mime4j.message.Message?
>> It involve both reading and writing.
>>
>> In our specific case I record that we accept an LF as separator in headers,
>> but we take a CR as a char part of the header (while it is invalid).
>>
>> E.g: I would say that in the case of an isolated CR in headers we have 3
>> options:
>> 1) consider it a newline
>>  1a) output it as-is when roundtripping
>>  1b) convert it to CRLF when roundtripping
>> 2) fail parsing (malformed message)
>> 3) use it as part of the header value.
>>
>> Now we do #3 and I think this is the worst solution.
>> I don't know if mime4j should support all of the 4 solutions above for a CR
>> (4 configurations seems too much to me) but I think we should discuss the
>> merit of each solution and decide what are the one we want to support.
> 
> i understand this argument. however, i still think we need to step
> back a little and gain some perspective.
> 
> round tripping involves two distinct components.  the parser parses
> the message into a DOM (Message) which is then written out.
> 
> AIUI it is this complete cycle that results in the line ending
> inconsistency noted between the input and the output.  is my
> understanding correct?

I think we should discuss parsing separately from outputting something 
we have in memory. IMHO, it's clear we'll never be able to alter 
malformed MIME content while preserving the malformations, so we have to 
accept that on output we always create a canonical MIME message. This is 
currently not the case, but it is the lesser of my concerns (because it 
is easier to fix, I think).

So the issue is also during parsing:

1) we now have special treatment for isolated LF, but nothing similar 
for CR. AFAIK both are end-of-line delimiters used on specific platforms 
and not compliant with the canonical MIME format, so I think we *should* 
support both (in lenient parsing).

2) ((TextBody) b).getReader(). This gives me a reader, so it supports 
the "line" concept. I expect it to treat "non canonical" newlines the 
way the header/structure parser does: if headers are allowed to 
terminate with an isolated LF, then lines in text content should be too 
(because probably the whole MIME message has LF instead of CRLF). [The 
RFC seems to suggest that in such a case the MIME message is encoded 
using LF instead of CRLF and that this specific encoding breaks binary 
parts, but we want to be smarter wrt this issue].
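
For comparison, this is how the plain JDK treats the "line" concept on 
a Reader: BufferedReader.readLine() accepts LF, CR and CRLF as line 
terminators. The sketch uses a StringReader as a stand-in for whatever 
reader a text body returns; the class name ReaderLines is hypothetical 
and nothing here is mime4j code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustration only: line splitting as done by java.io.BufferedReader. */
final class ReaderLines {
    private ReaderLines() {}

    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            // readLine() strips the terminator, whichever of LF, CR or CRLF it was.
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

So a caller that wraps the body reader in a BufferedReader already gets 
lenient line handling for text content, independently of what the 
header parser accepts.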

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin ha scritto:
>>
>> On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <ap...@bago.org> wrote:
>>>
>>> Stefano Bagnara ha scritto:
>>
>> <snip>
>>
>> can we rewind a little
>>
>>>> - If the message have only newlines it seems mime4j ends up outputting
>>>> headers with CRLF and body with LF.
>>
>> am i right in assuming that this is about using Mime4J for
>> roundtripping via org.apache.james.mime4j.message.Message?
>
> It involve both reading and writing.
>
> In our specific case I record that we accept an LF as separator in headers,
> but we take a CR as a char part of the header (while it is invalid).
>
> E.g: I would say that in the case of an isolated CR in headers we have 3
> options:
> 1) consider it a newline
>  1a) output it as-is when roundtripping
>  1b) convert it to CRLF when roundtripping
> 2) fail parsing (malformed message)
> 3) use it as part of the header value.
>
> Now we do #3 and I think this is the worst solution.
> I don't know if mime4j should support all of the 4 solutions above for a CR
> (4 configurations seems too much to me) but I think we should discuss the
> merit of each solution and decide what are the one we want to support.

i understand this argument. however, i still think we need to step
back a little and gain some perspective.

round tripping involves two distinct components.  the parser parses
the message into a DOM (Message) which is then written out.

AIUI it is this complete cycle that results in the line ending
inconsistency noted between the input and the output.  is my
understanding correct?

- robert



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Robert Burrell Donkin ha scritto:
> On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <ap...@bago.org> wrote:
>> Stefano Bagnara ha scritto:
> 
> <snip>
> 
> can we rewind a little
> 
>>> - If the message have only newlines it seems mime4j ends up outputting
>>> headers with CRLF and body with LF.
> 
> am i right in assuming that this is about using Mime4J for
> roundtripping via org.apache.james.mime4j.message.Message?

It involves both reading and writing.

In our specific case I see that we accept an LF as a separator in 
headers, but we take a CR as a character that is part of the header 
(even though that is invalid).

E.g. I would say that in the case of an isolated CR in headers we have 3 
options:
1) consider it a newline
   1a) output it as-is when roundtripping
   1b) convert it to CRLF when roundtripping
2) fail parsing (malformed message)
3) use it as part of the header value.

Right now we do #3, and I think this is the worst solution.
I don't know if mime4j should support all 4 of the solutions above for a 
CR (4 configurations seem too many to me), but I think we should discuss 
the merits of each solution and decide which ones we want to support.
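
To make the four behaviours above concrete, here is a minimal Java sketch. It is not mime4j code: the names BareCrPolicy and HeaderLineSplitter are hypothetical, and the sketch assumes that an isolated LF keeps being accepted as a separator, as the parser does today.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical policy names for options 1a, 1b, 2 and 3; not mime4j API. */
enum BareCrPolicy { NEWLINE_KEEP, NEWLINE_TO_CRLF, FAIL, PART_OF_VALUE }

final class HeaderLineSplitter {

    /**
     * Splits a raw header block into lines. Each returned line carries the
     * terminator that would be written back out when roundtripping.
     */
    static List<String> split(String raw, BareCrPolicy policy) {
        List<String> lines = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\r' && i + 1 < raw.length() && raw.charAt(i + 1) == '\n') {
                endLine(lines, cur, "\r\n");   // canonical CRLF
                i++;                           // skip the LF of the pair
            } else if (c == '\n') {
                endLine(lines, cur, "\n");     // assumption: bare LF stays a separator
            } else if (c == '\r') {
                switch (policy) {
                    case NEWLINE_KEEP:    endLine(lines, cur, "\r");   break; // 1a
                    case NEWLINE_TO_CRLF: endLine(lines, cur, "\r\n"); break; // 1b
                    case FAIL:                                                // 2
                        throw new IllegalArgumentException("isolated CR at offset " + i);
                    case PART_OF_VALUE:   cur.append(c);               break; // 3
                }
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) {
            lines.add(cur.toString());         // trailing text without a terminator
        }
        return lines;
    }

    private static void endLine(List<String> lines, StringBuilder cur, String terminator) {
        lines.add(cur + terminator);
        cur.setLength(0);
    }
}
```

With the input "A: 1\rB: 2\r\n", policy NEWLINE_TO_CRLF yields two well-formed header lines, while PART_OF_VALUE (what happens today) yields a single line whose value contains the CR.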

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <ap...@bago.org> wrote:
> Stefano Bagnara ha scritto:

<snip>

can we rewind a little

>> - If the message have only newlines it seems mime4j ends up outputting
>> headers with CRLF and body with LF.

am i right in assuming that this is about using Mime4J for
roundtripping via org.apache.james.mime4j.message.Message?

- robert



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> On Fri, 2008-07-18 at 10:25 +0200, Stefano Bagnara wrote:
>> This make it clear, to me, that anyway we want to support the binary 
>> encoding (at least when it is specified and when other environment says 
>> that it is the default behaviour).
>>
>> Second thing I would like to understand if this is the only case where 
>> conversion of isolated CR and LF to CRLF would create issues or if HTTP 
>> shows more issues.
>>
>> Third I would like to understand if simply having mime4j to not alter 
>> any isolated CR and LF and fail parsing when an isolated CR or LF is 
>> found outside binary content would be ok for http needs.
> 
> Unfortunately not. There are lots of HTTP services that mix LF and CRLF
> line delimiters in the same packet. In the HTTP world there is no way
> around tolerating LFs and treating them as equivalent to CRLF.  

What do you mean by "tolerating LFs"?
What does an LF in a header do? Is it an end of line (and end of header) 
or a bad char in the header?
If it is the end of line then you are "virtually" converting an LF-encoded 
mime stream to its canonical representation (lines ending with CRLF).

Is it OK for HTTP if isolated LF and isolated CR are both considered 
line terminations ONLY in the header+structure part of mime streams AND 
in parts not having "Content-Transfer-Encoding: binary" (also adding a 
specific support for configuring mime4j to consider "binary" anything 
not having a content transfer encoding) ? Or would this still make 
mime4j unusable for HTTP? In this case can you provide real world 
examples we can evaluate?

>>>>>> I don't understand why a conversion is wrong for the http case (when 
>>>>>> does it happen that you have to deal with isolated LF ?).
>>>>> How about binary data in multipart/form encoded requests?
>>>> Can you tell me what RFC are we talking about?
>>>>
>>> We are not taking any RFC here. We are talking real-world content.
>> Well, they are using a protocol, anyway. What to do is specified in an 
>> RFC. I want to know what is the RFC and then to understand if they are 
>> doing something wrong or if we simply misunderstood the RFC or if there 
>> is an RFC we don't know.
>> I'm not saying that we should ignore real-world content if it is non 
>> compliant, I'm saying that we have to understand it better.
>>
>> In this case I think I was looking for this RFC:
>> http://www.faqs.org/rfcs/rfc1867.html
>>
>> I'm not sure that the RFC is the latest and is the only one involved but 
>> there I read (about multipart/form-data):
>> ----------------
>>     While the HTTP protocol can transport arbitrary BINARY data, the
>>     default for mail transport (e.g., if the ACTION is a "mailto:" URL)
>>     is the 7BIT encoding.  The value supplied for a part may need to be
>>     encoded and the "content-transfer-encoding" header supplied if the
>>     value does not conform to the default encoding.  [See section 5 of
>>     RFC 1521 for more details.]
>> ---------------
>>
>> http://www.faqs.org/rfcs/rfc1521.html provides a long paragraph about 
>> content-transfer-encoding but I'm not sure I grok it all.
>>  From my current understanding it does not define a default transfer 
>> encoding and it says that each protocol could define its default (also 
>> telling that SMTP rfc821 define the 7bit as the default).
>>
>> So maybe there is an HTTP RFC that tell that in an HTTP world the 
>> default is "binary".
>>
> 
> I am not aware of such RFC but it can well be I have just never come
> across such a document.

Is RFC1867 the only RFC about HTTP/MIME "collaboration"?

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Fri, 2008-07-18 at 10:25 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski ha scritto:
> > On Thu, 2008-07-17 at 20:25 +0200, Stefano Bagnara wrote:
> >> Oleg Kalnichevski ha scritto:
> >>> Stefano Bagnara wrote:
> >>>>
> >>>> I've had a fast read of the RFC2822 about this issue. It insists that 
> >>>> CRLF is the only valid delimiter for a canonical rfc822 message. 
> >>>> Furthermore rfc2822 does not allow the use of isolated CR or LF.
> >>>> So, whenever isolated CR or isolated LF is found we have a malformed 
> >>>> rfc822 message and we have to define how to deal with it.
> >>>>
> >>> Tell the users of IE about it
> >> Can you provide more informations? What is the issue in IE?
> > 
> > Blatant disregard of all standards imaginable. Mozilla is actually
> > hardly any better.
> 
> I'm used to non compliant stuff: in the SMTP world there is plenty.
> What I want to be sure is that we don't do the same they did: simply 
> working "by example" without reading accurately the RFCs tend to create 
> multiple incompatible results. I wouldn't like if mime4j created more 
> non-standard output for people to blame us.
> 
> >>  Does it post 
> >> malformed mime? Can you be more precise about what version of IE and 
> >> what kind of malformed sequences are produced?
> > 
> > All common browsers known to me put raw binary in the multipart/form
> > coded requests. I do not have an IE wire dump handy but I have a few
> > ones of Firefox
> > 
> > https://issues.apache.org/jira/browse/HTTPCLIENT-784    
> > https://issues.apache.org/jira/browse/HTTPCLIENT-785
> 
>  From the 2 bugs it seems they put raw binary without any header 
> specifying that it is pure binary. In standard MIME (to my knowledge) 
> this should be always be preceeded by a "Content-Transfer-Encoding: 
> binary" header (notice "binary" and not "8bit", they are different).
> 
> "Content-Transfer-Encoding: binary" is the only place where isolated CR 
> and LF are allowed.
> 

My bad. I assumed 8bit encoding was the same as binary.


> First thing I don't know if the fact that the CTE-binary is missing is 
> because in the HTTP world use it as the default (as opposite to SMTP 
> 7bit default) or because they are abusing the MIME spec: does anyone 
> know this?
> 

I am not aware of any HTTP-specific requirements, so in my opinion 7bit
should be assumed to be the default encoding regardless of the
underlying transport.


> This make it clear, to me, that anyway we want to support the binary 
> encoding (at least when it is specified and when other environment says 
> that it is the default behaviour).
> 
> Second thing I would like to understand if this is the only case where 
> conversion of isolated CR and LF to CRLF would create issues or if HTTP 
> shows more issues.
> 
> Third I would like to understand if simply having mime4j to not alter 
> any isolated CR and LF and fail parsing when an isolated CR or LF is 
> found outside binary content would be ok for http needs.
> 


Unfortunately not. There are lots of HTTP services that mix LF and CRLF
line delimiters in the same packet. In the HTTP world there is no way
around tolerating LFs and treating them as equivalent to CRLF.  


> >>>> I don't understand why a conversion is wrong for the http case (when 
> >>>> does it happen that you have to deal with isolated LF ?).
> >>> How about binary data in multipart/form encoded requests?
> >> Can you tell me what RFC are we talking about?
> >>
> > 
> > We are not taking any RFC here. We are talking real-world content.
> 
> Well, they are using a protocol, anyway. What to do is specified in an 
> RFC. I want to know what is the RFC and then to understand if they are 
> doing something wrong or if we simply misunderstood the RFC or if there 
> is an RFC we don't know.
> I'm not saying that we should ignore real-world content if it is non 
> compliant, I'm saying that we have to understand it better.
> 
> In this case I think I was looking for this RFC:
> http://www.faqs.org/rfcs/rfc1867.html
> 
> I'm not sure that the RFC is the latest and is the only one involved but 
> there I read (about multipart/form-data):
> ----------------
>     While the HTTP protocol can transport arbitrary BINARY data, the
>     default for mail transport (e.g., if the ACTION is a "mailto:" URL)
>     is the 7BIT encoding.  The value supplied for a part may need to be
>     encoded and the "content-transfer-encoding" header supplied if the
>     value does not conform to the default encoding.  [See section 5 of
>     RFC 1521 for more details.]
> ---------------
> 
> http://www.faqs.org/rfcs/rfc1521.html provides a long paragraph about 
> content-transfer-encoding but I'm not sure I grok it all.
>  From my current understanding it does not define a default transfer 
> encoding and it says that each protocol could define its default (also 
> telling that SMTP rfc821 define the 7bit as the default).
> 
> So maybe there is an HTTP RFC that tell that in an HTTP world the 
> default is "binary".
> 

I am not aware of such RFC but it can well be I have just never come
across such a document.

Oleg


> What is clear is that any CR/LF conversion in a "binary" content is BAD 
> and we don't want MIME4J to do that. So, if we want to be permissive 
> with some content received with bad newlines we have to make sure we 
> don't break binary content.
> 
> Furthermore I would say that there is a need for a "default content 
> transfer encoding" to be used when one is not specified in headers 
> (because this does not seem part of the MIME spec, but of specific 
> protocols specifications).
> 
> WDYT?
> 
> Stefano
> 
> PS: please note that I'm not saying that we should block 0.4 release for 
> this issue. I just think this issue is important and we should care for 
> it, but this can land in 0.5 if we want to.
> 
> 




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> On Thu, 2008-07-17 at 20:25 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski ha scritto:
>>> Stefano Bagnara wrote:
>>>>
>>>> I've had a fast read of the RFC2822 about this issue. It insists that 
>>>> CRLF is the only valid delimiter for a canonical rfc822 message. 
>>>> Furthermore rfc2822 does not allow the use of isolated CR or LF.
>>>> So, whenever isolated CR or isolated LF is found we have a malformed 
>>>> rfc822 message and we have to define how to deal with it.
>>>>
>>> Tell the users of IE about it
>> Can you provide more informations? What is the issue in IE?
> 
> Blatant disregard of all standards imaginable. Mozilla is actually
> hardly any better.

I'm used to non-compliant stuff: in the SMTP world there is plenty.
What I want to be sure of is that we don't do what they did: simply 
working "by example" without reading the RFCs carefully tends to create 
multiple incompatible results. I wouldn't like mime4j to create more 
non-standard output for people to blame us for.

>>  Does it post 
>> malformed mime? Can you be more precise about what version of IE and 
>> what kind of malformed sequences are produced?
> 
> All common browsers known to me put raw binary in the multipart/form
> coded requests. I do not have an IE wire dump handy but I have a few
> ones of Firefox
> 
> https://issues.apache.org/jira/browse/HTTPCLIENT-784    
> https://issues.apache.org/jira/browse/HTTPCLIENT-785

From the 2 bugs it seems they put raw binary without any header 
specifying that it is pure binary. In standard MIME (to my knowledge) 
this should always be preceded by a "Content-Transfer-Encoding: 
binary" header (notice "binary" and not "8bit"; they are different).

"Content-Transfer-Encoding: binary" is the only place where isolated CR 
and LF are allowed.

First, I don't know whether the CTE "binary" header is missing because 
the HTTP world uses it as the default (as opposed to the SMTP 7bit 
default) or because they are abusing the MIME spec: does anyone know?

This makes it clear, to me, that we want to support the binary encoding 
anyway (at least when it is specified, and when the surrounding 
environment says that it is the default behaviour).

Second thing I would like to understand if this is the only case where 
conversion of isolated CR and LF to CRLF would create issues or if HTTP 
shows more issues.

Third I would like to understand if simply having mime4j to not alter 
any isolated CR and LF and fail parsing when an isolated CR or LF is 
found outside binary content would be ok for http needs.

>>>> I don't understand why a conversion is wrong for the http case (when 
>>>> does it happen that you have to deal with isolated LF ?).
>>> How about binary data in multipart/form encoded requests?
>> Can you tell me what RFC are we talking about?
>>
> 
> We are not taking any RFC here. We are talking real-world content.

Well, they are using a protocol, anyway. What to do is specified in an 
RFC. I want to know what is the RFC and then to understand if they are 
doing something wrong or if we simply misunderstood the RFC or if there 
is an RFC we don't know.
I'm not saying that we should ignore real-world content if it is non 
compliant, I'm saying that we have to understand it better.

In this case I think I was looking for this RFC:
http://www.faqs.org/rfcs/rfc1867.html

I'm not sure that the RFC is the latest and is the only one involved but 
there I read (about multipart/form-data):
----------------
    While the HTTP protocol can transport arbitrary BINARY data, the
    default for mail transport (e.g., if the ACTION is a "mailto:" URL)
    is the 7BIT encoding.  The value supplied for a part may need to be
    encoded and the "content-transfer-encoding" header supplied if the
    value does not conform to the default encoding.  [See section 5 of
    RFC 1521 for more details.]
---------------

http://www.faqs.org/rfcs/rfc1521.html provides a long paragraph about 
content-transfer-encoding but I'm not sure I grok it all.
From my current understanding it does not define a default transfer 
encoding and it says that each protocol could define its own default (it 
also says that SMTP, rfc821, defines 7bit as the default).

So maybe there is an HTTP RFC that says that in the HTTP world the 
default is "binary".

What is clear is that any CR/LF conversion in a "binary" content is BAD 
and we don't want MIME4J to do that. So, if we want to be permissive 
with some content received with bad newlines we have to make sure we 
don't break binary content.

Furthermore I would say that there is a need for a "default content 
transfer encoding" to be used when one is not specified in headers 
(because this does not seem part of the MIME spec, but of specific 
protocols specifications).
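
A rough Java sketch of what I mean by a configurable default, under stated assumptions: the class and method names are hypothetical (this is not existing mime4j API), and the protocol default is simply passed in by the caller, e.g. "7bit" for mail or "binary" for an HTTP-oriented setup.

```java
import java.util.Locale;

/** Hypothetical helper; illustrates the idea only, not mime4j code. */
final class TransferEncodingDefaults {

    /** Returns the declared encoding, or the caller-supplied default when the header is absent. */
    static String effectiveEncoding(String declared, String protocolDefault) {
        if (declared == null || declared.trim().isEmpty()) {
            return protocolDefault;
        }
        return declared.trim().toLowerCase(Locale.ROOT);
    }

    /** Binary bodies must never have their CR/LF bytes touched; everything else is a candidate. */
    static boolean mayNormalizeLineEndings(String declared, String protocolDefault) {
        return !"binary".equals(effectiveEncoding(declared, protocolDefault));
    }
}
```

With a "binary" default, a part that has no Content-Transfer-Encoding header (as in the browser uploads above) would be left untouched, while with a "7bit" default the same part would be a candidate for newline repair.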

WDYT?

Stefano

PS: please note that I'm not saying that we should block 0.4 release for 
this issue. I just think this issue is important and we should care for 
it, but this can land in 0.5 if we want to.



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Thu, 2008-07-17 at 20:25 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski ha scritto:
> > Stefano Bagnara wrote:
> >>
> >>
> >> I've had a fast read of the RFC2822 about this issue. It insists that 
> >> CRLF is the only valid delimiter for a canonical rfc822 message. 
> >> Furthermore rfc2822 does not allow the use of isolated CR or LF.
> >> So, whenever isolated CR or isolated LF is found we have a malformed 
> >> rfc822 message and we have to define how to deal with it.
> >>
> > 
> > Tell the users of IE about it
> 
> Can you provide more informations? What is the issue in IE?

Blatant disregard of all standards imaginable. Mozilla is actually
hardly any better.

>  Does it post 
> malformed mime? Can you be more precise about what version of IE and 
> what kind of malformed sequences are produced?
> 

All common browsers known to me put raw binary in the multipart/form
coded requests. I do not have an IE wire dump handy but I have a few
ones of Firefox

https://issues.apache.org/jira/browse/HTTPCLIENT-784    
https://issues.apache.org/jira/browse/HTTPCLIENT-785

> >> I don't understand why a conversion is wrong for the http case (when 
> >> does it happen that you have to deal with isolated LF ?).
> > 
> > How about binary data in multipart/form encoded requests?
> 
> Can you tell me what RFC are we talking about?
> 

We are not talking about any RFC here. We are talking real-world content.

Oleg

> >> So we have options:
> >> 1) fail parsing anything containing isolated CR or LF chars.
> >> 2) parse isolated CR or isolated LF as CRLF and in this case:
> >>    a. make sure we output well formed rfc2822 message (CRLF only)
> >>    b. keep bad newlines as is (more difficult to implement)
> >>    c. randomly convert only some of them (as we do now)
> >>
> >> With regard to #2 we also have to decide whether the choice also 
> >> applies to parsing of Base64 encoded nested rfc822 messages or not.
> >>
> >> Stefano
> >>
> > 
> > Whatever you opt to do _please_ leave a possibility to disable 
> > EOLConvertingInputStream
> 
> I think that we probably will need a configurable options, make sure I'm 
> only trying to understand what RFC asks us to do and what we can do to 
> accomodate real world use case while being RFC compliant.
> 
> If we make it configurable we should find a way to have a simple 
> configuration supporting the non-rfc compliant solution easily.
> But we first have to understand what is the RFC compliant behaviour and 
> what is the preferred non-RFC compliant behaviour. I'm still doubtful 
> about both issues and I hope more people will join this discussion: we 
> need good ideas, here.
> 
> Stefano
> 
> 




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> Stefano Bagnara wrote:
>>
>>
>> I've had a fast read of the RFC2822 about this issue. It insists that 
>> CRLF is the only valid delimiter for a canonical rfc822 message. 
>> Furthermore rfc2822 does not allow the use of isolated CR or LF.
>> So, whenever isolated CR or isolated LF is found we have a malformed 
>> rfc822 message and we have to define how to deal with it.
>>
> 
> Tell the users of IE about it

Can you provide more information? What is the issue in IE? Does it post 
malformed mime? Can you be more precise about which version of IE and 
what kind of malformed sequences are produced?

>> I don't understand why a conversion is wrong for the http case (when 
>> does it happen that you have to deal with isolated LF ?).
> 
> How about binary data in multipart/form encoded requests?

Can you tell me what RFC are we talking about?

>> So we have options:
>> 1) fail parsing anything containing isolated CR or LF chars.
>> 2) parse isolated CR or isolated LF as CRLF and in this case:
>>    a. make sure we output well formed rfc2822 message (CRLF only)
>>    b. keep bad newlines as is (more difficult to implement)
>>    c. randomly convert only some of them (as we do now)
>>
>> With regard to #2 we also have to decide whether the choice also 
>> applies to parsing of Base64 encoded nested rfc822 messages or not.
>>
>> Stefano
>>
> 
> Whatever you opt to do _please_ leave a possibility to disable 
> EOLConvertingInputStream

I think that we will probably need a configurable option; rest assured 
I'm only trying to understand what the RFC asks us to do and what we can 
do to accommodate real-world use cases while being RFC compliant.

If we make it configurable we should find a way to have a simple 
configuration that easily supports the non-RFC-compliant solution.
But we first have to understand what the RFC-compliant behaviour is and 
what the preferred non-RFC-compliant behaviour is. I'm still doubtful 
about both issues and I hope more people will join this discussion: we 
need good ideas here.

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
Stefano Bagnara wrote:
> 
> 
> I've had a fast read of the RFC2822 about this issue. It insists that 
> CRLF is the only valid delimiter for a canonical rfc822 message. 
> Furthermore rfc2822 does not allow the use of isolated CR or LF.
> So, whenever isolated CR or isolated LF is found we have a malformed 
> rfc822 message and we have to define how to deal with it.
> 

Tell the users of IE about it


> I don't understand why a conversion is wrong for the http case (when 
> does it happen that you have to deal with isolated LF ?).
> 

How about binary data in multipart/form encoded requests?


> So we have options:
> 1) fail parsing anything containing isolated CR or LF chars.
> 2) parse isolated CR or isolated LF as CRLF and in this case:
>    a. make sure we output well formed rfc2822 message (CRLF only)
>    b. keep bad newlines as is (more difficult to implement)
>    c. randomly convert only some of them (as we do now)
> 
> With regard to #2 we also have to decide whether the choice also applies 
> to parsing of Base64 encoded nested rfc822 messages or not.
> 
> Stefano
> 

Whatever you opt to do _please_ leave a possibility to disable 
EOLConvertingInputStream

Oleg


> 




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Stefano Bagnara wrote:
> Stefano Bagnara ha scritto:
>> I noticed that at a point in past the EOLConvertingInputStream has 
>> been removed from the chain.
>>
>> I think this create issues when we parse an input file having only \n 
>> and write it in output.
>>
>> - It seems that we parse most of the code only checking for \n (what 
>> does it happen when instead there are only \r? what should we do?)
>>
>> - If the message have only newlines it seems mime4j ends up outputting 
>> headers with CRLF and body with LF.
>>
>> - If the input message have CR ending lines they are not considered by 
>> mime4j.
>>
>> IMHO either we accept LF, CR, and CRLF as CRLF or we only accept CRLF.
>>
>> If we do that we have to take care of encoded nested messages: they 
>> could have again LF, CR and CRLF like the top stream.
>>
>>
>> What is the right approach? Should we add a EOLConvertingInputStream 
>> (CONVERT_BOTH) to every level of parsing or should we fail to parse 
>> messages with bad newlines?
>>
>> I don't like the current behaviour where we accept some malformed data 
>> (LF alone are considered CRLF from our parser), we change some of them 
>> (the one between headers are converted to CRLF) and we still output 
>> malformed data.
>>
>> Opinions?
> 
> I tried this patch and it seems to work fine (even if it breaks one of 
> our core tests that do not expect a CR in an header to be considered a 
> newline):
> 
> Index: src/main/java/org/apache/james/mime4j/MimeEntity.java
> ===================================================================
> --- src/main/java/org/apache/james/mime4j/MimeEntity.java    (revision 
> 677582)
> +++ src/main/java/org/apache/james/mime4j/MimeEntity.java    (working copy)
> @@ -197,7 +197,7 @@
>          InputStream instream;
>          if (MimeUtil.isBase64Encoding(transferEncoding)) {
>              log.debug("base64 encoded message/rfc822 detected");
> -            instream = new Base64InputStream(dataStream);
> +            instream = new EOLConvertingInputStream(new 
> Base64InputStream(dataStream));
>          } else if (MimeUtil.isQuotedPrintableEncoded(transferEncoding)) {
>              log.debug("quoted-printable encoded message/rfc822 detected");
>              instream = new QuotedPrintableInputStream(dataStream);
> Index: src/main/java/org/apache/james/mime4j/MimeTokenStream.java
> ===================================================================
> --- src/main/java/org/apache/james/mime4j/MimeTokenStream.java    
> (revision 676846)
> +++ src/main/java/org/apache/james/mime4j/MimeTokenStream.java    
> (working copy)
> @@ -143,7 +143,7 @@
> 
>      private void doParse(InputStream stream, String contentType) {
>          entities.clear();
> -        rootInputStream = new RootInputStream(stream);
> +        rootInputStream = new RootInputStream(new 
> EOLConvertingInputStream(stream));
>          inbuffer = new BufferedLineReaderInputStream(rootInputStream, 4 
> * 1024);
>          switch (recursionMode) {
>          case M_RAW:
> 
> 
> IIRC the EOLConvertingInputStream was removed because of performance issues.

Oleg reported this on a JIRA issue:
----
Indiscriminate conversion of line delimiters regardless of their 
position within the data stream is plain WRONG. I am still of an opinion 
EOLConvertingInputStream is utterly and helplessly broken, at least for 
MIME content transmitted over HTTP. The change you are proposing makes 
mime4j simply useless for HttpClient and FileUpload

http://marc.info/?l=james-dev&m=121528134811461&w=2
--------

I've had a quick read of RFC2822 about this issue. It insists that 
CRLF is the only valid delimiter for a canonical rfc822 message. 
Furthermore, rfc2822 does not allow the use of isolated CR or LF.
So, whenever an isolated CR or an isolated LF is found we have a malformed 
rfc822 message and we have to define how to deal with it.

I don't understand why a conversion is wrong for the http case (when 
does it happen that you have to deal with an isolated LF?).

So we have options:
1) fail parsing anything containing isolated CR or LF chars.
2) parse isolated CR or isolated LF as CRLF and in this case:
    a. make sure we output well formed rfc2822 message (CRLF only)
    b. keep bad newlines as is (more difficult to implement)
    c. randomly convert only some of them (as we do now)

With regard to #2 we also have to decide whether the choice also applies 
to parsing of Base64 encoded nested rfc822 messages or not.
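
For reference, options 1 and 2a can be sketched in a few lines of Java. This is only an illustration over an in-memory byte array: EolNormalizer is a hypothetical name, and this is not the actual EOLConvertingInputStream implementation, which works on a stream.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper illustrating options 1 and 2a; not mime4j code. */
final class EolNormalizer {

    /** Option 2a: every CRLF, isolated CR, or isolated LF becomes exactly one CRLF. */
    static byte[] toCrlf(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 16);
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            if (b == '\r') {
                out.write('\r');
                out.write('\n');
                if (i + 1 < in.length && in[i + 1] == '\n') {
                    i++;                       // CRLF pair: consume the LF too
                }
            } else if (b == '\n') {
                out.write('\r');
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    /** Option 1: lets a strict parser detect input that it should reject. */
    static boolean hasIsolatedCrOrLf(byte[] in) {
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '\r') {
                if (i + 1 < in.length && in[i + 1] == '\n') {
                    i++;                       // well-formed CRLF
                } else {
                    return true;               // isolated CR
                }
            } else if (in[i] == '\n') {
                return true;                   // isolated LF
            }
        }
        return false;
    }
}
```

Note that toCrlf is exactly the kind of indiscriminate conversion Oleg objects to: applied to a binary body it would corrupt the content, so it is only safe on data known to be textual.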

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Thu, Jul 17, 2008 at 8:09 PM, Oleg Kalnichevski <ol...@apache.org> wrote:
> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski ha scritto:
>> > Stefano Bagnara wrote:
>
> ...
>
> >> > Not only does this change completely revert the performance gains and
> >> > make the whole refactoring exercise completely pointless due to an
> >> > utterly inefficient implementation of EOLConvertingInputStream, it is
> >> > also conceptually wrong (in my humble opinion), as it causes mime4j to
> >> > corrupt 8bit encoded 'application/octet-stream' content. This basically
> >> > renders mime4j incompatible with common browsers and HttpClient
>>
>> The performance of the EOLConvertingInputStream is not important at all
>> if removing it we have an unusable library.
>
> And the last thing. This kind of argument works both ways. The strict
> RFC compliance is not important if we have an unusable library as a
> result.

please let's all step back a little and take a deep breath

- robert

Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> Stefano Bagnara wrote:
>> Oleg Kalnichevski ha scritto:
>>> On Fri, 2008-07-18 at 16:19 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski ha scritto:
>>>>> On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
>>>>>> Oleg Kalnichevski ha scritto:
>>>>>>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>>>>>>>> Oleg Kalnichevski ha scritto:
>>>>>>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>>>>>>>> Oleg Kalnichevski ha scritto:
>>>>>>>>>>> Stefano Bagnara wrote:
>>>>> ...
>>>>>
>>>>>> As I said the strict mode would only be useful to users of mime4j 
>>>>>> wanting to use mime4j as a validator to check RFC compliance. You 
>>>>>> know, mime4j born for SMTP, but now you need it for HTTP and 
>>>>>> someone else may want to do a validator. So let's not keep our 
>>>>>> eyes closed once again.
>>>>>>
>>>>> OK, I fail to see any practical benefit of that aside from a nice warm
>>>>> feeling about being 100% compliant, but I admit I am biased.
>>>>>
>>>>>>> Anyways, let's talk code now. How about this?
>>>>>>>
>>>>>>> (1)
>>>>>>>
>>>>>>> interface LineDelimiterStrategy {
>>>>>>>
>>>>>>>  boolean isNewLine(char ch1, char ch2) // both can be -1
>>>>>>>     throws MimeException;
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> One can provide MimeTokenStream with an implementation of this 
>>>>>>> interface
>>>>>>> at the construction time. MimeTokenStream it its turn passes a
>>>>>>> reference to that class to all parser components that need to 
>>>>>>> deal with
>>>>>>> line delimiters.
>>>>>> I'm not sure I understand what are the 2 params passed to 
>>>>>> isNewLine and what code will invoke this service.
>>>>>>
>>>>> 2 consecutive characters read from the data stream or -1 if any of 
>>>>> those
>>>>> characters is not available. 
>>>> so "a\r\nb" would result in the calls:
>>>> isNewLine(-1,'a');
>>>> isNewLine('a','\r');
>>>> isNewLine('\r','\n');
>>>> isNewLine('\n','b');
>>>> isNewLine('b',-1);
>>>> is this correct? What would be the result for the 5 above from the 
>>>> implementation that will be fine in HTTP?
>>>>
>>>
>>> Anything that allows:
>>>
>>> line delimiter = (LF|CRLF)
>>
>> I understood this, but I'm not following you on how your do this with 
>> the Interface you was proposing.
>> Given your rule you have true on the 3rd and the 4th call? Wouldn't 
>> this result in 2 newlines?
>>
> 
> I do not think so, only a sequence with ch2 = '\n' would be considered a 
> valid line delimiter. I realized, though, the problem with this 
> interface is that it implied a one byte read I had thought we wanted to 
> get rid of.

I understand it now, thank you!

>>>>>>> (2) The issue of CR / LF handling in content bodies should be 
>>>>>>> taken of
>>>>>>> when formatting output, _not_ when parsing input.
>>>>>>>
>>>>>>> Would that work for you?
>>>>>> I'm not sure this is enough.
>>>>>> In output we format what we parser: if we parsed the input as 
>>>>>> multiple lines then we output multiple lines, otherwise we output 
>>>>>> a single line. So it is during parsing that we have to decide 
>>>>>> whether an isolated LF is a newline delimiter or not.
>>>>> But mime4j does not parse _content bodies_ as multiple lines, does it?
>>>> TextBody.getReader()
>>>>
>>>
>>> But that does not necessarily imply parsing into multiple lines, does
>>> it? Anyways, I am perfectly fine with TextBody automatically converting
>>> line delimiters. IMHO this is the right place to do the conversion, but
>>> not the MimeTokenStream
>>
>> You are right, the Reader does not imply line parsing, but anyway 
>> somewhere we have to deal with lines.
>> Mime4J basic classes (the whole LineReaderInputStream hierarchy) have 
>> indeed a readLine method. This just made me realize that the internal 
>> buffer is filled with lines and that sending a very long binary make 
>> mime4j die with OOM.
> 
> No, it would not. Binary content is not read line by line. The #readLine 
> method is only used when parsing metadata (header fields), where we do 
> need to put a cap on the max line length, as discussed before.

My mistake: my code was casting to LineReaderInputStream and using readLine 
to get the content, but the method actually gave me back only an 
InputStream, so there is no way to trigger the OOM without a cast.

We really do need the line length limit, though: a random sequence of 
non-LF chars currently makes our code throw an OOM.
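
A minimal sketch of such a limit, under stated assumptions: the class BoundedLineReader and its readLine signature are hypothetical and are not the LineReaderInputStream API. It reads one byte at a time purely for clarity, which a real implementation would avoid. The point is only that the buffer stops growing once the cap is reached.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of mime4j.
final class BoundedLineReader {
    // Reads bytes up to and including the next LF, or to end of stream.
    // Returns null at end of stream. Throws as soon as the line would
    // exceed maxLen bytes instead of buffering without limit.
    static byte[] readLine(InputStream in, int maxLen) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (line.size() >= maxLen) {
                throw new IOException("Maximum line length exceeded: " + maxLen);
            }
            line.write(b);
            if (b == '\n') {
                break;
            }
        }
        return line.size() == 0 ? null : line.toByteArray();
    }
}
```

With a cap in place, a long run of non-LF bytes fails fast with an exception rather than exhausting the heap.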

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
>> On Fri, 2008-07-18 at 16:19 +0200, Stefano Bagnara wrote:
>>> Oleg Kalnichevski wrote:
>>>> On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
>>>>> Oleg Kalnichevski wrote:
>>>>>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>>>>>>> Oleg Kalnichevski wrote:
>>>>>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>>>>>>> Oleg Kalnichevski wrote:
>>>>>>>>>> Stefano Bagnara wrote:
>>>> ...
>>>>
>>>>> As I said the strict mode would only be useful to users of mime4j 
>>>>> wanting to use mime4j as a validator to check RFC compliance. You 
>>>>> know, mime4j born for SMTP, but now you need it for HTTP and 
>>>>> someone else may want to do a validator. So let's not keep our eyes 
>>>>> closed once again.
>>>>>
>>>> OK, I fail to see any practical benefit of that aside from a nice warm
>>>> feeling about being 100% compliant, but I admit I am biased.
>>>>
>>>>>> Anyways, let's talk code now. How about this?
>>>>>>
>>>>>> (1)
>>>>>>
>>>>>> interface LineDelimiterStrategy {
>>>>>>
>>>>>>  boolean isNewLine(char ch1, char ch2) // both can be -1
>>>>>>     throws MimeException;
>>>>>>
>>>>>> }
>>>>>>
>>>>>> One can provide MimeTokenStream with an implementation of this 
>>>>>> interface
>>>>>> at the construction time. MimeTokenStream it its turn passes a
>>>>>> reference to that class to all parser components that need to deal 
>>>>>> with
>>>>>> line delimiters.
>>>>> I'm not sure I understand what are the 2 params passed to isNewLine 
>>>>> and what code will invoke this service.
>>>>>
>>>> 2 consecutive characters read from the data stream or -1 if any of 
>>>> those
>>>> characters is not available. 
>>> so "a\r\nb" would result in the calls:
>>> isNewLine(-1,'a');
>>> isNewLine('a','\r');
>>> isNewLine('\r','\n');
>>> isNewLine('\n','b');
>>> isNewLine('b',-1);
>>> is this correct? What would be the result for the 5 above from the 
>>> implementation that will be fine in HTTP?
>>>
>>
>> Anything that allows:
>>
>> line delimiter = (LF|CRLF)
> 
> I understood this, but I'm not following you on how your do this with 
> the Interface you was proposing.
> Given your rule you have true on the 3rd and the 4th call? Wouldn't this 
> result in 2 newlines?
> 

I do not think so: only a sequence with ch2 = '\n' would be considered a 
valid line delimiter. I realized, though, that the problem with this 
interface is that it implies reading one byte at a time, which I thought 
we wanted to get rid of.
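
To illustrate the answer, the sketch below replays the five calls listed in the quoted question under the rule that a pair counts as a delimiter only when its second character is LF. The class DelimiterTrace and its methods are hypothetical, and the characters are passed as int so that -1 can mark a missing character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of mime4j.
final class DelimiterTrace {
    // A pair ends a line only if ch2 is LF; ch1 may be CR or anything else.
    static boolean isNewLine(int ch1, int ch2) {
        return ch2 == '\n';
    }

    // Slides a two-character window over the input, using -1 before the
    // first and after the last character, and records each result.
    static List<Boolean> trace(String input) {
        List<Boolean> results = new ArrayList<>();
        int prev = -1;
        for (int i = 0; i <= input.length(); i++) {
            int cur = i < input.length() ? input.charAt(i) : -1;
            results.add(isNewLine(prev, cur));
            prev = cur;
        }
        return results;
    }
}
```

For "a\r\nb" only the third call, isNewLine('\r','\n'), returns true, so the CRLF pair yields one line break rather than two.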


>>>>>> (2) The issue of CR / LF handling in content bodies should be 
>>>>>> taken of
>>>>>> when formatting output, _not_ when parsing input.
>>>>>>
>>>>>> Would that work for you?
>>>>> I'm not sure this is enough.
>>>>> In output we format what we parser: if we parsed the input as 
>>>>> multiple lines then we output multiple lines, otherwise we output a 
>>>>> single line. So it is during parsing that we have to decide whether 
>>>>> an isolated LF is a newline delimiter or not.
>>>> But mime4j does not parse _content bodies_ as multiple lines, does it?
>>> TextBody.getReader()
>>>
>>
>> But that does not necessarily imply parsing into multiple lines, does
>> it? Anyways, I am perfectly fine with TextBody automatically converting
>> line delimiters. IMHO this is the right place to do the conversion, but
>> not the MimeTokenStream
> 
> You are right, the Reader does not imply line parsing, but anyway 
> somewhere we have to deal with lines.
> Mime4J basic classes (the whole LineReaderInputStream hierarchy) have 
> indeed a readLine method. This just made me realize that the internal 
> buffer is filled with lines and that sending a very long binary make 
> mime4j die with OOM.

No, it would not. Binary content is not read line by line. The #readLine 
method is only used when parsing metadata (header fields), where we do 
need to put a cap on the max line length, as discussed before.

Oleg


> We can fix this OOM during standard parsing by
> having an hard limit on the size (and throwing exception otherwise) but 
> we have to do this differently during the streaming of "binary" encoded 
> parts (line reading makes no sense there).
> 
> Furthermore, at the very minimum we have a RootInputStream only counting 
> lines if they are CRLF terminated. It seems weird that we count lines 
> only if their are CRLF terminated but we recognize them also if they are 
> LF ending (this is one more issue to be taken in consideration, not the 
> one we was talking about).
> 
> Stefano
> 




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> On Fri, 2008-07-18 at 16:19 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>>>>>> Oleg Kalnichevski wrote:
>>>>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>>>>>> Oleg Kalnichevski wrote:
>>>>>>>>> Stefano Bagnara wrote:
>>> ...
>>>
>>>> As I said the strict mode would only be useful to users of mime4j 
>>>> wanting to use mime4j as a validator to check RFC compliance. You know, 
>>>> mime4j born for SMTP, but now you need it for HTTP and someone else may 
>>>> want to do a validator. So let's not keep our eyes closed once again.
>>>>
>>> OK, I fail to see any practical benefit of that aside from a nice warm
>>> feeling about being 100% compliant, but I admit I am biased.
>>>
>>>>> Anyways, let's talk code now. How about this?
>>>>>
>>>>> (1)
>>>>>
>>>>> interface LineDelimiterStrategy {
>>>>>
>>>>>  boolean isNewLine(char ch1, char ch2) // both can be -1
>>>>> 	throws MimeException;
>>>>>
>>>>> }
>>>>>
>>>>> One can provide MimeTokenStream with an implementation of this interface
>>>>> at the construction time. MimeTokenStream it its turn passes a
>>>>> reference to that class to all parser components that need to deal with
>>>>> line delimiters.
>>>> I'm not sure I understand what are the 2 params passed to isNewLine and 
>>>> what code will invoke this service.
>>>>
>>> 2 consecutive characters read from the data stream or -1 if any of those
>>> characters is not available. 
>> so "a\r\nb" would result in the calls:
>> isNewLine(-1,'a');
>> isNewLine('a','\r');
>> isNewLine('\r','\n');
>> isNewLine('\n','b');
>> isNewLine('b',-1);
>> is this correct? What would be the result for the 5 above from the 
>> implementation that will be fine in HTTP?
>>
> 
> Anything that allows:
> 
> line delimiter = (LF|CRLF)

I understood this, but I don't follow how you would do it with the 
interface you were proposing.
Given your rule, would you return true on both the 3rd and the 4th call? 
Wouldn't this result in 2 newlines?

>>>>> (2) The issue of CR / LF handling in content bodies should be taken of
>>>>> when formatting output, _not_ when parsing input.
>>>>>
>>>>> Would that work for you?
>>>> I'm not sure this is enough.
>>>> In output we format what we parser: if we parsed the input as multiple 
>>>> lines then we output multiple lines, otherwise we output a single line. 
>>>> So it is during parsing that we have to decide whether an isolated LF is 
>>>> a newline delimiter or not.
>>> But mime4j does not parse _content bodies_ as multiple lines, does it?
>> TextBody.getReader()
>>
> 
> But that does not necessarily imply parsing into multiple lines, does
> it? Anyways, I am perfectly fine with TextBody automatically converting
> line delimiters. IMHO this is the right place to do the conversion, but
> not the MimeTokenStream

You are right, the Reader does not imply line parsing, but we still have 
to deal with lines somewhere.
The basic mime4j classes (the whole LineReaderInputStream hierarchy) do 
indeed have a readLine method. This just made me realize that the internal 
buffer is filled line by line, so sending a very long binary makes 
mime4j die with an OOM. We can fix this OOM during standard parsing by 
imposing a hard limit on the size (and throwing an exception when it is 
exceeded), but we have to handle it differently when streaming "binary" 
encoded parts (line reading makes no sense there).

Furthermore, at the very least we have a RootInputStream that counts 
lines only if they are CRLF terminated. It seems odd that we count lines 
only if they are CRLF terminated but also recognize them when they end 
with a bare LF (this is one more issue to take into consideration, 
separate from the one we were talking about).
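
The mismatch can be seen in a small sketch. This is only an illustration under assumptions: LineCounter and its two methods are hypothetical and do not reproduce the RootInputStream code; they just contrast the two counting rules described above.

```java
// Hypothetical illustration, not part of mime4j.
final class LineCounter {
    // Counts only line ends written as CRLF.
    static int countCrlfOnly(byte[] data) {
        int lines = 0;
        for (int i = 1; i < data.length; i++) {
            if (data[i - 1] == '\r' && data[i] == '\n') {
                lines++;
            }
        }
        return lines;
    }

    // Counts every LF, so both CRLF and bare LF line ends are counted once.
    static int countLfOrCrlf(byte[] data) {
        int lines = 0;
        for (byte b : data) {
            if (b == '\n') {
                lines++;
            }
        }
        return lines;
    }
}
```

On input with mixed line endings the two counts differ, which is the inconsistency between what the parser recognizes as a line and what gets counted.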

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Fri, 2008-07-18 at 16:19 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
> >> Oleg Kalnichevski wrote:
> >>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
> >>>> Oleg Kalnichevski wrote:
> >>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> >>>>>> Oleg Kalnichevski wrote:
> >>>>>>> Stefano Bagnara wrote:
> > 
> > ...
> > 
> >> As I said the strict mode would only be useful to users of mime4j 
> >> wanting to use mime4j as a validator to check RFC compliance. You know, 
> >> mime4j born for SMTP, but now you need it for HTTP and someone else may 
> >> want to do a validator. So let's not keep our eyes closed once again.
> >>
> > 
> > OK, I fail to see any practical benefit of that aside from a nice warm
> > feeling about being 100% compliant, but I admit I am biased.
> > 
> >>> Anyways, let's talk code now. How about this?
> >>>
> >>> (1)
> >>>
> >>> interface LineDelimiterStrategy {
> >>>
> >>>  boolean isNewLine(char ch1, char ch2) // both can be -1
> >>> 	throws MimeException;
> >>>
> >>> }
> >>>
> >>> One can provide MimeTokenStream with an implementation of this interface
> >>> at the construction time. MimeTokenStream it its turn passes a
> >>> reference to that class to all parser components that need to deal with
> >>> line delimiters.
> >> I'm not sure I understand what are the 2 params passed to isNewLine and 
> >> what code will invoke this service.
> >>
> > 
> > 2 consecutive characters read from the data stream or -1 if any of those
> > characters is not available. 
> 
> so "a\r\nb" would result in the calls:
> isNewLine(-1,'a');
> isNewLine('a','\r');
> isNewLine('\r','\n');
> isNewLine('\n','b');
> isNewLine('b',-1);
> is this correct? What would be the result for the 5 above from the 
> implementation that will be fine in HTTP?
> 

Anything that allows:

line delimiter = (LF|CRLF)


> >>> (2) The issue of CR / LF handling in content bodies should be taken of
> >>> when formatting output, _not_ when parsing input.
> >>>
> >>> Would that work for you?
> >> I'm not sure this is enough.
> >> In output we format what we parser: if we parsed the input as multiple 
> >> lines then we output multiple lines, otherwise we output a single line. 
> >> So it is during parsing that we have to decide whether an isolated LF is 
> >> a newline delimiter or not.
> > 
> > But mime4j does not parse _content bodies_ as multiple lines, does it?
> 
> TextBody.getReader()
> 

But that does not necessarily imply parsing into multiple lines, does
it? Anyway, I am perfectly fine with TextBody automatically converting
line delimiters. IMHO that is the right place to do the conversion,
not the MimeTokenStream.

> > At this point I think I have to give up. Whatever you end up doing
> > _please_ do not wrap the raw data stream with EOLConvertingInputStream.
> 
> Sure, I already excluded this: I now understand the "C-T-E: binary" issue.
> BTW I hope you will keep monitoring this issue so you can confirm 
> whatever solution we propose will be fine with your library?
> 

Sure.

Oleg


> Thank you,
> Stefano
> 
> 




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>>>> Oleg Kalnichevski wrote:
>>>>>>> Stefano Bagnara wrote:
> 
> ...
> 
>> As I said the strict mode would only be useful to users of mime4j 
>> wanting to use mime4j as a validator to check RFC compliance. You know, 
>> mime4j born for SMTP, but now you need it for HTTP and someone else may 
>> want to do a validator. So let's not keep our eyes closed once again.
>>
> 
> OK, I fail to see any practical benefit of that aside from a nice warm
> feeling about being 100% compliant, but I admit I am biased.
> 
>>> Anyways, let's talk code now. How about this?
>>>
>>> (1)
>>>
>>> interface LineDelimiterStrategy {
>>>
>>>  boolean isNewLine(char ch1, char ch2) // both can be -1
>>> 	throws MimeException;
>>>
>>> }
>>>
>>> One can provide MimeTokenStream with an implementation of this interface
>>> at the construction time. MimeTokenStream it its turn passes a
>>> reference to that class to all parser components that need to deal with
>>> line delimiters.
>> I'm not sure I understand what are the 2 params passed to isNewLine and 
>> what code will invoke this service.
>>
> 
> 2 consecutive characters read from the data stream or -1 if any of those
> characters is not available. 

so "a\r\nb" would result in the calls:
isNewLine(-1,'a');
isNewLine('a','\r');
isNewLine('\r','\n');
isNewLine('\n','b');
isNewLine('b',-1);
Is this correct? What would the five calls above return in an 
implementation that works for HTTP?

>>> (2) The issue of CR / LF handling in content bodies should be taken of
>>> when formatting output, _not_ when parsing input.
>>>
>>> Would that work for you?
>> I'm not sure this is enough.
>> In output we format what we parser: if we parsed the input as multiple 
>> lines then we output multiple lines, otherwise we output a single line. 
>> So it is during parsing that we have to decide whether an isolated LF is 
>> a newline delimiter or not.
> 
> But mime4j does not parse _content bodies_ as multiple lines, does it?

TextBody.getReader()

> At this point I think I have to give up. Whatever you end up doing
> _please_ do not wrap the raw data stream with EOLConvertingInputStream.

Sure, I have already ruled this out: I now understand the "C-T-E: binary" issue.
BTW, I hope you will keep following this issue so you can confirm that 
whatever solution we propose works for your library.

Thank you,
Stefano




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Fri, 2008-07-18 at 14:45 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
> >> Oleg Kalnichevski wrote:
> >>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> >>>> Oleg Kalnichevski wrote:
> >>>>> Stefano Bagnara wrote:
> > 

...

> As I said the strict mode would only be useful to users of mime4j 
> wanting to use mime4j as a validator to check RFC compliance. You know, 
> mime4j born for SMTP, but now you need it for HTTP and someone else may 
> want to do a validator. So let's not keep our eyes closed once again.
> 

OK, I fail to see any practical benefit of that aside from a nice warm
feeling about being 100% compliant, but I admit I am biased.

> > Anyways, let's talk code now. How about this?
> > 
> > (1)
> > 
> > interface LineDelimiterStrategy {
> > 
> >  boolean isNewLine(char ch1, char ch2) // both can be -1
> > 	throws MimeException;
> > 
> > }
> > 
> > One can provide MimeTokenStream with an implementation of this interface
> > at the construction time. MimeTokenStream it its turn passes a
> > reference to that class to all parser components that need to deal with
> > line delimiters.
> 
> I'm not sure I understand what are the 2 params passed to isNewLine and 
> what code will invoke this service.
> 

Two consecutive characters read from the data stream, with -1 standing in
for any character that is not available.


> > (2) The issue of CR / LF handling in content bodies should be taken of
> > when formatting output, _not_ when parsing input.
> > 
> > Would that work for you?
> 
> I'm not sure this is enough.
> In output we format what we parser: if we parsed the input as multiple 
> lines then we output multiple lines, otherwise we output a single line. 
> So it is during parsing that we have to decide whether an isolated LF is 
> a newline delimiter or not.

But mime4j does not parse _content bodies_ as multiple lines, does it?

At this point I think I have to give up. Whatever you end up doing
_please_ do not wrap the raw data stream with EOLConvertingInputStream.

Cheers

Oleg 




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> Stefano Bagnara wrote:
> 
> ...
> 
>> E.g: I'm slowly coming to a possible proposal about parsing.
>> - strict mode: no conversion is done, a CR or LF in headers (or other 
>> non 7bit content) make mime4j fail parsing.
>> - permissive modes:
>>    - default binary: no conversion happen, isolated CR and LF are 
>> accepted everywhere but not considered newlines (as like as other 8bit 
>> bytes), the default content-transfer-encoding is "binary" when not 
>> specified (7bit, 8bit and binary are read as binary).
>>    - default text: we convert isolated CR and LF to CRLF almost 
>> everywhere but in "binary" content-transfer-encoding parts.
>> I'm not proposing this yet (not sure this is enough and we don't need 
>> more granular tweakings), but this is something I'm evaluating right 
>> now... The strict mode is desiderable to have, but less important than 
>> the permissive parsing (we want to be strict in output, not in input). 
>> OTOH someone may want to use mime4j for validating if a content is 
>> wellformed or not (wrt RFC) and in this case a strict mode would be 
>> necessary.
>>
>> Stefano
>>
> 
> Stefano,
> 
> With all due respect but I see strict handling of line delimiters as
> _pointless_ orthodoxy that really does not help anyone. Would you really
> ship an application to a client of yours that rejects a message as
> invalid because it contains a lone LF in it? So what is the _point_ of
> being strict about line delimiters?

As I said, the strict mode would only be useful to people who want to 
use mime4j as a validator to check RFC compliance. You know, mime4j was 
born for SMTP, but now you need it for HTTP, and someone else may want to 
build a validator. So let's not keep our eyes closed once again.

> Anyways, let's talk code now. How about this?
> 
> (1)
> 
> interface LineDelimiterStrategy {
> 
>  boolean isNewLine(char ch1, char ch2) // both can be -1
> 	throws MimeException;
> 
> }
> 
> One can provide MimeTokenStream with an implementation of this interface
> at the construction time. MimeTokenStream it its turn passes a
> reference to that class to all parser components that need to deal with
> line delimiters.

I'm not sure I understand what the two params passed to isNewLine are, 
or which code would invoke this service.

> (2) The issue of CR / LF handling in content bodies should be taken of
> when formatting output, _not_ when parsing input.
> 
> Would that work for you?

I'm not sure this is enough.
On output we format what we parsed: if we parsed the input as multiple 
lines then we output multiple lines, otherwise we output a single line. 
So it is during parsing that we have to decide whether an isolated LF is 
a line delimiter or not.
This issue is closely related to charset handling: when you read content 
you have to deal with the charset during parsing; you cannot do that 
while formatting output. So if you find something invalid for that 
charset, you have to deal with it during parsing, not during output 
formatting.

I think this document (an excerpt from RFC 1521) is key to forming an 
opinion about the best approach: 
http://www.math-inf.uni-greifswald.de/~teumer/mime/1521/Appendix_G.html

I hope we also get opinions from other contributors, so that we have 
multiple "interpretations" of what the best thing to do is.

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Fri, 2008-07-18 at 10:58 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> >> Oleg Kalnichevski wrote:
> >>> Stefano Bagnara wrote:
> > 

...

> E.g: I'm slowly coming to a possible proposal about parsing.
> - strict mode: no conversion is done, a CR or LF in headers (or other 
> non 7bit content) make mime4j fail parsing.
> - permissive modes:
>    - default binary: no conversion happen, isolated CR and LF are 
> accepted everywhere but not considered newlines (as like as other 8bit 
> bytes), the default content-transfer-encoding is "binary" when not 
> specified (7bit, 8bit and binary are read as binary).
>    - default text: we convert isolated CR and LF to CRLF almost 
> everywhere but in "binary" content-transfer-encoding parts.
> I'm not proposing this yet (not sure this is enough and we don't need 
> more granular tweakings), but this is something I'm evaluating right 
> now... The strict mode is desiderable to have, but less important than 
> the permissive parsing (we want to be strict in output, not in input). 
> OTOH someone may want to use mime4j for validating if a content is 
> wellformed or not (wrt RFC) and in this case a strict mode would be 
> necessary.
> 
> Stefano
> 

Stefano,

With all due respect, I see strict handling of line delimiters as
_pointless_ orthodoxy that really does not help anyone. Would you really
ship an application to a client of yours that rejects a message as
invalid because it contains a lone LF? So what is the _point_ of
being strict about line delimiters?

Anyways, let's talk code now. How about this?

(1)

interface LineDelimiterStrategy {

 boolean isNewLine(char ch1, char ch2) // both can be -1
	throws MimeException;

}

One can provide MimeTokenStream with an implementation of this interface
at construction time. MimeTokenStream in its turn passes a
reference to that implementation to all parser components that need to
deal with line delimiters.

(2) The issue of CR / LF handling in content bodies should be taken care of
when formatting output, _not_ when parsing input.

Would that work for you?
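
To make proposal (1) concrete, here is a minimal sketch of what two implementations might look like. It rests on assumptions and is not mime4j code: the parameters are declared as int rather than char so that -1 can actually be passed (a Java char cannot hold -1), the MimeException clause is left out, and the class names LenientLineDelimiterStrategy and StrictLineDelimiterStrategy are hypothetical.

```java
// Hypothetical sketch of the proposed strategy; not part of mime4j.
interface LineDelimiterStrategy {
    // ch1 and ch2 are two consecutive characters from the stream,
    // or -1 where no character is available (hence int, not char).
    boolean isNewLine(int ch1, int ch2);
}

// Lenient: a line ends at LF, whether or not a CR precedes it (LF | CRLF).
final class LenientLineDelimiterStrategy implements LineDelimiterStrategy {
    public boolean isNewLine(int ch1, int ch2) {
        return ch2 == '\n';
    }
}

// Strict: only the CRLF pair ends a line.
final class StrictLineDelimiterStrategy implements LineDelimiterStrategy {
    public boolean isNewLine(int ch1, int ch2) {
        return ch1 == '\r' && ch2 == '\n';
    }
}
```

The design choice is that the parser never hard-codes a delimiter rule; it asks the strategy, so a lenient HTTP-style setup and a strict validator could share the same parsing code.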

Oleg




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> Stefano Bagnara wrote:
> 
> ...
> 
>>> Not only does this change completely reverts the performance gains and 
>>> makes the whole refactroring exercise completely pointless due to an 
>>> utterly inefficient implementation of EOLConvertingInputStream, it is 
>>> also conceptually wrong (in my humble opinion), as it causes mime4j to 
>>> corrupt 8bit encoded 'application/octet-stream' content. This basically 
>>> renders mime4j incompatible with commons browsers and HttpClient
>> The performance of the EOLConvertingInputStream is not important at all 
>> if removing it we have an unusable library. 
> 
> And the last thing. This kind of argument works both ways. The strict
> RFC compliance is not important if we have an unusable library as a
> result.

Oleg, I agree with you! I'm well aware of this.
I think this discussion is slowly giving us the knowledge we need to 
judge the right compromise between strict behaviour, permissive 
interoperability and compliance.

Most of the time there is no need to be non-compliant to support 
permissive interoperability; we just need to be less strict.

I hope you understand that I'm not fighting your patch/changes, and I'm 
even further from fighting you (in fact I like you because you provide 
code and not complaints!). I want to make sure that we do the right thing 
because we understand it, or, if we do the wrong thing, that we 
understand what we are doing and agree that it is acceptable to us even 
though it is wrong.

E.g., I'm slowly coming to a possible proposal about parsing.
- strict mode: no conversion is done; a CR or LF in headers (or other 
non-7bit content) makes mime4j fail parsing.
- permissive modes:
   - default binary: no conversion happens; isolated CR and LF are 
accepted everywhere but not considered newlines (just like other 8bit 
bytes), and the default content-transfer-encoding is "binary" when not 
specified (7bit, 8bit and binary are read as binary).
   - default text: we convert isolated CR and LF to CRLF almost 
everywhere except in "binary" content-transfer-encoding parts.
I'm not proposing this yet (I'm not sure it is enough or whether we need 
more granular tweaks), but it is something I'm evaluating right 
now... The strict mode is desirable to have, but less important than 
the permissive parsing (we want to be strict in output, not in input). 
OTOH someone may want to use mime4j to validate whether content is 
well-formed (wrt the RFCs), and in that case a strict mode would be 
necessary.
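
The conversion step of the "default text" mode could be sketched roughly as follows. This is an assumption-based illustration, not the EOLConvertingInputStream implementation: the class EolNormalizer and its toCrlf method are hypothetical, it works on a whole byte array rather than a stream, and the caller is assumed to skip it for "binary" parts.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of mime4j.
final class EolNormalizer {
    // Rewrites CR, LF and CRLF line endings as CRLF. Intended only for
    // textual parts; "binary" parts must be passed through unchanged.
    static byte[] toCrlf(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length + 16);
        for (int i = 0; i < input.length; i++) {
            byte b = input[i];
            if (b == '\r') {
                // Swallow the LF of an existing CRLF pair so it is not doubled.
                if (i + 1 < input.length && input[i + 1] == '\n') {
                    i++;
                }
                out.write('\r');
                out.write('\n');
            } else if (b == '\n') {
                out.write('\r');
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

An existing CRLF is left as a single CRLF, while a lone CR or a lone LF each becomes one CRLF.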

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > Stefano Bagnara wrote:

...

> > Not only does this change completely reverts the performance gains and 
> > makes the whole refactroring exercise completely pointless due to an 
> > utterly inefficient implementation of EOLConvertingInputStream, it is 
> > also conceptually wrong (in my humble opinion), as it causes mime4j to 
> > corrupt 8bit encoded 'application/octet-stream' content. This basically 
> > renders mime4j incompatible with commons browsers and HttpClient
> 
> The performance of the EOLConvertingInputStream is not important at all 
> if removing it we have an unusable library. 

And the last thing. This kind of argument works both ways. The strict
RFC compliance is not important if we have an unusable library as a
result.

Oleg




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> On Fri, 2008-07-18 at 09:56 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>>>> Oleg Kalnichevski wrote:
>>>>> Not only does this change completely reverts the performance gains and 
>>>>> makes the whole refactroring exercise completely pointless due to an 
>>>>> utterly inefficient implementation of EOLConvertingInputStream, it is 
>>>>> also conceptually wrong (in my humble opinion), as it causes mime4j to 
>>>>> corrupt 8bit encoded 'application/octet-stream' content. This basically 
>>>>> renders mime4j incompatible with commons browsers and HttpClient
>>>> The performance of the EOLConvertingInputStream is not important at all 
>>>> if removing it we have an unusable library. So let's talk about what we 
>>>> expect from the library, then we'll discuss how to make it performant. I 
>>>> believe we have technical skills to make a performant EOLConverting stream.
>>>>
>>>> About the 8bit encoded 'application/octet-stream' I think we just need 
>>>> to find the right RFC telling us what we have to do: the RFC I read 
>>>> about MIME and its applications always tell that CR and LF must not be 
>>>> alone and that the appropriate transfer encoding have to be used in 
>>>> order to avoid isolated LF and CR: it is not a matter of personal 
>>>> preferences, it is a matter of rfc compliance. Let's find the docs, first.
>>>>
>>>> What I can find as definition of "8bit" (RFC-2045 Section 2.8) is:
>>>> -------------------
>>>> "8bit data" refers to data that is all represented as relatively
>>>> short lines with 998 octets or less between CRLF line separation
>>>> sequences [RFC-821]), but octets with decimal values greater than 127
>>>> may be used.  As with "7bit data" CR and LF octets only occur as part
>>>> of CRLF line separation sequences and no NULs are allowed.
>>>> -------------------
>>>>
>>> Stefano,
>>>
>>> You are very welcome to impose whatever strict interpretation of the
>>> relevant RFCs your heart desires. Just please leave an option
>>> to override it so that the mime4j parser can be used to parse
>>> real-world content.
>> Oleg, don't take me wrong. I simply want to make sure we all understand 
>> what RFC say and understand the specific cases we are ignoring it and WHY.
>>
>> In the case of outer boundary we introduced backward compatibility 
>> issues in the name of performance mainly because of lack of knowledge of 
>> the RFCs. I'm not an expert, too, but I think it is important to at 
>> least take them into consideration once we find the right docs.
>>
>> I'm not saying that we MUST be 100% compliant and strict, but I want to 
>> make sure we know when we are doing something not compliant and that we 
>> agree that it is good.
>>
>> One of the main goal is interoperability, so everytime we do something 
>> different from what RFC tell us we have to make sure we are not breaking 
>> interoperability with other RFC compliant tools.
>>
>> I'm far from being a MIME expert, so I find it difficult to keep up with 
>> this discussion if I have to convince people of something. I just want 
>> to share my little knowledge about the (mainly SMTP related) RFCs.
>>
>> Stefano
>>
> 
> Stefano,
> 
> The core of this issue is not about standards compliance. I am fine with
> mime4j being strict in its interpretation of relevant RFCs by default.
> However, the idea of _indiscriminate_ conversion of line delimiters
> regardless of where they occur in the data stream seems _very_, _very_
> __conceptually__ wrong to me.
> 
> I can't help feeling that Ayatollah-style orthodoxy about line
> delimiter handling just does not really help anyone. Fortunately for
> JAMES, MTAs and MUAs are too complex to be written by complete muppets.
> We do not have that privilege in the HTTP world, where one has no other
> choice but to interoperate with tons of HTTP agents and CGI scripts
> written with a complete disregard for standards. So, in the
> HttpComponents project we have a very simple policy: be lenient about
> parsing, be strict about formatting. That seems to work well for _us_.
> 
> Oleg

"be lenient about parsing, be strict about formatting" is exactly what 
the JAMES PMC agreed on in the guidelines.

Ideally no conversion would be done at all, but we want to be lenient, 
so we do some conversion to support non-compliant agents. I also agree 
that conversion may not be appropriate in every case, and that's why we 
are discussing it. It is not worth continuing to discuss this issue at 
such a high level; I would rather keep our focus on real solutions.

In fact, after reading a lot of RFCs today, I think that what you get 
from HTTP is perfectly standard behaviour. I'm not sure whether those 
messages are missing the "Content-Transfer-Encoding: binary" header or 
whether some RFC defines binary as the default for HTTP, but I found 
RFC 1867 saying that it is common to use the binary transfer encoding 
in multipart/form-data MIME parts in HTTP. So what you want is probably 
what the RFCs ask us to implement, but we should first understand 
things and then do things ;-)
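
To make the point about transfer encodings concrete, here is a minimal 
sketch, not mime4j code, of deciding from the declared 
Content-Transfer-Encoding whether line-ending conversion may be applied 
to a body at all. The class and method names are hypothetical. Treating 
a missing header as "7bit" follows the RFC 2045 default for mail; 
whether HTTP form data should default to binary instead is exactly the 
open question above.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of mime4j. */
public class EolPolicy {

    /**
     * Returns true if bare CR/LF in a body with this transfer encoding may
     * be rewritten to CRLF. Only "binary" bodies are treated as opaque here;
     * that choice is an assumption of this sketch, not a rule from an RFC.
     */
    public static boolean mayConvertLineEndings(String transferEncoding) {
        // Assumption: a missing header means "7bit" (the RFC 2045 default).
        String enc = (transferEncoding == null)
                ? "7bit"
                : transferEncoding.trim().toLowerCase(Locale.ROOT);
        return !enc.equals("binary");
    }
}
```

With something like this, a binary 'application/octet-stream' part 
would pass through untouched while text parts could still be 
normalised.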

And rest assured that the same issues you find with HTTP clients also 
exist with MUAs and MTAs. Muppets are everywhere, and we care about the 
RFCs so much precisely because we don't want other people to call us 
muppets ;-)

Please read the other messages I posted in this thread today, because I 
think they are more concrete and constructive than this branch of the 
thread.

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Fri, 2008-07-18 at 09:56 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> >> Oleg Kalnichevski wrote:
> >>> Not only does this change completely revert the performance gains and 
> >>> make the whole refactoring exercise completely pointless due to an 
> >>> utterly inefficient implementation of EOLConvertingInputStream, it is 
> >>> also conceptually wrong (in my humble opinion), as it causes mime4j to 
> >>> corrupt 8bit encoded 'application/octet-stream' content. This basically 
> >>> renders mime4j incompatible with common browsers and HttpClient.
> >> The performance of the EOLConvertingInputStream is not important at all 
> >> if removing it we have an unusable library. So let's talk about what we 
> >> expect from the library, then we'll discuss how to make it performant. I 
> >> believe we have technical skills to make a performant EOLConverting stream.
> >>
> >> About the 8bit encoded 'application/octet-stream' I think we just need 
> >> to find the right RFC telling us what we have to do: the RFC I read 
> >> about MIME and its applications always tell that CR and LF must not be 
> >> alone and that the appropriate transfer encoding have to be used in 
> >> order to avoid isolated LF and CR: it is not a matter of personal 
> >> preferences, it is a matter of rfc compliance. Let's find the docs, first.
> >>
> >> What I can find as definition of "8bit" (RFC-2045 Section 2.8) is:
> >> -------------------
> >> "8bit data" refers to data that is all represented as relatively
> >> short lines with 998 octets or less between CRLF line separation
> >> sequences [RFC-821]), but octets with decimal values greater than 127
> >> may be used.  As with "7bit data" CR and LF octets only occur as part
> >> of CRLF line separation sequences and no NULs are allowed.
> >> -------------------
> >>
> > 
> > Stefano,
> > 
> > You are very welcome to impose whatever strict interpretation of the
> > relevant RFCs your heart desires. Just please leave an option
> > to override it so that the mime4j parser can be used to parse
> > real-world content.
> 
> Oleg, don't take me wrong. I simply want to make sure we all understand 
> what RFC say and understand the specific cases we are ignoring it and WHY.
> 
> In the case of outer boundary we introduced backward compatibility 
> issues in the name of performance mainly because of lack of knowledge of 
> the RFCs. I'm not an expert, too, but I think it is important to at 
> least take them into consideration once we find the right docs.
> 
> I'm not saying that we MUST be 100% compliant and strict, but I want to 
> make sure we know when we are doing something not compliant and that we 
> agree that it is good.
> 
> One of the main goal is interoperability, so everytime we do something 
> different from what RFC tell us we have to make sure we are not breaking 
> interoperability with other RFC compliant tools.
> 
> I'm far from being a MIME expert, so I find it difficult to keep up with 
> this discussion if I have to convince people of something. I just want 
> to share my little knowledge about the (mainly SMTP related) RFCs.
> 
> Stefano
> 

Stefano,

The core of this issue is not about standards compliance. I am fine with
mime4j being strict in its interpretation of relevant RFCs by default.
However, the idea of _indiscriminate_ conversion of line delimiters
regardless of where they occur in the data stream seems _very_, _very_
__conceptually__ wrong to me.

I can't help feeling that Ayatollah-style orthodoxy about line
delimiter handling just does not really help anyone. Fortunately for
JAMES, MTAs and MUAs are too complex to be written by complete muppets.
We do not have that privilege in the HTTP world, where one has no other
choice but to interoperate with tons of HTTP agents and CGI scripts
written with a complete disregard for standards. So, in the
HttpComponents project we have a very simple policy: be lenient about
parsing, be strict about formatting. That seems to work well for _us_.

Oleg





Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
>> Oleg Kalnichevski wrote:
>>> Not only does this change completely revert the performance gains and 
>>> make the whole refactoring exercise completely pointless due to an 
>>> utterly inefficient implementation of EOLConvertingInputStream, it is 
>>> also conceptually wrong (in my humble opinion), as it causes mime4j to 
>>> corrupt 8bit encoded 'application/octet-stream' content. This basically 
>>> renders mime4j incompatible with common browsers and HttpClient.
>> The performance of the EOLConvertingInputStream is not important at all 
>> if removing it we have an unusable library. So let's talk about what we 
>> expect from the library, then we'll discuss how to make it performant. I 
>> believe we have technical skills to make a performant EOLConverting stream.
>>
>> About the 8bit encoded 'application/octet-stream' I think we just need 
>> to find the right RFC telling us what we have to do: the RFC I read 
>> about MIME and its applications always tell that CR and LF must not be 
>> alone and that the appropriate transfer encoding have to be used in 
>> order to avoid isolated LF and CR: it is not a matter of personal 
>> preferences, it is a matter of rfc compliance. Let's find the docs, first.
>>
>> What I can find as definition of "8bit" (RFC-2045 Section 2.8) is:
>> -------------------
>> "8bit data" refers to data that is all represented as relatively
>> short lines with 998 octets or less between CRLF line separation
>> sequences [RFC-821]), but octets with decimal values greater than 127
>> may be used.  As with "7bit data" CR and LF octets only occur as part
>> of CRLF line separation sequences and no NULs are allowed.
>> -------------------
>>
> 
> Stefano,
> 
> You are very welcome to impose whatever strict interpretation of the
> relevant RFCs your heart desires. Just please leave an option
> to override it so that the mime4j parser can be used to parse
> real-world content.

Oleg, don't get me wrong. I simply want to make sure we all understand 
what the RFCs say, and understand the specific cases where we are 
ignoring them and WHY.

In the case of the outer boundary we introduced backward compatibility 
issues in the name of performance, mainly because of a lack of knowledge 
of the RFCs. I'm not an expert either, but I think it is important to at 
least take them into consideration once we find the right docs.

I'm not saying that we MUST be 100% compliant and strict, but I want to 
make sure we know when we are doing something non-compliant and that we 
agree that it is good.

One of the main goals is interoperability, so every time we do something 
different from what the RFCs tell us we have to make sure we are not 
breaking interoperability with other RFC-compliant tools.

I'm far from being a MIME expert, so I find it difficult to keep up with 
this discussion if I have to convince people of something. I just want 
to share what little I know about the (mainly SMTP-related) RFCs.

Stefano



Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Thu, 2008-07-17 at 20:21 +0200, Stefano Bagnara wrote:
> Oleg Kalnichevski wrote:
> > Stefano Bagnara wrote:
> >> Stefano Bagnara wrote:
> >>> I noticed that at a point in past the EOLConvertingInputStream has 
> >>> been removed from the chain.
> >>>
> >>> I think this create issues when we parse an input file having only \n 
> >>> and write it in output.
> >>>
> >>> - It seems that we parse most of the code only checking for \n (what 
> >>> does it happen when instead there are only \r? what should we do?)
> >>>
> > 
> > As far as I know a single CR is not used as a valid line delimiter 
> > anywhere. Please correct me if I am wrong.
> 
> AFAIK old MacOS (<X) use CR as their line delimiter.
> This is the same as unixes using LF.
> 
> >>> - If the message have only newlines it seems mime4j ends up 
> >>> outputting headers with CRLF and body with LF.
> > 
> > Why is it a problem? Headers serve a specific role. They convey metadata 
> > about a content body. The transport aspects of metadata are irrelevant, 
> > whereas one _usually_ does not want to a content body to go through a 
> > process of unnecessary transformation.
> 
> I don't understand what "specific role" is related to the RFC: I'm 
> talking about rfc compliance and real world cases as 2 different things. 
> First we have to understand what does it means to be RFC compliant and 
> what is a valid mime content and what is a valid "permissive parsing" 
> from the RFC PoV (as an example if we didn't read the rfc we now would 
> have a not compliant mime parser because of the outerboundaries not 
> having precedence on the nested boundaries).
> 
> You may know that there are specific MIME contents (e.g: delivery 
> notifications) having "header-style" lines in the content: so why 
> headers are different from the body? Why should we convert headers to 
> CRLF? Either we care about a compliant output or I don't understand why 
> we should put CRLF in headers.
> 


> >>> - If the input message have CR ending lines they are not considered 
> >>> by mime4j.
> >>>
> >>> IMHO either we accept LF, CR, and CRLF as CRLF or we only accept CRLF.
> >>>
> > 
> > I respectfully disagree.
> 
> That's good: disagreement allow discussion and allow us to understand 
> why something is good or bad.
> The important thing is that the mime4j community share a goal otherwise 
> each one will commit code diverging from the goal of the other.
> We cannot simply change the behaviour of mime4j because one user need 
> this without discussion or analysis.
> The RFC is our first resource, then we have real world use case to deal 
> with, and user requirements are on a third layer and have to comply with 
> previous requirement.
> 
> Maybe the right solution is making the behaviour configurable, I don't 
> know this, but I think that it's clear we need to discuss the issue 
> because otherwise we simply move away from the RFCs.
> 
> >>> If we do that we have to take care of encoded nested messages: they 
> >>> could have again LF, CR and CRLF like the top stream.
> >>>
> >>>
> >>> What is the right approach? Should we add a EOLConvertingInputStream 
> >>> (CONVERT_BOTH) to every level of parsing or should we fail to parse 
> >>> messages with bad newlines?
> >>>
> >>> I don't like the current behaviour where we accept some malformed 
> >>> data (LF alone are considered CRLF from our parser), we change some 
> >>> of them (the one between headers are converted to CRLF) and we still 
> >>> output malformed data.
> >>>
> >>> Opinions?
> >>
> >> I tried this patch and it seems to work fine (even if it breaks one of 
> >> our core tests that do not expect a CR in an header to be considered a 
> >> newline):
> >>
> > 
> > Not only does this change completely revert the performance gains and 
> > make the whole refactoring exercise completely pointless due to an 
> > utterly inefficient implementation of EOLConvertingInputStream, it is 
> > also conceptually wrong (in my humble opinion), as it causes mime4j to 
> > corrupt 8bit encoded 'application/octet-stream' content. This basically 
> > renders mime4j incompatible with common browsers and HttpClient.
> 
> The performance of the EOLConvertingInputStream is not important at all 
> if removing it we have an unusable library. So let's talk about what we 
> expect from the library, then we'll discuss how to make it performant. I 
> believe we have technical skills to make a performant EOLConverting stream.
> 
> About the 8bit encoded 'application/octet-stream' I think we just need 
> to find the right RFC telling us what we have to do: the RFC I read 
> about MIME and its applications always tell that CR and LF must not be 
> alone and that the appropriate transfer encoding have to be used in 
> order to avoid isolated LF and CR: it is not a matter of personal 
> preferences, it is a matter of rfc compliance. Let's find the docs, first.
> 
> What I can find as definition of "8bit" (RFC-2045 Section 2.8) is:
> -------------------
> "8bit data" refers to data that is all represented as relatively
> short lines with 998 octets or less between CRLF line separation
> sequences [RFC-821]), but octets with decimal values greater than 127
> may be used.  As with "7bit data" CR and LF octets only occur as part
> of CRLF line separation sequences and no NULs are allowed.
> -------------------
> 

Stefano,

You are very welcome to impose whatever strict interpretation of the
relevant RFCs your heart desires. Just please leave an option
to override it so that the mime4j parser can be used to parse
real-world content.

Oleg

> So this would say that 8bit encoded 'application/octet-stream' have 
> anyway lines of 998 chars and does not include isolated CR and LF.
> 
> We have to understand if real world abused the 8bit specification or if 
> there is some mime extension we are not considering: this is important, 
> otherwise we will be the next abuser of the RFC. Apache JAMES PMC agreed 
> (in past, multiple times) that we have to make sure that we are strict 
> about mime written by mime4j and we are permissive with input.
> 
> > If you commit this change could you please provide an option to exclude 
> > EOLConvertingInputStream filter?
> 
> I'm not going to commit anything without agreement on what we want to 
> do. If *I* am the only one that care about the RFC we can even ignore 
> this thread at all, but my duty as PMC member is to raise similar issue 
> and to let the community decide.
> 
> > Thank you
> 
> Thank you, too!
> 
> > Oleg
> 
> Stefano
> 
> 


Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Oleg Kalnichevski wrote:
> Stefano Bagnara wrote:
>> Stefano Bagnara wrote:
>>> I noticed that at a point in past the EOLConvertingInputStream has 
>>> been removed from the chain.
>>>
>>> I think this create issues when we parse an input file having only \n 
>>> and write it in output.
>>>
>>> - It seems that we parse most of the code only checking for \n (what 
>>> does it happen when instead there are only \r? what should we do?)
>>>
> 
> As far as I know a single CR is not used as a valid line delimiter 
> anywhere. Please correct me if I am wrong.

AFAIK old Mac OS (before X) uses CR as its line delimiter, in the same 
way that Unix systems use LF.

>>> - If the message have only newlines it seems mime4j ends up 
>>> outputting headers with CRLF and body with LF.
> 
> Why is it a problem? Headers serve a specific role. They convey metadata 
> about a content body. The transport aspects of metadata are irrelevant, 
> whereas one _usually_ does not want to a content body to go through a 
> process of unnecessary transformation.

I don't understand how this "specific role" relates to the RFC: I'm 
talking about RFC compliance and real-world cases as two different 
things. First we have to understand what it means to be RFC compliant, 
what valid MIME content is, and what counts as valid "permissive 
parsing" from the RFC point of view (as an example, if we hadn't read 
the RFC we would now have a non-compliant MIME parser, because outer 
boundaries would not take precedence over nested boundaries).

You may know that some MIME content types (e.g. delivery notifications) 
have "header-style" lines in the content, so why are headers different 
from the body? Why should we convert headers to CRLF? Either we care 
about compliant output, or I don't understand why we should put CRLF in 
headers.

>>> - If the input message have CR ending lines they are not considered 
>>> by mime4j.
>>>
>>> IMHO either we accept LF, CR, and CRLF as CRLF or we only accept CRLF.
>>>
> 
> I respectfully disagree.

That's good: disagreement allows discussion and lets us understand 
why something is good or bad.
The important thing is that the mime4j community shares a goal, 
otherwise each of us will commit code that diverges from the goals of 
the others.
We cannot simply change the behaviour of mime4j because one user needs 
it, without discussion or analysis.
The RFC is our first resource, then we have real-world use cases to 
deal with, and user requirements come third and have to comply with the 
previous two.

Maybe the right solution is to make the behaviour configurable, I don't 
know, but I think it's clear we need to discuss the issue because 
otherwise we simply move away from the RFCs.
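
As a rough illustration of what a configurable behaviour could look 
like, the sketch below normalises LF, CR and CRLF to CRLF in a lenient 
mode and rejects bare CR or LF in a strict mode. It is not the 
EOLConvertingInputStream implementation: the class, enum and method 
names are made up for this example, and it works on a byte array rather 
than a stream to stay short.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch for illustration only; not part of mime4j. */
public class EolNormalizer {

    /** STRICT rejects bare CR or LF; LENIENT rewrites them to CRLF. */
    public enum Mode { STRICT, LENIENT }

    public static byte[] normalize(byte[] in, Mode mode) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 16);
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            if (b == '\r') {
                if (i + 1 < in.length && in[i + 1] == '\n') {
                    i++; // already a CRLF pair: consume the LF as well
                } else if (mode == Mode.STRICT) {
                    throw new IllegalArgumentException("bare CR at offset " + i);
                }
                out.write('\r');
                out.write('\n');
            } else if (b == '\n') {
                if (mode == Mode.STRICT) {
                    throw new IllegalArgumentException("bare LF at offset " + i);
                }
                out.write('\r');
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

Applied to binary content this would corrupt the data, so a real 
option would also need a way to skip the conversion for such parts.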

>>> If we do that we have to take care of encoded nested messages: they 
>>> could have again LF, CR and CRLF like the top stream.
>>>
>>>
>>> What is the right approach? Should we add a EOLConvertingInputStream 
>>> (CONVERT_BOTH) to every level of parsing or should we fail to parse 
>>> messages with bad newlines?
>>>
>>> I don't like the current behaviour where we accept some malformed 
>>> data (LF alone are considered CRLF from our parser), we change some 
>>> of them (the one between headers are converted to CRLF) and we still 
>>> output malformed data.
>>>
>>> Opinions?
>>
>> I tried this patch and it seems to work fine (even if it breaks one of 
>> our core tests that do not expect a CR in an header to be considered a 
>> newline):
>>
> 
> Not only does this change completely revert the performance gains and 
> make the whole refactoring exercise completely pointless due to an 
> utterly inefficient implementation of EOLConvertingInputStream, it is 
> also conceptually wrong (in my humble opinion), as it causes mime4j to 
> corrupt 8bit encoded 'application/octet-stream' content. This basically 
> renders mime4j incompatible with common browsers and HttpClient.

The performance of the EOLConvertingInputStream is not important at all 
if removing it leaves us with an unusable library. So let's talk about 
what we expect from the library, then we'll discuss how to make it 
performant. I believe we have the technical skills to make a performant 
EOLConverting stream.

About the 8bit encoded 'application/octet-stream', I think we just need 
to find the right RFC telling us what we have to do: the RFCs I have 
read about MIME and its applications always say that CR and LF must not 
appear alone and that the appropriate transfer encoding has to be used 
in order to avoid isolated LF and CR. It is not a matter of personal 
preference, it is a matter of RFC compliance. Let's find the docs first.

What I can find as the definition of "8bit" (RFC-2045 Section 2.8) is:
-------------------
"8bit data" refers to data that is all represented as relatively
short lines with 998 octets or less between CRLF line separation
sequences [RFC-821]), but octets with decimal values greater than 127
may be used.  As with "7bit data" CR and LF octets only occur as part
of CRLF line separation sequences and no NULs are allowed.
-------------------

So this says that even 8bit encoded 'application/octet-stream' content 
has lines of at most 998 octets and does not include isolated CR or LF.
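
Read literally, that definition can be checked mechanically. The 
following is a minimal sketch, with hypothetical names and not taken 
from mime4j, that reports whether a byte array satisfies it: no NUL, CR 
and LF only as CRLF pairs, and at most 998 octets between CRLF 
sequences.

```java
/** Hypothetical checker for illustration only; not part of mime4j. */
public class EightBitCheck {

    /** Maximum number of octets between CRLF sequences (RFC 2045, 2.8). */
    static final int MAX_LINE = 998;

    public static boolean isValid8bit(byte[] data) {
        int lineLen = 0;
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            if (b == 0) {
                return false; // NUL is not allowed
            }
            if (b == '\r') {
                if (i + 1 >= data.length || data[i + 1] != '\n') {
                    return false; // isolated CR
                }
                i++;          // skip the LF of the CRLF pair
                lineLen = 0;  // a new line starts
            } else if (b == '\n') {
                return false; // isolated LF
            } else if (++lineLen > MAX_LINE) {
                return false; // line too long
            }
        }
        return true;
    }
}
```

Arbitrary binary data will almost always fail this check, which is the 
crux of the disagreement about what real-world senders label as 8bit.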

We have to understand whether the real world has abused the 8bit 
specification or whether there is some MIME extension we are not 
considering. This is important, otherwise we will be the next abusers 
of the RFC. The Apache JAMES PMC agreed (multiple times in the past) 
that we have to be strict about the MIME written by mime4j and 
permissive with input.

> If you commit this change could you please provide an option to exclude 
> EOLConvertingInputStream filter?

I'm not going to commit anything without agreement on what we want to 
do. If *I* am the only one who cares about the RFC we can even ignore 
this thread altogether, but my duty as a PMC member is to raise issues 
like this and let the community decide.

> Thank you

Thank you, too!

> Oleg

Stefano




Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Oleg Kalnichevski <ol...@apache.org>.
Stefano Bagnara wrote:
> Stefano Bagnara wrote:
>> I noticed that at a point in past the EOLConvertingInputStream has 
>> been removed from the chain.
>>
>> I think this create issues when we parse an input file having only \n 
>> and write it in output.
>>
>> - It seems that we parse most of the code only checking for \n (what 
>> does it happen when instead there are only \r? what should we do?)
>>

As far as I know a single CR is not used as a valid line delimiter 
anywhere. Please correct me if I am wrong.


>> - If the message have only newlines it seems mime4j ends up outputting 
>> headers with CRLF and body with LF.
>>

Why is it a problem? Headers serve a specific role. They convey metadata 
about a content body. The transport aspects of metadata are irrelevant, 
whereas one _usually_ does not want a content body to go through a 
process of unnecessary transformation.


>> - If the input message have CR ending lines they are not considered by 
>> mime4j.
>>
>> IMHO either we accept LF, CR, and CRLF as CRLF or we only accept CRLF.
>>

I respectfully disagree.


>> If we do that we have to take care of encoded nested messages: they 
>> could have again LF, CR and CRLF like the top stream.
>>
>>
>> What is the right approach? Should we add a EOLConvertingInputStream 
>> (CONVERT_BOTH) to every level of parsing or should we fail to parse 
>> messages with bad newlines?
>>
>> I don't like the current behaviour where we accept some malformed data 
>> (LF alone are considered CRLF from our parser), we change some of them 
>> (the one between headers are converted to CRLF) and we still output 
>> malformed data.
>>
>> Opinions?
> 
> I tried this patch and it seems to work fine (even if it breaks one of 
> our core tests that do not expect a CR in an header to be considered a 
> newline):
> 

Not only does this change completely revert the performance gains and 
make the whole refactoring exercise completely pointless due to an 
utterly inefficient implementation of EOLConvertingInputStream, it is 
also conceptually wrong (in my humble opinion), as it causes mime4j to 
corrupt 8bit encoded 'application/octet-stream' content. This basically 
renders mime4j incompatible with common browsers and HttpClient.

If you commit this change could you please provide an option to exclude 
the EOLConvertingInputStream filter?

Thank you

Oleg


> Index: src/main/java/org/apache/james/mime4j/MimeEntity.java
> ===================================================================
> --- src/main/java/org/apache/james/mime4j/MimeEntity.java    (revision 677582)
> +++ src/main/java/org/apache/james/mime4j/MimeEntity.java    (working copy)
> @@ -197,7 +197,7 @@
>          InputStream instream;
>          if (MimeUtil.isBase64Encoding(transferEncoding)) {
>              log.debug("base64 encoded message/rfc822 detected");
> -            instream = new Base64InputStream(dataStream);
> +            instream = new EOLConvertingInputStream(new Base64InputStream(dataStream));
>          } else if (MimeUtil.isQuotedPrintableEncoded(transferEncoding)) {
>              log.debug("quoted-printable encoded message/rfc822 detected");
>              instream = new QuotedPrintableInputStream(dataStream);
> Index: src/main/java/org/apache/james/mime4j/MimeTokenStream.java
> ===================================================================
> --- src/main/java/org/apache/james/mime4j/MimeTokenStream.java    (revision 676846)
> +++ src/main/java/org/apache/james/mime4j/MimeTokenStream.java    (working copy)
> @@ -143,7 +143,7 @@
> 
>      private void doParse(InputStream stream, String contentType) {
>          entities.clear();
> -        rootInputStream = new RootInputStream(stream);
> +        rootInputStream = new RootInputStream(new EOLConvertingInputStream(stream));
>          inbuffer = new BufferedLineReaderInputStream(rootInputStream, 4 * 1024);
>          switch (recursionMode) {
>          case M_RAW:
> 
> 
> IIRC the EOLConvertingInputStream was removed because of performance issue.
> 
> Stefano
> 


Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Posted by Stefano Bagnara <ap...@bago.org>.
Stefano Bagnara wrote:
> I noticed that at a point in past the EOLConvertingInputStream has been 
> removed from the chain.
> 
> I think this create issues when we parse an input file having only \n 
> and write it in output.
> 
> - It seems that most of our parsing code only checks for \n (what 
> happens when there are only \r instead? What should we do?)
> 
> - If the message has only LF newlines, mime4j seems to end up outputting 
> headers with CRLF and the body with LF.
> 
> - If the input message has CR-terminated lines, mime4j does not treat 
> them as line endings.
> 
> IMHO either we accept LF, CR, and CRLF as CRLF or we only accept CRLF.
> 
> If we do that we have to take care of encoded nested messages: like the 
> top-level stream, they could again contain LF, CR and CRLF.
> 
> 
> What is the right approach? Should we add an EOLConvertingInputStream 
> (CONVERT_BOTH) at every level of parsing, or should we fail to parse 
> messages with bad newlines?
> 
> I don't like the current behaviour, where we accept some malformed data 
> (a lone LF is treated as CRLF by our parser), we change some of it 
> (the line endings between headers are converted to CRLF) and we still 
> output malformed data.
> 
> Opinions?

I tried this patch and it seems to work fine (although it breaks one of 
our core tests, which does not expect a CR in a header to be treated as 
a newline):

Index: src/main/java/org/apache/james/mime4j/MimeEntity.java
===================================================================
--- src/main/java/org/apache/james/mime4j/MimeEntity.java	(revision 677582)
+++ src/main/java/org/apache/james/mime4j/MimeEntity.java	(working copy)
@@ -197,7 +197,7 @@
         InputStream instream;
         if (MimeUtil.isBase64Encoding(transferEncoding)) {
             log.debug("base64 encoded message/rfc822 detected");
-            instream = new Base64InputStream(dataStream);
+            instream = new EOLConvertingInputStream(new Base64InputStream(dataStream));
         } else if (MimeUtil.isQuotedPrintableEncoded(transferEncoding)) {
             log.debug("quoted-printable encoded message/rfc822 detected");
             instream = new QuotedPrintableInputStream(dataStream);
Index: src/main/java/org/apache/james/mime4j/MimeTokenStream.java
===================================================================
--- src/main/java/org/apache/james/mime4j/MimeTokenStream.java	(revision 676846)
+++ src/main/java/org/apache/james/mime4j/MimeTokenStream.java	(working copy)
@@ -143,7 +143,7 @@

     private void doParse(InputStream stream, String contentType) {
         entities.clear();
-        rootInputStream = new RootInputStream(stream);
+        rootInputStream = new RootInputStream(new EOLConvertingInputStream(stream));
         inbuffer = new BufferedLineReaderInputStream(rootInputStream, 4 * 1024);
         switch (recursionMode) {
         case M_RAW:


IIRC the EOLConvertingInputStream was removed because of performance issues.
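
To make the "accept LF, CR, and CRLF as CRLF" option concrete, here is a 
minimal sketch of a filter stream that rewrites every line ending to CRLF. 
This is NOT mime4j's EOLConvertingInputStream: the class name 
CrlfNormalizingInputStream and the normalize() helper are made up for 
illustration, and only java.io classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper (not part of mime4j): converts a lone CR, a lone LF,
 * and CRLF into CRLF while reading.
 */
class CrlfNormalizingInputStream extends InputStream {
    private final PushbackInputStream in;
    // True when a CR has just been returned and the matching LF is still owed.
    private boolean pendingLf = false;

    CrlfNormalizingInputStream(InputStream source) {
        this.in = new PushbackInputStream(source, 1);
    }

    @Override
    public int read() throws IOException {
        if (pendingLf) {
            pendingLf = false;
            return '\n';
        }
        int b = in.read();
        if (b == '\r') {
            // Swallow the LF of an existing CRLF; push back anything else.
            int next = in.read();
            if (next != '\n' && next != -1) {
                in.unread(next);
            }
            pendingLf = true;
            return '\r';
        }
        if (b == '\n') {
            // Lone LF: emit CR now and the LF on the next call.
            pendingLf = true;
            return '\r';
        }
        return b; // ordinary byte, or -1 at end of stream
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Convenience wrapper for experiments: normalizes a whole string. */
    static String normalize(String text) {
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream s =
                new CrlfNormalizingInputStream(new ByteArrayInputStream(raw))) {
            int b;
            while ((b = s.read()) != -1) {
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

With this sketch a lone CR inside a header also becomes a line break, which 
is the kind of behaviour change the failing test mentioned above is about. 
It also reads one byte at a time; a real implementation would probably 
override read(byte[], int, int) to reduce the overhead.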

Stefano
