Posted to mime4j-dev@james.apache.org by Markus Wiederkehr <ma...@gmail.com> on 2009/02/16 14:42:29 UTC

Re: [jira] Assigned: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly

In my opinion this issue is closely related to MIME4J-112 and MIME4J-116.

I think that in the course of MIME4J-116 we should (maybe) create
Field instances in AbstractEntity instead of later on in
MessageBuilder. A Field object could store the raw data in a byte[]
instead of a String which would greatly help with MIME4J-112.

The only problem is that the charset for a lenient parsing mode is not
known at this early point. But considering your clarification about
the lenient writing mode I wonder if anybody really needs a lenient
parsing mode. (I wonder if anyone really needs a lenient writing mode
for that matter.)

So maybe AbstractEntity should simply use US-ASCII to decode the
header fields without direct support for a lenient parsing mode that
nobody needs. Then AbstractEntity can build Field instances and a
ContentHandler receives those Field instances without having to parse
them again.
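
A minimal sketch of what "simply use US-ASCII" without a lenient mode
could mean in practice. The class and method names are hypothetical and
not part of the mime4j API; the sketch only assumes the standard
java.nio.charset classes. Any byte outside the 7-bit range is reported
as an error instead of being guessed at.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a mime4j class.
class StrictAsciiHeaderDecoder {

    // Decodes raw header bytes as US-ASCII and rejects any byte >= 0x80.
    static String decode(byte[] raw) {
        try {
            return StandardCharsets.US_ASCII.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("header contains non-ASCII bytes", e);
        }
    }

    public static void main(String[] args) {
        byte[] ok = "Subject: hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decode(ok));

        // An unencoded 8-bit character, as some mail clients produce.
        byte[] bad = "Subject: caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        try {
            decode(bad);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether such input should be rejected, replaced or passed through as raw
bytes is exactly the open design question in this thread.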

All in all, I'm not sure whether #118 should be addressed independently
of #112 and #116, or whether #118 should be targeted for 0.6.

But those are just my 2 cents,

Markus

On Mon, Feb 16, 2009 at 1:27 PM, Oleg Kalnichevski (JIRA)
<mi...@james.apache.org> wrote:
>
>     [ https://issues.apache.org/jira/browse/MIME4J-118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Oleg Kalnichevski reassigned MIME4J-118:
> ----------------------------------------
>
>    Assignee: oleg.kalnichevski
>
> Working on a patch
>
> Oleg
>
>> MIME stream parser handles non-ASCII fields incorrectly
>> -------------------------------------------------------
>>
>>                 Key: MIME4J-118
>>                 URL: https://issues.apache.org/jira/browse/MIME4J-118
>>             Project: JAMES Mime4j
>>          Issue Type: Bug
>>            Reporter: Oleg Kalnichevski
>>            Assignee: oleg.kalnichevski
>>             Fix For: 0.6
>>
>>
>> Presently the MIME stream parser handles non-ASCII fields incorrectly. Binary field content gets converted to its textual representation too early in the parsing process, using a simple byte-to-char cast. The decision about the appropriate char encoding should be left up to individual ContentHandler implementations.
>> Oleg
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
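
To illustrate the problem described in the quoted JIRA report, here is a
minimal sketch of a per-byte cast. It is not the actual mime4j code; the
class name is hypothetical, and the sketch assumes the byte is masked
with 0xFF before the cast, which makes the result identical to an
ISO-8859-1 decode. Once the bytes have been turned into chars this way,
a UTF-8 encoded header is already garbled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not reproduce the real parser code.
class ByteToCharCastDemo {

    // Turns each byte into one char. Masking with 0xFF avoids sign
    // extension, so the result equals ISO-8859-1 decoding.
    static String castDecode(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length);
        for (byte b : raw) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A header whose value was written unencoded in UTF-8.
        byte[] raw = "Subject: caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // The cast splits the two-byte UTF-8 sequence into two chars.
        System.out.println(castDecode(raw));

        // Decoding with the right charset recovers the original text.
        System.out.println(new String(raw, StandardCharsets.UTF_8));
    }
}
```

The cast itself is harmless for ISO-8859-1 input; the trouble is that
the choice of charset has been made before any ContentHandler had a say.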

Re: [jira] Assigned: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly

Posted by Oleg Kalnichevski <ol...@apache.org>.

Markus Wiederkehr wrote:
> In my opinion this issue is closely related to MIME4J-112 and MIME4J-116.
> 
> I think that in the course of MIME4J-116 we should (maybe) create
> Field instances in AbstractEntity instead of later on in
> MessageBuilder. A Field object could store the raw data in a byte[]
> instead of a String which would greatly help with MIME4J-112.
> 

I would very much prefer not to couple MIME entity classes with
Field, if possible.

> The only problem is that the charset for a lenient parsing mode is not
> known at this early point. But considering your clarification about
> the lenient writing mode I wonder if anybody really needs a lenient
> parsing mode. (I wonder if anyone really needs a lenient writing mode
> for that matter.)
> 
> So maybe AbstractEntity should simply use US-ASCII to decode the
> header fields without direct support for a lenient parsing mode that
> nobody needs. Then AbstractEntity can build Field instances and a
> ContentHandler receives those Field instances without having to parse
> them again.
> 
> All in all, I'm not sure whether #118 should be addressed independently
> of #112 and #116, or whether #118 should be targeted for 0.6.
> 

I personally dislike 'big-bang' style refactoring and prefer smaller
incremental changes, where lower-level components get fixed first and
the remaining issues get 'pushed' upwards to the higher-level
components.

I'll have a patch ready by tomorrow noon. If it gets rejected, let us 
revisit the idea of fixing #118, #112 and #116 all at the same time.

Cheers

Oleg


Re: [jira] Assigned: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly

Posted by Markus Wiederkehr <ma...@gmail.com>.

On Mon, Feb 16, 2009 at 2:49 PM, Stefano Bagnara <ap...@bago.org> wrote:
> Markus Wiederkehr wrote:
>> In my opinion this issue is closely related to MIME4J-112 and MIME4J-116.
>>
>> I think that in the course of MIME4J-116 we should (maybe) create
>> Field instances in AbstractEntity instead of later on in
>> MessageBuilder. A Field object could store the raw data in a byte[]
>> instead of a String which would greatly help with MIME4J-112.
>>
>> The only problem is that the charset for a lenient parsing mode is not
>> known at this early point. But considering your clarification about
>> the lenient writing mode I wonder if anybody really needs a lenient
>> parsing mode. (I wonder if anyone really needs a lenient writing mode
>> for that matter.)
>
> Lenient Writing IMO is only needed if you need roundtrip. For
> standard/most MIME4J usages I don't see why we should write malformed
> data in output.

In my opinion Field should preserve the original bytes in a byte
array. Writing a message could simply use these original bytes and
there would be no roundtrip issues. Essentially there would be only
one writing mode.
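
A minimal sketch of that idea, assuming a hypothetical RawField class
(mime4j's actual Field type may look quite different): the field keeps
the bytes exactly as they were read, and writing just copies them out,
so even malformed 8-bit content survives a round trip unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical class for illustration; not the mime4j Field API.
final class RawField {
    // The field exactly as read, without the trailing CRLF.
    private final byte[] raw;

    RawField(byte[] raw) {
        this.raw = raw.clone();
    }

    // Writes the original bytes untouched, followed by CRLF.
    void writeTo(OutputStream out) throws IOException {
        out.write(raw);
        out.write('\r');
        out.write('\n');
    }

    public static void main(String[] args) throws IOException {
        // "Subject: caf" followed by one unencoded 8-bit byte.
        byte[] input = { 'S', 'u', 'b', 'j', 'e', 'c', 't', ':', ' ',
                         'c', 'a', 'f', (byte) 0xE9 };

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        new RawField(input).writeTo(out);

        // Strip the CRLF again and compare with what was read.
        byte[] written = Arrays.copyOf(out.toByteArray(), input.length);
        System.out.println("round trip intact: " + Arrays.equals(input, written));
    }
}
```

No charset is involved on the write path at all, which is why a single
writing mode would be enough.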

In addition, I would like to have a "visitor" or similar that can be
used to tidy up a message.

> Lenient reading, on the other hand, is part of being a generic parsing
> library: most email clients correctly handle 8bit chars in the Subject
> header because some email clients write them unencoded. If you think
> mime4j could be used as the library for an email client, it is
> probably still worth handling 8bit chars in the headers.
> Of course there is no need to implement such a feature until someone
> really asks for it or needs it.

My approach would still allow for that with a little overhead. If a
ContentHandler receives a Field and that field contains the original
raw bytes then nothing prevents the ContentHandler from parsing the
fields again; using any charset determined by whatever means. Also
structured fields are parsed lazily so the overhead would not be
tremendous.
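
A minimal sketch of how a handler could re-decode a field on demand. The
LazyField class and its methods are hypothetical, not the mime4j API.
It assumes the field body is whatever follows the first ':' and that
the handler picks an ASCII-compatible charset by its own means.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration; not the mime4j Field API.
final class LazyField {
    private final byte[] raw;   // original bytes, kept for re-parsing
    private String defaultBody; // decoded on first access only

    LazyField(byte[] raw) {
        this.raw = raw.clone();
    }

    // Default view: US-ASCII, decoded lazily and cached.
    // Non-ASCII bytes show up as U+FFFD replacement characters.
    String getBody() {
        if (defaultBody == null) {
            defaultBody = getBody(StandardCharsets.US_ASCII);
        }
        return defaultBody;
    }

    // Lenient view: the caller decides which charset to apply.
    String getBody(Charset charset) {
        int colon = 0;
        while (colon < raw.length && raw[colon] != ':') {
            colon++;
        }
        int start = Math.min(colon + 1, raw.length);
        return new String(raw, start, raw.length - start, charset).trim();
    }

    public static void main(String[] args) {
        byte[] raw = "Subject: caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        LazyField field = new LazyField(raw);

        System.out.println(field.getBody());                            // strict default
        System.out.println(field.getBody(StandardCharsets.ISO_8859_1)); // handler's choice
    }
}
```

The extra cost is one more decode of a short byte array, and only for
handlers that actually ask for it.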

> I don't really know how many email messages contain unencoded headers
> nowadays. Ten years ago, when I looked into this closely, almost 40% of
> international emails included unencoded headers. I expect this
> percentage to be much lower today, but I don't know if it is 10% or 0.1%.
>
> Stefano
>

Re: [jira] Assigned: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly

Posted by Stefano Bagnara <ap...@bago.org>.

Markus Wiederkehr wrote:
> In my opinion this issue is closely related to MIME4J-112 and MIME4J-116.
> 
> I think that in the course of MIME4J-116 we should (maybe) create
> Field instances in AbstractEntity instead of later on in
> MessageBuilder. A Field object could store the raw data in a byte[]
> instead of a String which would greatly help with MIME4J-112.
> 
> The only problem is that the charset for a lenient parsing mode is not
> known at this early point. But considering your clarification about
> the lenient writing mode I wonder if anybody really needs a lenient
> parsing mode. (I wonder if anyone really needs a lenient writing mode
> for that matter.)

Lenient Writing IMO is only needed if you need roundtrip. For
standard/most MIME4J usages I don't see why we should write malformed
data in output.

Lenient reading, on the other hand, is part of being a generic parsing
library: most email clients correctly handle 8bit chars in the Subject
header because some email clients write them unencoded. If you think
mime4j could be used as the library for an email client, it is
probably still worth handling 8bit chars in the headers.
Of course there is no need to implement such a feature until someone
really asks for it or needs it.

I don't really know how many email messages contain unencoded headers
nowadays. Ten years ago, when I looked into this closely, almost 40% of
international emails included unencoded headers. I expect this
percentage to be much lower today, but I don't know if it is 10% or 0.1%.

Stefano
