Posted to mime4j-dev@james.apache.org by Ondrej Bojar <bo...@ufal.mff.cuni.cz> on 2009/03/26 22:45:40 UTC

accented characters in e-mail addresses

Dear Mime4J developers,

I use Android, and both the built-in Email client and the K-9 replacement delete
e-mail addresses containing accented characters. (If, say, "Pétér
<pe...@peter.com>" sends me an e-mail and I hit 'Reply', the 'To' field becomes
blank.)

I can barely read Java, but I understood from K-9 source they use your Mime4J 
for e-mail address parsing (and thus validation).

I was not able to compile the code downloaded from your site (I know nothing 
about Maven, I installed it but running 'mvn test' tried to download something 
and failed.)

I compiled K-9 (the source of which includes a version of mime4j) and I guess 
this exception is exactly the reason why they remove addresses with accented 
characters:

22:39 vaio classes$java org.apache.james.mime4j.field.address.AddressList
 > Pétér <pe...@peter.com>
Pétér <pe...@peter.com>
org.apache.james.mime4j.field.address.parser.ParseException: Lexical error at 
line 1, column 2.  Encountered: "\u00e9" (233), after : ""
         at 
org.apache.james.mime4j.field.address.parser.AddressListParser.parse(AddressListParser.java:42)
         at 
org.apache.james.mime4j.field.address.AddressList.parse(AddressList.java:116)
         at 
org.apache.james.mime4j.field.address.AddressList.main(AddressList.java:132)
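
The column and code in that message are consistent with the first character
outside US-ASCII in the input. A minimal sketch in plain Java (no Mime4j
involved; the class name, the method name and the example.com address are made
up for illustration) that locates that character:

```java
final class NonAsciiSketch {
    /** Returns the 1-based column of the first char above 127, or -1 if there is none. */
    static int firstNonAsciiColumn(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The name uses U+00E9 (e with acute); the address is a placeholder.
        String input = "P\u00e9t\u00e9r <peter@example.com>";
        int column = firstNonAsciiColumn(input);
        System.out.println("column " + column + ", code " + (int) input.charAt(column - 1));
    }
}
```

Run as is, this prints "column 2, code 233", the same position and code point
the ParseException reports.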


I've read your remark somewhere that you're deliberately not handling Base64 or 
Quoted-Printable, but this is plain UTF-8 so that shouldn't pose a problem.

My question is simple: who should I blame ;-)

With apologies for a question from a non-Javist,
   Ondrej Bojar.

-- 
Ondrej Bojar (mailto:obo@cuni.cz / bojar@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo


Re: accented characters in e-mail addresses

Posted by Markus Wiederkehr <ma...@gmail.com>.
On Fri, Mar 27, 2009 at 9:52 AM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> On Fri, Mar 27, 2009 at 12:20 AM, Ondrej Bojar <bo...@ufal.mff.cuni.cz> wrote:
>> Dear Markus,
>>
>> thanks for the explanation.
>>
>> From this I understand that the bug is in the way Mime4j is called from K-9
>> (and Google's original Email client). Mime4j is meant for parsing header
>> fields as they arrive, that is following the appropriate RFC for MIME.
>> Mime4j is not intended for validation of header fields as they are presented
>> to (or in my case entered by) the user.
>
> one of the problems with the RFCs is that the IETF working group
> actively excludes use cases like this which concern mail processing
> rather than mail transport. they have specific rules to be applied
> when streaming bytes from a socket which are often unreasonable or
> inconvenient in these cases.
>
> IMHO a good MIME library should be able to handle both. some encodings
> would be tricky but MIME headers should be 8-bit clean so UTF-8 should
> be reasonably straightforward.

I think in this case there is no need to deal with bytes or character
encodings because the encoded words have already been decoded and the
address has already been converted to a Java string.

But yes, Mime4j should be capable of parsing an address that contains
special characters in the "name" part. And I think in this case the
phrases should automatically be encoded into encoded words so that the
address may be used for transport.

We'd probably have to change AddressListParser.jj for that.. Currently
it has these rules:

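// A named mailbox: the display name followed by the address in angle brackets.
void name_addr() :
{}
{
	phrase() angle_addr()
}

// The display name: one or more dot-atoms or quoted strings, nothing else.
void phrase() :
{}
{
(	<DOTATOM>
|	<QUOTEDSTRING>
)+
}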

.. which is very strict.
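
To make the strictness concrete, here is a rough sketch of the same check in
plain Java. It is not the generated parser. The token definitions are not shown
in this mail, so the sketch ASSUMES that DOTATOM is a run of RFC 5322 "atext"
characters joined by dots and that QUOTEDSTRING holds printable ASCII with
backslash escapes. It also requires whitespace between tokens, which is a
simplification. All names are made up.

```java
import java.util.regex.Pattern;

final class StrictPhraseSketch {
    // ASSUMPTION: DOTATOM is modelled on RFC 5322 atext runs joined by dots.
    private static final String ATEXT = "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]";
    private static final String WORD = ATEXT + "+(?:\\." + ATEXT + "+)*";

    // ASSUMPTION: QUOTEDSTRING is printable ASCII with backslash escapes.
    private static final String QUOTED =
            "\"(?:[\\x20\\x21\\x23-\\x5B\\x5D-\\x7E]|\\\\[\\x00-\\x7F])*\"";

    private static final String TOKEN = "(?:" + WORD + "|" + QUOTED + ")";

    // One or more tokens, separated by whitespace (a simplification).
    private static final Pattern PHRASE =
            Pattern.compile("\\s*" + TOKEN + "(?:\\s+" + TOKEN + ")*\\s*");

    /** True if the display name would pass the ASCII-only phrase rule sketched above. */
    static boolean isStrictPhrase(String name) {
        return PHRASE.matcher(name).matches();
    }
}
```

Under these assumptions isStrictPhrase("Hans Mueller") is true while any name
containing a character such as U+00FC is rejected, which is the behaviour the
rules above would have to relax.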

>> Is there a method in Mime4j to encode UTF-8 to the 'encoded word' =?...?=?
>> (I guess there is not.) Such a method would have to correctly handle *lists*
>> of 'decoded' addresses and not create e.g.
>>
>> =?ISO-8859-1?Q?Hans_=3Chans=40acme.org=3E,_Hans_M=FCller?=
>> <ha...@acme.org>
>>
>> from
>>
>> Hans <ha...@acme.org>, Hans Müller <ha...@acme.org>
>
> encoding then decoding seems a little unnecessary. i think a
> configuration setting (offline mode, perhaps) allowing the header
> character set to vary would be a more elegant way to support this use
> case.

I think K-9 decodes the address in order to display it to the user and
then wants to use that decoded address when it creates the reply
message.

In this particular use case decoding and re-encoding makes sense.

Markus

Re: accented characters in e-mail addresses

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Fri, Mar 27, 2009 at 12:20 AM, Ondrej Bojar <bo...@ufal.mff.cuni.cz> wrote:
> Dear Markus,
>
> thanks for the explanation.
>
> From this I understand that the bug is in the way Mime4j is called from K-9
> (and Google's original Email client). Mime4j is meant for parsing header
> fields as they arrive, that is following the appropriate RFC for MIME.
> Mime4j is not intended for validation of header fields as they are presented
> to (or in my case entered by) the user.

one of the problems with the RFCs is that the IETF working group
actively excludes use cases like this which concern mail processing
rather than mail transport. they have specific rules to be applied
when streaming bytes from a socket which are often unreasonable or
inconvenient in these cases.

IMHO a good MIME library should be able to handle both. some encodings
would be tricky but MIME headers should be 8-bit clean so UTF-8 should
be reasonably straightforward.

> Is there a method in Mime4j to encode UTF-8 to the 'encoded word' =?...?=?
> (I guess there is not.) Such a method would have to correctly handle *lists*
> of 'decoded' addresses and not create e.g.
>
> =?ISO-8859-1?Q?Hans_=3Chans=40acme.org=3E,_Hans_M=FCller?=
> <ha...@acme.org>
>
> from
>
> Hans <ha...@acme.org>, Hans Müller <ha...@acme.org>

encoding then decoding seems a little unnecessary. i think a
configuration setting (offline mode, perhaps) allowing the header
character set to vary would be a more elegant way to support this use
case.

opinions?

- robert

BTW if anyone from K-9 wants to join this discussion and doesn't want
to sign up for mime4j-dev, please open a JIRA (issues.apache.org) and
post comments there

Re: accented characters in e-mail addresses

Posted by Markus Wiederkehr <ma...@gmail.com>.
On Fri, Mar 27, 2009 at 12:20 AM, Ondrej Bojar <bo...@ufal.mff.cuni.cz> wrote:
> Dear Markus,
>
> thanks for the explanation.
>
> From this I understand that the bug is in the way Mime4j is called from K-9
> (and Google's original Email client). Mime4j is meant for parsing header
> fields as they arrive, that is following the appropriate RFC for MIME.
> Mime4j is not intended for validation of header fields as they are presented
> to (or in my case entered by) the user.
>
> Is there a method in Mime4j to encode UTF-8 to the 'encoded word' =?...?=?
> (I guess there is not.) Such a method would have to correctly handle *lists*
> of 'decoded' addresses and not create e.g.

In Mime4j 0.6 there is
o.a.j.mime4j.codec.EncoderUtil.encodeEncodedWord() but I think we need
a better solution for this problem..

Up to release 0.5 Mime4j was mainly about parsing and decoding
messages and not so much about creating or manipulating messages. This
situation improved with 0.6 but apparently not enough. I'll open a
JIRA for parsing addresses that include special characters.

Markus


>
> =?ISO-8859-1?Q?Hans_=3Chans=40acme.org=3E,_Hans_M=FCller?=
> <ha...@acme.org>
>
> from
>
> Hans <ha...@acme.org>, Hans Müller <ha...@acme.org>
>
> Thanks, Ondrej.
>
> P.S. for myself or K-9 developers:
> com/android/email/EmailAddressValidator.java:16
>  should not call com.android.email.mail.Address.parse
>  (or com.android.email.mail.Address.parse should first encode UTF-8 prior to
>  passing it to Mime4j)
>
> Markus Wiederkehr wrote:
>>
>> E-mail header fields may contain us-ascii characters only. To overcome
>> this restriction the "name" part of an e-mail address is usually
>> encoded by a mechanism called an "encoded word". Your mail client then
>> knows how to interpret these encoded words and is able to display the
>> original name.
>>
>> Just look into the source code of your e-mails to see what I mean.
>> You'll occasionally see e-mail addresses such as
>> "=?ISO-8859-1?Q?Hans_M=FCller?= <ha...@acme.org>" which is
>> equivalent to "Hans Müller <ha...@acme.org>".
>>
>> Mime4j should be capable of decoding them too..
>>
>> Markus
>>
>
> --
> Ondrej Bojar (mailto:obo@cuni.cz / bojar@ufal.mff.cuni.cz)
> http://www.cuni.cz/~obo
>
>



-- 
Always remember you're unique. Just like everyone else.

Re: accented characters in e-mail addresses

Posted by Ondrej Bojar <bo...@ufal.mff.cuni.cz>.
Dear Markus,

thanks for the explanation.

From this I understand that the bug is in the way Mime4j is called from K-9
(and Google's original Email client). Mime4j is meant for parsing header fields 
as they arrive, that is following the appropriate RFC for MIME. Mime4j is not 
intended for validation of header fields as they are presented to (or in my case 
entered by) the user.

Is there a method in Mime4j to encode UTF-8 to the 'encoded word' =?...?=? (I 
guess there is not.) Such a method would have to correctly handle *lists* of 
'decoded' addresses and not create e.g.

=?ISO-8859-1?Q?Hans_=3Chans=40acme.org=3E,_Hans_M=FCller?= <ha...@acme.org>

from

Hans <ha...@acme.org>, Hans Müller <ha...@acme.org>
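
What I have in mind is roughly the following sketch (plain Java, not a Mime4j
API; all names and the example.org addresses are invented). It takes the list
as separate {name, address} pairs and Q-encodes each display name on its own,
so the angle brackets and commas never end up inside an encoded word. It
ASSUMES UTF-8 as the charset rather than ISO-8859-1, and it ignores the
75-character limit on encoded words and the quoting of ASCII specials.

```java
import java.nio.charset.StandardCharsets;

final class NameEncodeSketch {
    /** Q-encodes a display name as one UTF-8 encoded word; pure-ASCII names pass through. */
    static String encodeName(String name) {
        boolean ascii = true;
        for (int i = 0; i < name.length(); i++) {
            if (name.charAt(i) > 127) {
                ascii = false;
                break;
            }
        }
        if (ascii) {
            return name; // quoting of ASCII specials is out of scope here
        }
        StringBuilder sb = new StringBuilder("=?UTF-8?Q?");
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xff;
            boolean alnum = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9');
            if (alnum) {
                sb.append((char) v);                  // safe to leave as is
            } else if (v == ' ') {
                sb.append('_');                       // space is written as underscore
            } else {
                sb.append(String.format("=%02X", v)); // everything else as =XX
            }
        }
        return sb.append("?=").toString();
    }

    /** Formats {name, address} pairs, encoding each name separately. */
    static String formatList(String[][] mailboxes) {
        StringBuilder sb = new StringBuilder();
        for (String[] mailbox : mailboxes) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(encodeName(mailbox[0])).append(" <").append(mailbox[1]).append('>');
        }
        return sb.toString();
    }
}
```

For the two mailboxes {"Hans", "hans@example.org"} and {"Hans Müller",
"hans.mueller@example.org"} this yields
Hans <hans@example.org>, =?UTF-8?Q?Hans_M=C3=BCller?= <hans.mueller@example.org>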

Thanks, Ondrej.

P.S. for myself or K-9 developers:
com/android/email/EmailAddressValidator.java:16
   should not call com.android.email.mail.Address.parse
   (or com.android.email.mail.Address.parse should first encode UTF-8 prior to
   passing it to Mime4j)

Markus Wiederkehr wrote:
> E-mail header fields may contain us-ascii characters only. To overcome
> this restriction the "name" part of an e-mail address is usually
> encoded by a mechanism called an "encoded word". Your mail client then
> knows how to interpret these encoded words and is able to display the
> original name.
> 
> Just look into the source code of your e-mails to see what I mean.
> You'll occasionally see e-mail addresses such as
> "=?ISO-8859-1?Q?Hans_M=FCller?= <ha...@acme.org>" which is
> equivalent to "Hans Müller <ha...@acme.org>".
> 
> Mime4j should be capable of decoding them too..
> 
> Markus
> 

-- 
Ondrej Bojar (mailto:obo@cuni.cz / bojar@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo


Re: accented characters in e-mail addresses

Posted by Markus Wiederkehr <ma...@gmail.com>.
E-mail header fields may contain us-ascii characters only. To overcome
this restriction the "name" part of an e-mail address is usually
encoded by a mechanism called an "encoded word". Your mail client then
knows how to interpret these encoded words and is able to display the
original name.

Just look into the source code of your e-mails to see what I mean.
You'll occasionally see e-mail addresses such as
"=?ISO-8859-1?Q?Hans_M=FCller?= <ha...@acme.org>" which is
equivalent to "Hans Müller <ha...@acme.org>".

Mime4j should be capable of decoding them too..
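
To illustrate what such decoding involves, here is a small sketch in plain Java.
It is not Mime4j code. It handles only the "Q" encoding, does no error handling
for malformed input, and the class name, method names and the example.org
address are made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QWordDecodeSketch {
    // Matches =?charset?Q?text?= ; only the "Q" encoding is handled in this sketch.
    private static final Pattern Q_WORD =
            Pattern.compile("=\\?([^?]+)\\?[Qq]\\?([^?]*)\\?=");

    /** Replaces every Q-encoded word in the input with its decoded text. */
    static String decode(String header) {
        Matcher m = Q_WORD.matcher(header);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String text = decodeQ(m.group(2), Charset.forName(m.group(1)));
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String decodeQ(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');                     // underscore stands for a space
            } else if (c == '=' && i + 2 < text.length()) {
                // "=XX" is one byte written as two hex digits
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);                       // plain ASCII character
            }
        }
        return new String(bytes.toByteArray(), charset);
    }
}
```

Calling decode("=?ISO-8859-1?Q?Hans_M=FCller?= <hans.mueller@example.org>")
returns "Hans Müller <hans.mueller@example.org>".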

Markus
