Posted to mime4j-dev@james.apache.org by "Sharma, Ashish" <as...@hp.com> on 2011/10/17 10:42:54 UTC

RE: Mime word not getting decoded using mime4j

Stefano,

I have seen the wrong-encoding issue with the following clients:

1. Google webmail client.
2. Yahoo webmail client.
3. AOL webmail client, and many more.

Moreover, I have found that most web browsers have built-in algorithms to detect character encodings (especially for South East Asian charsets), which work around the problem I am facing.
So I believe mime4j should offer such a facility too.

Thanks
Ashish

-----Original Message-----
From: Stefano Bagnara [mailto:apache@bago.org] 
Sent: Wednesday, August 03, 2011 8:51 PM
To: mime4j-dev@james.apache.org
Subject: Re: Mime word not getting decoded using mime4j

2011/8/3 Sharma, Ashish <as...@hp.com>:
> Stefano,
>
>>>You say this "wrong charset" is reported by many commonly used mail
>>>clients and so you expect mime4j to have a workaround for that: right?
>
> Yes, you understood right.

First, we need to identify which clients produce the wrong encodings.
Can you provide a list?

Stefano

> My JVM details are as follows:
>
> java version "1.6.0_24"
> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
> Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
>
> Thanks
> Ashish
>
> -----Original Message-----
> From: Stefano Bagnara [mailto:apache@bago.org]
> Sent: Tuesday, August 02, 2011 2:54 PM
> To: mime4j-dev@james.apache.org
> Subject: Re: Mime word not getting decoded using mime4j
>
> Hi,
>
> I'm not sure I understood the issue.
>
> If I understand correctly, mime4j is working fine but your input
> string declares the wrong charset. Is this right?
>
> You say this "wrong charset" is reported by many commonly used mail
> clients and so you expect mime4j to have a workaround for that: right?
>
> What JVM are you using?
>
> Stefano
>
> 2011/8/2 Sharma, Ashish <as...@hp.com>:
>> Hi,
>>
>> I am trying to decode MIME encoded words (the original string is in Chinese characters) using DecoderUtil.decodeEncodedWords().
>>
>> The sample code follows:
>>
>> @Test
>>        public void testEncoding() throws UnsupportedEncodingException, IOException{
>>                String str = "=?gb2312?B?ztKyu8rH1tCH+LmyrmEudHh0?=";
>>                str = str + "\r\n ";
>>                str = str + "=?gb2312?B?ztLKx9bQufrIyy50eHQ=?=";
>>                str = DecoderUtil.decodeEncodedWords(str);
>>                File file = new File("C://chinese2.txt");
>>                FileOutputStream fileOut = new FileOutputStream(file);
>>                fileOut.write(str.getBytes("gb2312"));
>>                fileOut.flush();
>>                fileOut.close();
>>
>>        }
>>
>> With the above code the characters come out corrupted.
>>
>> The problem is with the character set: most mail clients declare the charset as GB2312, but to decode the characters correctly I had to use GB18030 in the above code. (For more info, see: http://stackoverflow.com/questions/3856920/character-corruption-for-chinese-simple-and-traditional-and-korean-texts)
>>
>> Below is the generalization I made to replace the character sets sent by mail clients so that the characters decode correctly:
>>
>> 1. For any of the following Chinese char sets:
>>
>>        iso-ir-58,chinese,gbk,cn-gb,csgb2312,csiso58gb231280,euc-cn,euc_cn,euccn,gb2312,gb_2312-80,x-EUC-CN,gb2312-1980,gb2312-80
>>
>>        replace it with: GB18030
>>
>> 2. For any of the following Korean char sets:
>>
>>        5601,ksc5601-1987,ksc5601_1987,euckr,ksc5601,ksc_5601,euc_kr,csEUCKR,ks_c_5601-1987
>>
>>        replace it with: EUC-KR
>>
>> 3. For any of the following Thai char sets:
>>
>>        ms-874,ms874,windows-874,cp874,874,cs874,ibm874
>>
>>        replace it with: TIS-620
>>
>>
>> I suggest that charset fallback be provided in the "DecoderUtil.decodeEncodedWords()" method itself.
>>
>> For more info, see also http://wiki.whatwg.org/wiki/Web_Encodings.
>>
>> Please reply with your comments.
>>
>> Thanks
>> Ashish Sharma
>>
>

Re: Mime word not getting decoded using mime4j

Posted by Stefano Bagnara <ap...@bago.org>.
2011/10/17 Sharma, Ashish <as...@hp.com>:
> Stefano,
>
> I have seen the wrong-encoding issue with the following clients:
>
> 1. Google webmail client.
> 2. Yahoo webmail client.
> 3. AOL webmail client, and many more.
>
> Moreover, I have found that most web browsers have built-in algorithms to detect character encodings (especially for South East Asian charsets), which work around the problem I am facing.
> So I believe mime4j should offer such a facility too.

Well, I still want to see the issue and establish exactly which charset
is produced by which client. We can't simply swap charsets around,
otherwise we'll introduce bugs instead of adding facilities.

E.g., you say:
-------
1. For any of the following Chinese char sets:

       iso-ir-58,chinese,gbk,cn-gb,csgb2312,csiso58gb231280,euc-cn,euc_cn,euccn,gb2312,gb_2312-80,x-EUC-CN,gb2312-1980,gb2312-80

       replace it with : GB18030
--------
GB18030 is a superset of some of them (maybe all of them), so it may
be safe to relax our decoding to support this case. I even found
discussions on the Evolution list about Chinese Outlook wrongly using
these charsets in headers (with no reference to the above webmails: maybe
this is not an issue with the webmails but with the browser
used? Which browser did you use to reproduce it?).
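
A minimal sketch of the GB18030 case, using only the JDK rather than mime4j internals. The class name, the `resolve` override rule, and the `relaxed` flag are hypothetical and are not part of mime4j; `java.util.Base64` assumes Java 8 or later. It decodes one Base64 encoded word either with the declared charset or with GB18030 substituted for gb2312/gbk, so the two results can be compared.

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Gb18030FallbackSketch {
    // Matches a single Base64 ("B") encoded word: =?charset?B?payload?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?[bB]\\?([^?]*)\\?=");

    /** Hypothetical override rule: treat gb2312/gbk labels as GB18030 when relaxed. */
    static Charset resolve(String declared, boolean relaxed) {
        String name = declared.toLowerCase(Locale.ROOT);
        if (relaxed && (name.equals("gb2312") || name.equals("gbk"))) {
            return Charset.forName("GB18030");
        }
        return Charset.forName(declared);
    }

    /** Decodes one B-encoded word; input that does not match is returned unchanged. */
    static String decode(String word, boolean relaxed) {
        Matcher m = ENCODED_WORD.matcher(word);
        if (!m.matches()) {
            return word;
        }
        byte[] bytes = Base64.getDecoder().decode(m.group(2));
        return new String(bytes, resolve(m.group(1), relaxed));
    }

    public static void main(String[] args) {
        // First sample word from the original report.
        String word = "=?gb2312?B?ztKyu8rH1tCH+LmyrmEudHh0?=";
        String strict = decode(word, false);
        String relaxed = decode(word, true);
        // U+FFFD marks bytes the decoder could not map.
        System.out.println("declared charset has U+FFFD: " + (strict.indexOf('\uFFFD') >= 0));
        System.out.println("GB18030 override has U+FFFD: " + (relaxed.indexOf('\uFFFD') >= 0));
    }
}
```

The sample payload contains the byte pair 0x87 0xF8, whose lead byte is outside the GB2312 range, which is why the declared charset cannot decode it. Text that is valid GB2312, such as the second sample word, decodes identically either way.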

------
2. For any of the following Korean char sets:

       5601,ksc5601-1987,ksc5601_1987,euckr,ksc5601,ksc_5601,euc_kr,csEUCKR,ks_c_5601-1987

       replace it with: EUC-KR
------
This doesn't make sense to me. They are all aliases for the same
charset in Java, so I don't think this replacement will change
anything in the output. Java should already use the same decoder for
all of them. Can you provide a test showing that this replacement would
fix something?
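
The alias claim can be checked on a given JVM with a small sketch like the following. The class and method names are hypothetical; it uses only `java.nio.charset.Charset`, and the result depends on which charsets the running JVM ships.

```java
import java.nio.charset.Charset;

public class KoreanAliasCheck {
    /** True if both names are supported and resolve to the same Charset on this JVM. */
    static boolean sameCharset(String a, String b) {
        return Charset.isSupported(a) && Charset.isSupported(b)
                && Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        // The names from the proposed Korean replacement list.
        String[] names = {"5601", "ksc5601-1987", "ksc5601_1987", "euckr", "ksc5601",
                "ksc_5601", "euc_kr", "csEUCKR", "ks_c_5601-1987"};
        for (String name : names) {
            // Print the canonical name each alias resolves to, if any.
            String canonical = Charset.isSupported(name)
                    ? Charset.forName(name).name()
                    : "unsupported";
            System.out.println(name + " -> " + canonical);
        }
    }
}
```

If every name prints the same canonical charset, replacing one with another cannot change the decoded output.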

-----
3. For any of the following Thai char sets:

       ms-874,ms874,windows-874,cp874,874,cs874,ibm874

       replace it with: TIS-620
-----
In this case TIS-620 seems to be a subset of windows-874:
windows-874 defines only a few chars more than TIS-620, so
replacing it would simply "lose" those chars instead of adding
a facility.
Can you provide more information about how you decided this replacement
is needed? E.g., a real MIME message created by a real mail client that
we can look at.
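
To see concretely what such a replacement would change, here is a sketch that lists the bytes the two charsets decode differently. The class and method names are hypothetical, the diamond operator assumes Java 7 or later, and the output depends on the charset tables of the running JVM.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.TreeMap;

public class ThaiCharsetDiff {
    /** Decodes a single byte with the given charset. */
    static String decodeByte(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs);
    }

    /** Maps each byte in 0x80..0xFF that decodes differently to its windows-874 decoding. */
    static Map<Integer, String> differences() {
        Charset tis = Charset.forName("TIS-620");
        Charset win = Charset.forName("windows-874");
        Map<Integer, String> diff = new TreeMap<>();
        for (int b = 0x80; b <= 0xFF; b++) {
            String w = decodeByte(b, win);
            if (!w.equals(decodeByte(b, tis))) {
                diff.put(b, w);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, String> e : differences().entrySet()) {
            System.out.printf("0x%02X -> U+%04X%n", e.getKey(), (int) e.getValue().charAt(0));
        }
    }
}
```

Any byte listed is one that would decode differently if windows-874 were silently replaced with TIS-620.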

> Thanks
> Ashish

Thank you,
Stefano
