Posted to dev@harmony.apache.org by Stepan Mishura <st...@gmail.com> on 2006/03/24 07:56:46 UTC

[bug-to-bug] UTF-8: interpreting non-shortest forms

According to the Unicode Standard 4.0 (and since 3.0), interpreting
non-shortest forms is forbidden for UTF-8. So if a byte sequence is not in
the table of well-formed UTF-8 byte sequences, it is considered ill-formed
and must be treated as an error. Harmony follows the Unicode spec, but the RI
doesn't. I didn't find an explanation in the spec, but I assume the RI
behaviour is kept for backward compatibility.

The following example demonstrates the difference. Code point 1071 (U+042F)
should be represented by the UTF-8 byte sequence <D0 AF>. But it can also be
written as the 3-byte sequence <E0 90 AF>, which is its non-shortest form. So
the following code prints "ERROR" on the Harmony implementation and "Ok with
non-shortest forms" on the RI:
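
To make the shortest-form rule concrete, here is a minimal sketch of the arithmetic behind it. It is not Harmony or RI code; the class and method names are hypothetical, and it deliberately skips the other well-formedness checks (lead and continuation byte patterns, surrogates, the U+10FFFF upper bound). It only decodes the payload bits and compares the result against the smallest code point that really needs that many bytes.

```java
class ShortestFormCheck {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes
    // (index 0 is unused).
    private static final int[] MIN_FOR_LENGTH = {0, 0x0, 0x80, 0x800, 0x10000};

    // Decodes one sequence by raw bit arithmetic, ignoring the
    // shortest-form rule and all other validity checks.
    static int rawDecode(byte[] seq) {
        if (seq.length < 1 || seq.length > 4) {
            throw new IllegalArgumentException("length must be 1..4");
        }
        int lead = seq[0] & 0xFF;
        int cp;
        switch (seq.length) {
            case 1:  cp = lead & 0x7F; break;
            case 2:  cp = lead & 0x1F; break;
            case 3:  cp = lead & 0x0F; break;
            default: cp = lead & 0x07; break;
        }
        // Each continuation byte contributes its low six bits.
        for (int i = 1; i < seq.length; i++) {
            cp = (cp << 6) | (seq[i] & 0x3F);
        }
        return cp;
    }

    // True if the decoded value would have fit in a shorter sequence.
    static boolean isNonShortest(byte[] seq) {
        return rawDecode(seq) < MIN_FOR_LENGTH[seq.length];
    }

    public static void main(String[] args) {
        byte[] two = {(byte) 0xD0, (byte) 0xAF};
        byte[] three = {(byte) 0xE0, (byte) 0x90, (byte) 0xAF};
        System.out.println(rawDecode(two) + " non-shortest: " + isNonShortest(two));
        System.out.println(rawDecode(three) + " non-shortest: " + isNonShortest(three));
    }
}
```

Both sequences carry the same payload bits, which is why a decoder that skips the minimum-value check maps them to the same character.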

        String s1 = new String(
                new byte[]{(byte) 0xE0, (byte) 0x90, (byte) 0xAF}, "UTF-8");
        String s2 = new String(new char[]{1071});

        if(s1.equals(s2)){
            System.out.println("Ok with non-shortest forms");
        } else {
            System.out.println("ERROR");
        }
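
The String constructor hides decoding errors, so a more direct probe may be useful. The sketch below is an assumption about how one could test a runtime, not part of the original report: it uses a CharsetDecoder configured with CodingErrorAction.REPORT and assumes Java 7 or later for StandardCharsets. The class name Utf8StrictCheck is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8StrictCheck {
    // Returns true if this runtime's UTF-8 decoder accepts the bytes
    // when asked to report malformed input instead of replacing it.
    static boolean isWellFormed(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] shortest = {(byte) 0xD0, (byte) 0xAF};
        byte[] nonShortest = {(byte) 0xE0, (byte) 0x90, (byte) 0xAF};
        System.out.println("D0 AF accepted:    " + isWellFormed(shortest));
        System.out.println("E0 90 AF accepted: " + isWellFormed(nonShortest));
    }
}
```

A decoder that enforces the shortest-form rule should accept the first sequence and reject the second; a decoder that interprets non-shortest forms would accept both.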

We should decide whether we are going to be compatible with the RI or with
the Unicode spec.

Thanks,
Stepan Mishura
Intel Middleware Products Division

Re: [bug-to-bug] UTF-8: interpreting non-shortest forms

Posted by Richard Liang <ri...@gmail.com>.

Stepan Mishura wrote:
> On 3/27/06, Richard Liang wrote:
>> Nathan Beyer wrote:
>>> I've seen similar differences between other VMs around the handling of
>>> UTF-8 encoded data, especially between Sun and IBM VMs. For example, if
>>> you read a file with a UTF-8 encoding that contains invalid bytes, the
>>> IBM VM will throw an IOException, but the Sun VM will convert the
>>> invalid bytes into the Unicode replacement character U+FFFD (the
>>> diamond-backed question mark).
>>>
>>> Personally, I prefer VMs that explicitly stick to Unicode and the
>>> various encodings and indicate error conditions.
>>
>> Hello Nathan,
>>
>> +1, we shall stick to Unicode and various encodings.
>
> For me it is not obvious, and I cannot make the choice.
> Consider the following hypothetical situation: the next Unicode spec
> update or corrigendum requires a change that breaks Harmony's backward
> compatibility. Should we stick to the new Unicode version or stay backward
> compatible?
>
Hello Stepan,

For this situation, we have three options:
1. Comply with the new version of the Unicode spec
2. Comply with the original version of the Unicode spec
3. Comply with the new version of the Unicode spec while keeping some
   violations

I think 1 and 2 are both proper answers, but 3 is not.

Let's think about why we support Unicode. IMHO, it's because Unicode is a
bridge that ensures interoperability between applications using different
encoding systems. If we announce that we support one version of Unicode
while keeping some violations, how can we ensure interoperability with
other applications?

-- 
Richard Liang
China Software Development Lab, IBM

Re: [bug-to-bug] UTF-8: interpreting non-shortest forms

Posted by Stepan Mishura <st...@gmail.com>.

On 3/27/06, Richard Liang wrote:
>
> Nathan Beyer wrote:
> > I've seen similar differences between other VMs around the handling of
> > UTF-8 encoded data, especially between Sun and IBM VMs. For example, if
> > you read a file with a UTF-8 encoding that contains invalid bytes, the
> > IBM VM will throw an IOException, but the Sun VM will convert the
> > invalid bytes into the Unicode replacement character U+FFFD (the
> > diamond-backed question mark).
> >
> > Personally, I prefer VMs that explicitly stick to Unicode and the
> > various encodings and indicate error conditions.
> >
> Hello Nathan,
>
> +1, we shall stick to Unicode and various encodings.



For me it is not obvious, and I cannot make the choice.
Consider the following hypothetical situation: the next Unicode spec
update or corrigendum requires a change that breaks Harmony's backward
compatibility. Should we stick to the new Unicode version or stay backward
compatible?

Thanks,
Stepan.

--
Thanks,
Stepan Mishura
Intel Middleware Products Division

Re: [bug-to-bug] UTF-8: interpreting non-shortest forms

Posted by Richard Liang <ri...@gmail.com>.

Nathan Beyer wrote:
> I've seen similar differences between other VMs around the handling of UTF-8
> encoded data, especially between Sun and IBM VMs. For example, if you read
> a file with a UTF-8 encoding that contains invalid bytes, the IBM VM
> will throw an IOException, but the Sun VM will convert the invalid bytes
> into the Unicode replacement character U+FFFD (the diamond-backed question
> mark).
>
> Personally, I prefer VMs that explicitly stick to Unicode and the various
> encodings and indicate error conditions.
>
>   
Hello Nathan,

+1, we shall stick to Unicode and various encodings.

RE: [bug-to-bug] UTF-8: interpreting non-shortest forms

Posted by Nathan Beyer <nb...@kc.rr.com>.

I've seen similar differences between other VMs around the handling of UTF-8
encoded data, especially between Sun and IBM VMs. For example, if you read
a file with a UTF-8 encoding that contains invalid bytes, the IBM VM
will throw an IOException, but the Sun VM will convert the invalid bytes
into the Unicode replacement character U+FFFD (the diamond-backed question
mark).
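
As a rough illustration of those two behaviours on a single runtime, the sketch below reads the same bytes through two InputStreamReader instances. It is only an assumption about how one might reproduce the difference, not a description of either vendor's implementation: the lenient reader is built from a Charset, the strict one from a CharsetDecoder set to CodingErrorAction.REPORT. The class and method names are hypothetical, and Java 7 or later is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class InvalidByteHandling {
    // Drains a reader into a String, one char at a time.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Lenient: malformed input is replaced with U+FFFD.
    static String readLenient(byte[] bytes) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            return readAll(r);
        }
    }

    // Strict: malformed input surfaces as an IOException subclass.
    static String readStrict(byte[] bytes) throws IOException {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), strict)) {
            return readAll(r);
        }
    }

    public static void main(String[] args) throws IOException {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] data = {'a', (byte) 0xFF, 'b'};
        System.out.println("lenient: " + readLenient(data));
        try {
            System.out.println("strict: " + readStrict(data));
        } catch (IOException e) {
            System.out.println("strict reader failed: " + e);
        }
    }
}
```

The strict variant makes the error visible to the caller, while the lenient one silently loses the information about which bytes were invalid.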

Personally, I prefer VMs that explicitly stick to Unicode and the various
encodings and indicate error conditions.

-Nathan
