You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@harmony.apache.org by Tim Ellison <t....@gmail.com> on 2010/09/16 15:05:20 UTC

[classlib] String.toLowerCase/toUpperCase incorrect for supplementary characters (HARMONY-6649)

On 16/Sep/2010 12:32, Robert Muir (JIRA) wrote:
> Does this mean harmony might need these methods for its own internal
> use before ICU is available?

Yes, String is used early in the bootstrapping, and having dependencies
on ICU functionality leads to an initialization circularity.

i.e. if I simply implement String#toUpperCase(Locale) as
"return UCharacter.toUpperCase(locale, this)"
then we fail to boot with

java.nio.charset.Charset (initialization failure)
     at java/lang/String.defaultCharset (String.java:736)
     at java/lang/String.<init> (String.java:232)
     at org/apache/harmony/luni/util/Util.toString (Util.java:102)
     at java/lang/System.getPropertyList (Native Method)
     at java/lang/System.ensureProperties (System.java:546)
     at java/lang/System.<clinit> (System.java:102)

Likely because we use nio Charset in the String implementation, and that
 in turn eventually calls String.toUppercase(), in CharsetProviderImpl
lines 113 and 145.

> When doing the toLowerCase(Locale)/toUpperCase(Locale), perhaps
> String.java could do:
> 
> if (locale is not Turkish or Azeri or Lithuanian) while (ch < 0x7f) (
> just do optimized fast subtraction/addition ) ... // bail out
> completely and invoke 'UCharacter.xxx'

So I wrote a prototype version of this, that does the toUpperCase for
ascii (ch<128) in-lined (i.e. using simple math), then bailing out to
ICU for chars outside that range.

It seems to boot ok with that scheme, however, I'm running on a machine
with UK locale and running our code written in ascii text.  I'll need to
think about whether the boot sequence could ever get to the point where
it needs to uppercase non-ascii.

It may be better to take out the calls of toUpperCase from nio Charset,
and have a local method to deal with those, since there are restrictions
on the strings that can be used to name charsets.

> this might be good for performance reasons?

Yes, though hopefully ICU also optimize the simple cases too.

> And harmony itself, if it uses this method at this point, is likely
> using Locale.ENGLISH or similar for consistent behavior (filenames,
> etc) ?

The locale is set up early, but ascii is ascii no matter the locale, so
uppercasing that simpler set should be fine.

> Sorry I'm not too familiar with the codebase so I'm not sure
> if it would work. But it might speed up 'typical' lowercasing in any
> case, and as far as worst-case 2x for the "special casing": i find
> the "special" casing is going to be slow anyway: e.g. the Greek sigma
> example needs to calculate word boundaries!
> 
> the Turkish/Azeri case is trickier than the existing code, I think it
> should use UCharacter.XXX too. The reason is it has to be able to
> handle the case from SpecialCasing where the 'dotted I' is written in
> decomposed form (e.g. I + COMBINING DOT ABOVE)

Yep, I'm happy to let ICU handle these interesting cases.

Regards,
Tim

Re: [classlib] String.toLowerCase/toUpperCase incorrect for supplementary characters (HARMONY-6649)

Posted by Tim Ellison <t....@gmail.com>.

On 16/Sep/2010 16:04, Robert Muir wrote:
> On Thu, Sep 16, 2010 at 10:50 AM, Tim Ellison <t....@gmail.com> wrote:
> 
>> The principle works ok.  I attached a patch on HARMONY-6649 to show that
>> making a local toUpperCase() method for the charset names solves the
>> circularity problem in bootstrapping, and does the right thing
>> irrespective of locale.
>>
>> There are some compatibility issues though, since on the RI
>>  String foo="foo";
>>  foo.toLowerCase() == foo
>>
>> but it doesn't if we simply use ICU's toLowerCase method :-(
>>
> 
> maybe we could ask ICU if they could fix this?

Yes, worth asking.

> i looked at their toLowerCase/toUpperCase methods and it wouldnt
> require too much hacking: their UCaseProps returns ~ch whenever ch
> 'folds to itself', so its easy to track if its 'unchanged' without
> doing a second pass/equals().

I'm deliberately not looking at the ICU impl, just so the work in
Harmony is all our own (avoid license confusion etc)

Since the uppercasing rules may depend upon context etc, I can only
imagine that we'll have to do the conversion via ICU then compare the
results to see if they are different.  It will require a pass through
the char array again though.

 public String toUpperCase(Locale locale) {
     String result = UCharacter.toUpperCase(locale, this);

     // Must return self if chars unchanged
     if (count != result.count) {
         return result;
     }
     for (int i = 0; i < count; i++) {
         if (value[offset + i] != result.value[result.offset + i]) {
             return result;
         }
     }
     return this;
 }


> they might be interested in improved compatibility also.

They may have compatibility issues the other way round, i.e. their users
expecting different objects back.

>> I also expect we need to fix String#equalsIgnoreCase() to do the right
>> thing...
>>
>
> yes, i noticed this too, we don't have to deal with Locale issues there, but
> we have to iterate codepoints / compare with Character.toLowerCase(int) and
> Character.toUpperCase(int)...

Ack

Regards,
Tim

Re: [classlib] String.toLowerCase/toUpperCase incorrect for supplementary characters (HARMONY-6649)

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Sep 16, 2010 at 10:50 AM, Tim Ellison <t....@gmail.com> wrote:

>
> The principle works ok.  I attached a patch on HARMONY-6649 to show that
> making a local toUpperCase() method for the charset names solves the
> circularity problem in bootstrapping, and does the right thing
> irrespective of locale.
>
> There are some compatibility issues though, since on the RI
>  String foo="foo";
>  foo.toLowerCase() == foo
>
> but it doesn't if we simply use ICU's toLowerCase method :-(
>

maybe we could ask ICU if they could fix this? i looked at their
toLowerCase/toUpperCase methods and it wouldnt require too much hacking:
their UCaseProps returns ~ch whenever ch 'folds to itself', so its easy to
track if its 'unchanged' without doing a second pass/equals().

they might be interested in improved compatibility also.

>
> I also expect we need to fix String#equalsIgnoreCase() to do the right
> thing...
>
>
yes, i noticed this too, we don't have to deal with Locale issues there, but
we have to iterate codepoints / compare with Character.toLowerCase(int) and
Character.toUpperCase(int)...

-- 
Robert Muir
rcmuir@gmail.com

Re: [classlib] String.toLowerCase/toUpperCase incorrect for supplementary characters (HARMONY-6649)

Posted by Tim Ellison <t....@gmail.com>.

On 16/Sep/2010 14:30, Robert Muir wrote:
> On Thu, Sep 16, 2010 at 9:26 AM, Tim Ellison <t....@gmail.com> wrote:
> 
>> Yes, I agree (though I don't know what happens when it is uppercased
>> 'properly' in Turkish)
>>
> 
> in this case it would be uppercased to US-ASCİİ (note the dotted i's)
> 
> 
>> Nah, the Charset name is obliged to be a strict subset of characters
>> [1], so I propose we just deal with the transform directly in
>> CharsetProviderImpl without asking Character.
>>
>> [1]
>>
>> http://download.oracle.com/javase/1.5.0/docs/api/java/nio/charset/Charset.html
>>
>>
> +1, seems like now i understand why this constraint exists!

The principle works ok.  I attached a patch on HARMONY-6649 to show that
making a local toUpperCase() method for the charset names solves the
circularity problem in bootstrapping, and does the right thing
irrespective of locale.

There are some compatibility issues though, since on the RI
  String foo="foo";
  foo.toLowerCase() == foo

but it doesn't if we simply use ICU's toLowerCase method :-(

I also expect we need to fix String#equalsIgnoreCase() to do the right
thing...

Regards,
Tim

Re: [classlib] String.toLowerCase/toUpperCase incorrect for supplementary characters (HARMONY-6649)

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Sep 16, 2010 at 9:26 AM, Tim Ellison <t....@gmail.com> wrote:

>
> Yes, I agree (though I don't know what happens when it is uppercased
> 'properly' in Turkish)
>

in this case it would be uppercased to US-ASCİİ (note the dotted i's)


>
> Nah, the Charset name is obliged to be a strict subset of characters
> [1], so I propose we just deal with the transform directly in
> CharsetProviderImpl without asking Character.
>
> [1]
>
> http://download.oracle.com/javase/1.5.0/docs/api/java/nio/charset/Charset.html
>
>
+1, seems like now i understand why this constraint exists!


-- 
Robert Muir
rcmuir@gmail.com

Re: [classlib] String.toLowerCase/toUpperCase incorrect for supplementary characters (HARMONY-6649)

Posted by Tim Ellison <t....@gmail.com>.

On 16/Sep/2010 14:16, Robert Muir wrote:
> On Thu, Sep 16, 2010 at 9:05 AM, Tim Ellison <t....@gmail.com> wrote:
> 
>> On 16/Sep/2010 12:32, Robert Muir (JIRA) wrote:
>>> Does this mean harmony might need these methods for its own internal
>>> use before ICU is available?
>> Yes, String is used early in the bootstrapping, and having dependencies
>> on ICU functionality leads to an initialization circularity.
>>
>> i.e. if I simply implement String#toUpperCase(Locale) as
>> "return UCharacter.toUpperCase(locale, this)"
>> then we fail to boot with
>>
>> java.nio.charset.Charset (initialization failure)
>>     at java/lang/String.defaultCharset (String.java:736)
>>     at java/lang/String.<init> (String.java:232)
>>     at org/apache/harmony/luni/util/Util.toString (Util.java:102)
>>     at java/lang/System.getPropertyList (Native Method)
>>     at java/lang/System.ensureProperties (System.java:546)
>>     at java/lang/System.<clinit> (System.java:102)
>>
>> Likely because we use nio Charset in the String implementation, and that
>>  in turn eventually calls String.toUppercase(), in CharsetProviderImpl
>> lines 113 and 145.
>
> Really I think this is a bug in CharsetProviderImpl.
> Because this calls toUpperCase with the default Locale, if i am on a Turkish
> computer, and i do Charset.forName("us-ascii"),
> its going to uppercase it turkish-style and it won't find the charset. but
> it will work fine most anywhere else.

Yes, I agree (though I don't know what happens when it is uppercased
'properly' in Turkish)

> I think this should be using Locale.ENGLISH here, as its a case-insensitive
> comparison, not intended for display.
> 
> Or, optionally something closer to Unicode's 'simple case folding' could be
> available for situations like this, that simply iterates thru the string and
> does the Locale-insensitive Character.toLowerCase(i) on each character.

Nah, the Charset name is obliged to be a strict subset of characters
[1], so I propose we just deal with the transform directly in
CharsetProviderImpl without asking Character.

[1]
http://download.oracle.com/javase/1.5.0/docs/api/java/nio/charset/Charset.html

Regards,
Tim

Re: [classlib] String.toLowerCase/toUpperCase incorrect for supplementary characters (HARMONY-6649)

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Sep 16, 2010 at 9:05 AM, Tim Ellison <t....@gmail.com> wrote:

> On 16/Sep/2010 12:32, Robert Muir (JIRA) wrote:
> > Does this mean harmony might need these methods for its own internal
> > use before ICU is available?
>
> Yes, String is used early in the bootstrapping, and having dependencies
> on ICU functionality leads to an initialization circularity.
>
> i.e. if I simply implement String#toUpperCase(Locale) as
> "return UCharacter.toUpperCase(locale, this)"
> then we fail to boot with
>
> java.nio.charset.Charset (initialization failure)
>     at java/lang/String.defaultCharset (String.java:736)
>     at java/lang/String.<init> (String.java:232)
>     at org/apache/harmony/luni/util/Util.toString (Util.java:102)
>     at java/lang/System.getPropertyList (Native Method)
>     at java/lang/System.ensureProperties (System.java:546)
>     at java/lang/System.<clinit> (System.java:102)
>
> Likely because we use nio Charset in the String implementation, and that
>  in turn eventually calls String.toUppercase(), in CharsetProviderImpl
> lines 113 and 145.
>
>
Really I think this is a bug in CharsetProviderImpl.
Because this calls toUpperCase with the default Locale, if i am on a Turkish
computer, and i do Charset.forName("us-ascii"),
its going to uppercase it turkish-style and it won't find the charset. but
it will work fine most anywhere else.

I think this should be using Locale.ENGLISH here, as its a case-insensitive
comparison, not intended for display.

Or, optionally something closer to Unicode's 'simple case folding' could be
available for situations like this, that simply iterates thru the string and
does the Locale-insensitive Character.toLowerCase(i) on each character.


-- 
Robert Muir
rcmuir@gmail.com