Posted to dev@harmony.apache.org by Tony Wu <wu...@gmail.com> on 2008/02/19 17:20:12 UTC

[classlib][text] regression in text module, a non-bug difference?

Hi, all

I'm investigating the regression[1] in the text module. These 5
failures all come down to one cause: support for the traditional Spanish
character "ch". Here is my understanding.

My fix for HARMONY-5465 makes Locale.toString() compatible with the
RI. Before my commit, toString() on a Locale with an empty "country"
field produced only one underscore, whereas the RI produces two. For
instance, new Locale("es","","TRADITIONAL").toString() returned
"es_TRADITIONAL" in Harmony but returns "es__TRADITIONAL" in the RI.
Interestingly, ICU uses the output of toString() as the key that
identifies its Locale instance. That is to say, the 5 test cases passed
before only because they were never run in the real traditional Spanish
locale, so "ch" was interpreted as two separate characters, "c" and
"h". That is why we could set the offset to 1 in our test cases. After
my commit, ICU finds the right Spanish locale, so its behavior is
compatible with the spec[2].
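
To make the toString() difference concrete, here is a minimal sketch
that uses only java.util.Locale. The class and method names are made up
for illustration. The values in the comments follow the documented
format of Locale.toString() (language, "_", country, "_", variant).

```java
import java.util.Locale;

public class LocaleToStringDemo {
    // Hypothetical helper: Spanish, empty country, variant "TRADITIONAL".
    static String traditionalSpanishId() {
        return new Locale("es", "", "TRADITIONAL").toString();
    }

    // Hypothetical helper: the normal Spanish (Spain) locale.
    static String normalSpanishId() {
        return new Locale("es", "ES").toString();
    }

    public static void main(String[] args) {
        // The empty country still contributes its separator, hence two underscores.
        System.out.println(traditionalSpanishId()); // es__TRADITIONAL
        System.out.println(normalSpanishId());      // es_ES
    }
}
```

Any code that uses this string as a lookup key will see a different key
once the second underscore appears.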

One strange thing is that I cannot get the traditional Spanish locale
in the RI. The RI behaves the same whether or not the variant
"TRADITIONAL" is given. The spec does not say anything about
"traditional", but from a web search I learned that "ch" has not been
treated as a separate character in Spanish since 1998. I suppose the RI
changed the behavior of the Spanish locale but the spec was not updated
accordingly.

BTW, for the normal Spanish locale (new Locale("es","ES")), we behave
the same as the RI. It seems ICU supports traditional Spanish in the
form of new Locale("es","","TRADITIONAL") but the RI does not. Run the
test case below[3] on the RI to see the difference.

Is there anyone here who is familiar with Spanish? I need your advice.

[1]
http://people.apache.org/~smishura/r628209/Windows_x86/classlib-test/

[2]
spec says,
For example, consider the following in Spanish:

 "ca" -> the first key is key('c') and second key is key('a').
 "cha" -> the first key is key('ch') and second key is key('a').


[3]
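import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class TraditionalSpanishKeys {
    public static void main(String[] args) {
        Locale locale = new Locale("es", "", "TRADITIONAL");
        String text = "cha";

        // Count the collation elements produced by the java.text collator.
        RuleBasedCollator rbColl = (RuleBasedCollator) Collator.getInstance(locale);
        CollationElementIterator iterator = rbColl.getCollationElementIterator(text);
        int keyNum = 0;
        while (iterator.next() != CollationElementIterator.NULLORDER) {
            keyNum++;
        }
        System.out.println("RI has " + keyNum + " keys");

        // Count them again with the ICU4J collator for the same locale.
        com.ibm.icu.text.RuleBasedCollator r = (com.ibm.icu.text.RuleBasedCollator)
                com.ibm.icu.text.Collator.getInstance(locale);
        com.ibm.icu.text.CollationElementIterator it = r.getCollationElementIterator(text);
        keyNum = 0;
        while (it.next() != com.ibm.icu.text.CollationElementIterator.NULLORDER) {
            keyNum++;
        }
        System.out.println("ICU has " + keyNum + " keys");
    }
}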



The output is:
RI has 3 keys
ICU has 2 keys


-- 
Tony Wu
China Software Development Lab, IBM

Re: [classlib][text] regression in text module, a non-bug difference?

Posted by Tony Wu <wu...@gmail.com>.

Hmm, no response for a long time.

I have excluded these test cases at r646187 so that HUT passes.

On 2/21/08, Tony Wu <wu...@gmail.com> wrote:
> A little further study.
>
> The collation is defined in CLDR. Please refer to the data in locale
> "es" [1]. There is a block describing the traditional collation. I
> quote a part of it below[2]. Let me try to explain a little bit about
> this definition.
>
> First, the term "traditional" is explicitly defined. You can also find
> the definition in UTS#35[3] which says "For a traditional-style sort
> (as in Spanish) ".
>
> Second, the data[2] indicates that the rule in the traditional Spanish
> locale should be ... C < ch <<< Ch <<< CH. The <p> tag means "primary",
> which is to say that "ch" is a base character.
>
> The conclusion is that there IS a traditional Spanish collation rule
> which has a key "ch". The question is: "Is it necessary for Harmony to
> support it, or should it just behave the same as the RI?"
>
> [1]
> http://www.unicode.org/repository/*checkout*/cldr/common/collation/es.xml?rev=1.21
>
> [2]
> <collation type="traditional">
>   <rules>
> ...
>     <reset>C</reset>
>     <p>ch</p>
>     <t>Ch</t>
>     <t>CH</t>
> ...
>   </rules>
> </collation>
>
> [3]
> http://www.unicode.org/reports/tr35/
>
>
> --
> Tony Wu
> China Software Development Lab, IBM
>


-- 
Tony Wu
China Software Development Lab, IBM

Re: [classlib][text] regression in text module, a non-bug difference?

Posted by Tony Wu <wu...@gmail.com>.

A little further study.

The collation is defined in CLDR. Please refer to the data in locale
"es" [1]. There is a block describing the traditional collation. I
quote a part of it below[2]. Let me try to explain a little bit about
this definition.

First, the term "traditional" is explicitly defined. You can also find
the definition in UTS#35[3] which says "For a traditional-style sort
(as in Spanish) ".

Second, the data[2] indicates that the rule in the traditional Spanish
locale should be ... C < ch <<< Ch <<< CH. The <p> tag means "primary",
which is to say that "ch" is a base character.

The conclusion is that there IS a traditional Spanish collation rule
which has a key "ch". The question is: "Is it necessary for Harmony to
support it, or should it just behave the same as the RI?"
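
As a rough illustration of what a primary "ch" key means, here is a
sketch using java.text.RuleBasedCollator with two toy rule strings. The
rules, class name and helper names are hypothetical and cover only a few
letters; they are not the CLDR data. Note that java.text writes a
tertiary difference as "," where the rule above uses "<<<".

```java
import java.text.CollationElementIterator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class ChContractionDemo {
    // Toy rules (assumption): "ch" is its own primary key between "c" and "d".
    static final String TRADITIONAL_RULES = "< a < c, C < ch, Ch, CH < d < h < o";
    // Toy rules (assumption): no "ch" key, so "c" and "h" stay separate.
    static final String MODERN_RULES = "< a < c, C < d < h < o";

    // Builds a collator from a rule string, rethrowing syntax errors unchecked.
    static RuleBasedCollator build(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad collation rules", e);
        }
    }

    // Counts the collation elements (keys) that the collator produces for the text.
    static int countElements(RuleBasedCollator collator, String text) {
        CollationElementIterator it = collator.getCollationElementIterator(text);
        int count = 0;
        while (it.next() != CollationElementIterator.NULLORDER) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        RuleBasedCollator traditional = build(TRADITIONAL_RULES);
        RuleBasedCollator modern = build(MODERN_RULES);

        System.out.println("traditional keys for cha: " + countElements(traditional, "cha"));
        System.out.println("modern keys for cha: " + countElements(modern, "cha"));

        // Sign of the comparison shows where "cha" sorts relative to "co".
        System.out.println("traditional cha vs co: " + Integer.signum(traditional.compare("cha", "co")));
        System.out.println("modern cha vs co: " + Integer.signum(modern.compare("cha", "co")));
    }
}
```

With the toy traditional rules, "cha" should yield one key fewer than
with the modern ones, and it should sort after "co" instead of before
it, because the "ch" key ranks after every "c" entry.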

[1]
http://www.unicode.org/repository/*checkout*/cldr/common/collation/es.xml?rev=1.21

[2]
<collation type="traditional">
  <rules>
...
    <reset>C</reset>
    <p>ch</p>
    <t>Ch</t>
    <t>CH</t>
...
  </rules>
</collation>

[3]
http://www.unicode.org/reports/tr35/


On 2/20/08, Alexei Zakharov <al...@gmail.com> wrote:
> ¡Buenos días!
>
> :) No, I'm not an expert in Spanish. But after reading your post I got
> the impression that we support an additional variant of Spanish
> compared to the RI. However, I tried to find something about a
> traditional Spanish variant in the ICU locale browser and found
> nothing. I believe we should learn more about this problem before
> making any decision.
>
> Regards,
> Alexei
>
>


-- 
Tony Wu
China Software Development Lab, IBM

Re: [classlib][text] regression in text module, a non-bug difference?

Posted by Alexei Zakharov <al...@gmail.com>.

¡Buenos días!

:) No, I'm not an expert in Spanish. But after reading your post I got
the impression that we support an additional variant of Spanish
compared to the RI. However, I tried to find something about a
traditional Spanish variant in the ICU locale browser and found
nothing. I believe we should learn more about this problem before
making any decision.

Regards,
Alexei
