Posted to dev@harmony.apache.org by Tony Wu <wu...@gmail.com> on 2008/04/09 05:14:36 UTC

Re: [classlib][text] regression in text module, a non-bug difference?

Hmm, no response for a long time.

I have excluded these test cases so that HUT passes at r646187.
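
For anyone picking this thread up later, the Locale.toString() difference described in the quoted messages can be checked with a small sketch like the one below. The class and method names are made up for illustration; the only API it relies on is java.util.Locale. It assumes an implementation that follows the documented toString() format, in which an empty country still leaves its separator in place when a variant is present.

```java
import java.util.Locale;

// Hypothetical helper class, not part of Harmony or the test suite.
public class LocaleToStringSketch {

    // Builds a Locale from its three fields and returns its string form.
    static String describe(String language, String country, String variant) {
        return new Locale(language, country, variant).toString();
    }

    public static void main(String[] args) {
        // Empty country plus a variant: the separator for the country is kept.
        System.out.println(describe("es", "", "TRADITIONAL"));
        // Ordinary Spanish locale for comparison.
        System.out.println(describe("es", "ES", ""));
    }
}
```

Since ICU is looked up by this string, a single missing underscore is enough to select a different collator.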

On 2/21/08, Tony Wu <wu...@gmail.com> wrote:
> A little further study.
>
> The collation is defined in CLDR. Please refer to the data in locale
> "es" [1]. There is a block describing the traditional collation. I
> quote a part of it below[2]. Let me try to explain a little bit about
> this definition.
>
> First, the term "traditional" is explicitly defined. You can also find
> the definition in UTS#35[3] which says "For a traditional-style sort
> (as in Spanish) ".
>
> Second, the data [2] indicates that the rule in the traditional Spanish
> locale should be ... C<ch<<<Ch<<<CH. The <p> tag means "primary", which
> is to say that "ch" is a base character.
>
> The conclusion is that there IS a traditional Spanish collation rule
> which has a key "ch". The question is: is it necessary for Harmony to
> support it, or should we just match the behavior of the RI?
>
> [1]
> http://www.unicode.org/repository/*checkout*/cldr/common/collation/es.xml?rev=1.21
>
> [2]
> <collation type="traditional">
>  <rules>
> ...
>  <reset>C</reset>
>  <p>ch</p>
>  <t>Ch</t>
>  <t>CH</t>
> ...
>  </rules>
> </collation>
>
> [3]
> http://www.unicode.org/reports/tr35/
>
>
> On 2/20/08, Alexei Zakharov <al...@gmail.com> wrote:
> > ¡Buenos días!
> >
> > :) No, I'm not an expert in Spanish. But after reading your post I got
> > the impression that we support an additional variant of the Spanish
> > language compared to the RI. However, I tried to find something about
> > the traditional Spanish variant in the ICU locale browser and found
> > nothing. I believe we should learn more about this problem before
> > making any decision.
> >
> > Regards,
> > Alexei
> >
> > 2008/2/19, Tony Wu <wu...@gmail.com>:
> > > Hi, all
> > >
> > > I'm investigating the regression [1] in the text module. Actually these 5
> > > failures come down to one cause: support for the traditional Spanish
> > > character "ch". Here is my understanding.
> > >
> > > My fix for HARMONY-5465 makes Locale.toString() compatible with the
> > > RI. Before my commit, toString() of a Locale with an empty "country"
> > > field had only one underscore in its output, but the RI has two. For
> > > instance, new Locale("es","","TRADITIONAL").toString() returned
> > > "es_TRADITIONAL" in Harmony whereas it returns "es__TRADITIONAL" in
> > > the RI. Interestingly, ICU uses the output of toString() as the key
> > > to identify its own Locale instance. That is to say, the 5 test cases
> > > passed before only because they were never run in the real
> > > traditional Spanish locale, so the character "ch" was interpreted as
> > > two separate characters, "c" and "h". That is why we could set the
> > > offset to 1 in our test cases. After my commit, ICU finds the right
> > > Spanish locale, so its behavior is compatible with the spec [2].
> > >
> > > One strange thing is that I cannot get the traditional Spanish locale
> > > in the RI. The RI behaves the same no matter whether the variant
> > > "TRADITIONAL" is present or not. The spec does not say anything about
> > > "traditional", but a web search told me that since 1998 "ch" has no
> > > longer been treated as a separate character in Spanish. I suppose
> > > that the RI changed the behavior of the Spanish locale but forgot to
> > > modify the spec accordingly.
> > >
> > > BTW, for the normal Spanish locale (new Locale("es","ES")), we have
> > > the same behavior as the RI. It seems ICU supports traditional
> > > Spanish in the form of new Locale("es","","TRADITIONAL") but the RI
> > > does not. Run the test case below [3] on the RI to see the difference.
> > >
> > > Is there any expert familiar with Spanish here? I need your advice.
> > >
> > > [1]
> > > http://people.apache.org/~smishura/r628209/Windows_x86/classlib-test/
> > >
> > > [2]
> > > spec says,
> > > For example, consider the following in Spanish:
> > >
> > >  "ca" -> the first key is key('c') and second key is key('a').
> > >  "cha" -> the first key is key('ch') and second key is key('a').
> > >
> > >
> > > [3]
> > >         // Count collation elements for "cha" with the RI (java.text) collator.
> > >         // Needs java.util.Locale and java.text.* imported, and ICU4J on the classpath.
> > >         RuleBasedCollator rbColl = (RuleBasedCollator) Collator
> > >                 .getInstance(new Locale("es", "", "TRADITIONAL"));
> > >         String text = "cha";
> > >         CollationElementIterator iterator = rbColl
> > >                 .getCollationElementIterator(text);
> > >         int keyNum = 0;
> > >         while (iterator.next() != CollationElementIterator.NULLORDER) {
> > >             keyNum++;
> > >         }
> > >         System.out.println("RI has " + keyNum + " keys");
> > >
> > >         // Same count with the ICU collator (fully qualified to avoid name clashes).
> > >         com.ibm.icu.text.RuleBasedCollator r =
> > >                 (com.ibm.icu.text.RuleBasedCollator) com.ibm.icu.text.Collator
> > >                 .getInstance(new Locale("es", "", "TRADITIONAL"));
> > >         com.ibm.icu.text.CollationElementIterator it = r
> > >                 .getCollationElementIterator(text);
> > >         keyNum = 0;
> > >         while (it.next() != com.ibm.icu.text.CollationElementIterator.NULLORDER) {
> > >             keyNum++;
> > >         }
> > >         System.out.println("ICU has " + keyNum + " keys");
> > >
> > >
> > >
> > > The output is:
> > > RI has 3 keys
> > > ICU has 2 keys
> > >
> > >
> > > --
> > > Tony Wu
> > > China Software Development Lab, IBM
> > >
> >
>
>
> --
> Tony Wu
> China Software Development Lab, IBM
>
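
The effect of treating "ch" as a single primary key, as the CLDR rule quoted in the thread does, can also be illustrated without ICU. The sketch below is only an illustration under stated assumptions: the two rule strings are toy alphabets I made up (they are not the CLDR or RI Spanish rules), the class and method names are hypothetical, and it uses java.text.RuleBasedCollator rule syntax, in which '<' marks a primary difference.

```java
import java.text.CollationElementIterator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; the rule strings are toy alphabets, not real locale data.
public class ContractionSketch {

    // Toy rules where "ch" is a contraction sorting between "c" and "d".
    static final String WITH_CH = "< a < c < ch < d < h < z";
    // Same toy alphabet without the contraction.
    static final String WITHOUT_CH = "< a < c < d < h < z";

    static RuleBasedCollator build(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad rules: " + rules, e);
        }
    }

    // Counts the collation elements ("keys" in the thread) produced for text.
    static int countElements(String rules, String text) {
        CollationElementIterator it = build(rules).getCollationElementIterator(text);
        int count = 0;
        while (it.next() != CollationElementIterator.NULLORDER) {
            count++;
        }
        return count;
    }

    // Compares two strings under the given rules.
    static int compare(String rules, String left, String right) {
        return build(rules).compare(left, right);
    }

    public static void main(String[] args) {
        System.out.println("with ch:    " + countElements(WITH_CH, "cha") + " elements");
        System.out.println("without ch: " + countElements(WITHOUT_CH, "cha") + " elements");
        System.out.println("cha vs cza with ch:    " + compare(WITH_CH, "cha", "cza"));
        System.out.println("cha vs cza without ch: " + compare(WITHOUT_CH, "cha", "cza"));
    }
}
```

With the contraction in place, "cha" should sort after "cza" because the "ch" key follows "c"; without it, the comparison falls through to "h" against "z". That ordering change is the kind of difference the excluded test cases are sensitive to.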


-- 
Tony Wu
China Software Development Lab, IBM