Posted to dev@lucenenet.apache.org by George Aroush <ge...@aroush.net> on 2006/12/13 05:45:42 UTC

Sort differences between .NET and Java in Lucene.Net 2.0

Hi folks,

One of the remaining issues with Lucene.Net 2.0 is two tests that are
failing, TestInternationalMultiSearcherSort and TestInternationalSort.

After a few hours of debugging, I discovered that in C#, "H\u00D8T" < "HUT"
but in Java, "H\u00D8T" > "HUT" (here, "\u00D8" is the Unicode escape for
"Ø").

The culture-info / locale used are "en-US" in C# and "Locale.US" in Java.

The failure occurs because, I think, System.Globalization.CompareInfo is
not treating the string as Unicode; "\u00D8" is being treated as ASCII "O".
If that's the case, how do I tell .NET to use Unicode?

If you know why .NET is behaving differently here, please let me know.

Regards,

-- George Aroush

RE: Sort differences between .NET and Java in Lucene.Net 2.0

Posted by George Aroush <ge...@aroush.net>.

Hi Dean,

No, I do not intend to use CompareOrdinal -- that would break Lucene.Net.

I have posted this question on the Java Lucene mailing list; I got one response
suggesting that Java is doing it wrong.  I am certain about this.

I have done some more research, and so far, I agree with your
analysis.  For example, as you said, using the Danish locale gave me the
same result with Java and .NET.

Does everyone agree that this is not an issue with Lucene.Net 2.0, so that
I should release 2.0 as "final"?  I should point out that this same problem
also existed in 1.9.1, 1.9, 1.4.3, 1.4 and earlier releases.  In those
releases, this test didn't exist to expose it.

Regards,

-- George Aroush


-----Original Message-----
From: Dean Harding [mailto:dean.harding@dload.com.au] 
Sent: Wednesday, December 13, 2006 9:39 PM
To: lucene-net-user@incubator.apache.org;
lucene-net-dev@incubator.apache.org
Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0

You certainly don't want CompareOrdinal!

.NET is doing the right thing in this case, but so is Java. 

The problem is that Ø is not a character that is used in US English (or any
English, for that matter), so the actual order that would be returned when
doing a compare in a locale like en-us is not really important.

What IS important is if you do the comparison in the context of a locale
that DOES use the Ø character. If you change your .NET culture name or your
Java locale to (for example) "da" (that is, Danish) then the results are the
same.

So the bug, I believe, is in the test case, which is relying on behavior that
is, in my opinion, undefined.

Dean.


> -----Original Message-----
> From: George Aroush [mailto:george@aroush.net]
> Sent: Thursday, 14 December 2006 1:07 pm
> To: lucene-net-dev@incubator.apache.org
> Cc: lucene-net-user@incubator.apache.org
> Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi Joe and all,
> 
> I don't think we can use CompareOrdinal() as it doesn't take locale 
> into consideration.
> 
> The issue is with the following function in
> Lucene.Net.Search.FieldSortedHitQueue.cs:
> 
>     public int Compare(ScoreDoc i, ScoreDoc j)
>     {
>         return collator.Compare(index[i.doc].ToString(), index[j.doc].ToString());
>     }
> 
> To demonstrate how Java and C# differ in the way they compare, here 
> is some sample code:
> 
>     // C# code: you get back -1 for 'res'
>     string s1 = "H\u00D8T";
>     string s2 = "HUT";
>     System.Globalization.CultureInfo locale = new System.Globalization.CultureInfo("en-US");
>     System.Globalization.CompareInfo collator = locale.CompareInfo;
>     int res = collator.Compare(s1, s2);
> 
>     // Java code: you get back 1 for 'diff'
>     String s1 = "H\u00D8T";
>     String s2 = "HUT";
>     Collator collator = Collator.getInstance(Locale.US);
>     int diff = collator.compare(s1, s2);
> 
> Who is doing the right thing?  Or am I missing additional calls before 
> I can compare?
> 
> My goal is to understand why the difference exists so that we can 
> judge how serious this is and either fix it or accept it as a language 
> difference.
> 
> Btw, I am going to post this question on the Java Lucene mailing list 
> to see what folks in Java land have to say.
> 
> Regards,
> 
> -- George Aroush
> 
> 
> -----Original Message-----
> From: Joe Shaw [mailto:joeshaw@novell.com]
> Sent: Wednesday, December 13, 2006 1:35 PM
> To: lucene-net-dev@incubator.apache.org
> Cc: lucene-net-user@incubator.apache.org
> Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi,
> 
> On Wed, 2006-12-13 at 11:35 -0500, George Aroush wrote:
> > This is why those two tests are failing and I wonder if this is a
> > defect in .NET or in the way the culture info is used in those two
> > languages or if there is more culture setting I have to do in .NET.
> >
> > My thinking is, in .NET during compare, "\u00D8", is being treated 
> > as ASCII "O" and not the Unicode character that it really is.
> 
> This isn't the case, because if so "HOT" would be equal to "H\u00D8T".
> 
> I think that the sort order is just different between .NET and Java -- 
> ie, the order is "O", "\u00D8", "U" in .NET but "O", "U", "\u00D8" in 
> Java -- at least in the culture you're using.
> 
> If you're looking for the actual numerical values of the characters 
> for comparison (in which "\u00D8" would be quite a bit higher than both
> "O" and "U"), you probably want to use String.CompareOrdinal().
> 
> BTW, doing culture-insensitive string comparisons might be a good 
> thing to do anyway.  From the MSDN docs for String.Compare(string,
> string):
> 
>         The comparison uses the current culture to obtain
>         culture-specific information such as casing rules and the
>         alphabetic order of individual characters. For example, a
>         culture could specify that certain combinations of characters be
>         treated as a single character, or uppercase and lowercase
>         characters be compared in a particular way, or that the sorting
>         order of a character depends on the characters that precede or
>         follow it.
> 
> For more info, see the String.Compare() docs:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemStringclassComparetopic.asp
> 
> Joe

RE: Sort differences between .NET and Java in Lucene.Net 2.0

Posted by Dean Harding <de...@dload.com.au>.

You certainly don't want CompareOrdinal!

.NET is doing the right thing in this case, but so is Java. 

The problem is that Ø is not a character that is used in US English (or any
English, for that matter), so the actual order that would be returned when
doing a compare in a locale like en-us is not really important.

What IS important is if you do the comparison in the context of a locale
that DOES use the Ø character. If you change your .NET culture name or your
Java locale to (for example) "da" (that is, Danish) then the results are the
same.
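
A minimal Java sketch of this check: it runs the same comparison under
Locale.US and under a Danish locale and prints the sign of each result. The
class and method names (LocaleCompare, compareIn) are hypothetical, and no
particular output is claimed here, since the collation rules depend on the
locale data shipped with the runtime.

```java
import java.text.Collator;
import java.util.Locale;

class LocaleCompare {
    // Returns the sign (-1, 0 or 1) of a locale-aware comparison of a and b.
    static int compareIn(Locale locale, String a, String b) {
        return Integer.signum(Collator.getInstance(locale).compare(a, b));
    }

    public static void main(String[] args) {
        String s1 = "H\u00D8T";
        String s2 = "HUT";
        Locale danish = Locale.forLanguageTag("da");

        // The en-US result is the one the thread disagrees about.
        System.out.println("en-US: " + compareIn(Locale.US, s1, s2));
        // The Danish result is the one reported to match between Java and .NET.
        System.out.println("da:    " + compareIn(danish, s1, s2));
    }
}
```

Running the equivalent comparison on the .NET side with the "da" culture would
show whether the two platforms agree for a locale that actually uses the
character.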

So the bug, I believe, is in the test case, which is relying on behavior that
is, in my opinion, undefined.

Dean.


> -----Original Message-----
> From: George Aroush [mailto:george@aroush.net]
> Sent: Thursday, 14 December 2006 1:07 pm
> To: lucene-net-dev@incubator.apache.org
> Cc: lucene-net-user@incubator.apache.org
> Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi Joe and all,
> 
> I don't think we can use CompareOrdinal() as it doesn't take locale into
> consideration.
> 
> The issue is with the following function in
> Lucene.Net.Search.FieldSortedHitQueue.cs:
> 
>     public int Compare(ScoreDoc i, ScoreDoc j)
>     {
>         return collator.Compare(index[i.doc].ToString(), index[j.doc].ToString());
>     }
> 
> To demonstrate how Java and C# differ in the way they compare, here is
> some sample code:
> 
>     // C# code: you get back -1 for 'res'
>     string s1 = "H\u00D8T";
>     string s2 = "HUT";
>     System.Globalization.CultureInfo locale = new System.Globalization.CultureInfo("en-US");
>     System.Globalization.CompareInfo collator = locale.CompareInfo;
>     int res = collator.Compare(s1, s2);
> 
>     // Java code: you get back 1 for 'diff'
>     String s1 = "H\u00D8T";
>     String s2 = "HUT";
>     Collator collator = Collator.getInstance(Locale.US);
>     int diff = collator.compare(s1, s2);
> 
> Who is doing the right thing?  Or am I missing additional calls before I
> can compare?
> 
> My goal is to understand why the difference exists so that we can judge
> how serious this is and either fix it or accept it as a language difference.
> 
> Btw, I am going to post this question on the Java Lucene mailing list to
> see what folks in Java land have to say.
> 
> Regards,
> 
> -- George Aroush
> 
> 
> -----Original Message-----
> From: Joe Shaw [mailto:joeshaw@novell.com]
> Sent: Wednesday, December 13, 2006 1:35 PM
> To: lucene-net-dev@incubator.apache.org
> Cc: lucene-net-user@incubator.apache.org
> Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi,
> 
> On Wed, 2006-12-13 at 11:35 -0500, George Aroush wrote:
> > This is why those two tests are failing and I wonder if this is a
> > defect in .NET or in the way the culture info is used in those two
> > languages or if there is more culture setting I have to do in .NET.
> >
> > My thinking is, in .NET during compare, "\u00D8", is being treated as
> > ASCII "O" and not the Unicode character that it really is.
> 
> This isn't the case, because if so "HOT" would be equal to "H\u00D8T".
> 
> I think that the sort order is just different between .NET and Java -- i.e.,
> the order is "O", "\u00D8", "U" in .NET but "O", "U", "\u00D8" in Java --
> at least in the culture you're using.
> 
> If you're looking for the actual numerical values of the characters for
> comparison (in which "\u00D8" would be quite a bit higher than both "O"
> and "U"), you probably want to use String.CompareOrdinal().
> 
> BTW, doing culture-insensitive string comparisons might be a good thing to
> do anyway.  From the MSDN docs for String.Compare(string, string):
> 
>         The comparison uses the current culture to obtain
>         culture-specific information such as casing rules and the
>         alphabetic order of individual characters. For example, a
>         culture could specify that certain combinations of characters be
>         treated as a single character, or uppercase and lowercase
>         characters be compared in a particular way, or that the sorting
>         order of a character depends on the characters that precede or
>         follow it.
> 
> For more info, see the String.Compare() docs:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemStringclassComparetopic.asp
> 
> Joe

RE: Sort differences between .NET and Java in Lucene.Net 2.0

Posted by George Aroush <ge...@aroush.net>.

Hi Joe and all,

I don't think we can use CompareOrdinal() as it doesn't take locale into
consideration.

The issue is with the following function in
Lucene.Net.Search.FieldSortedHitQueue.cs:

    public int Compare(ScoreDoc i, ScoreDoc j)
    {
        return collator.Compare(index[i.doc].ToString(), index[j.doc].ToString());
    }
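
For the Java side, a rough sketch of the same idea: sorting field values with a
locale-specific Collator acting as the comparator. This is only an
illustration, not the Lucene code; the class and method names (CollatedSort,
sortWith) are hypothetical, and where "H\u00D8T" lands in the output depends on
the collation rules of the runtime.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollatedSort {
    // Returns a sorted copy of the values, ordered by the collator of the given locale.
    static List<String> sortWith(Locale locale, List<String> values) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(values);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("HUT", "H\u00D8T", "HAT", "HOT");
        // The position of the U+00D8 term is the part that is locale-dependent.
        System.out.println(sortWith(Locale.US, terms));
    }
}
```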

To demonstrate how Java and C# differ in the way they compare, here is some
sample code:

    // C# code: you get back -1 for 'res'
    string s1 = "H\u00D8T";
    string s2 = "HUT";
    System.Globalization.CultureInfo locale = new System.Globalization.CultureInfo("en-US");
    System.Globalization.CompareInfo collator = locale.CompareInfo;
    int res = collator.Compare(s1, s2);

    // Java code: you get back 1 for 'diff'
    String s1 = "H\u00D8T";
    String s2 = "HUT";
    Collator collator = Collator.getInstance(Locale.US);
    int diff = collator.compare(s1, s2);

Who is doing the right thing?  Or am I missing additional calls before I can
compare?

My goal is to understand why the difference exists so that we can judge how
serious this is and either fix it or accept it as a language difference.

Btw, I am going to post this question on the Java Lucene mailing list to see
what folks in Java land have to say.

Regards,

-- George Aroush

-----Original Message-----
From: Joe Shaw [mailto:joeshaw@novell.com] 
Sent: Wednesday, December 13, 2006 1:35 PM
To: lucene-net-dev@incubator.apache.org
Cc: lucene-net-user@incubator.apache.org
Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0

Hi,

On Wed, 2006-12-13 at 11:35 -0500, George Aroush wrote:
> This is why those two tests are failing and I wonder if this is a
> defect in .NET or in the way the culture info is used in those two
> languages or if there is more culture setting I have to do in .NET.
> 
> My thinking is, in .NET during compare, "\u00D8", is being treated as 
> ASCII "O" and not the Unicode character that it really is.

This isn't the case, because if so "HOT" would be equal to "H\u00D8T".  

I think that the sort order is just different between .NET and Java -- ie,
the order is "O", "\u00D8", "U" in .NET but "O", "U", "\u00D8" in Java -- at
least in the culture you're using.  

If you're looking for the actual numerical values of the characters for
comparison (in which "\u00D8" would be quite a bit higher than both "O"
and "U"), you probably want to use String.CompareOrdinal().

BTW, doing culture-insensitive string comparisons might be a good thing to
do anyway.  From the MSDN docs for String.Compare(string, string):

        The comparison uses the current culture to obtain
        culture-specific information such as casing rules and the
        alphabetic order of individual characters. For example, a
        culture could specify that certain combinations of characters be
        treated as a single character, or uppercase and lowercase
        characters be compared in a particular way, or that the sorting
        order of a character depends on the characters that precede or
        follow it.

For more info, see the String.Compare() docs:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemStringclassComparetopic.asp

Joe

RE: Sort differences between .NET and Java in Lucene.Net 2.0

Posted by Joe Shaw <jo...@novell.com>.

Hi,

On Wed, 2006-12-13 at 11:35 -0500, George Aroush wrote:
> This is why those two tests are failing and I wonder if this is a defect in
> .NET or in the way the culture info is used in those two languages or if
> there is more culture setting I have to do in .NET.
> 
> My thinking is, in .NET during compare, "\u00D8", is being treated as ASCII
> "O" and not the Unicode character that it really is.

This isn't the case, because if so "HOT" would be equal to 
"H\u00D8T".  

I think that the sort order is just different between .NET and Java --
ie, the order is "O", "\u00D8", "U" in .NET but "O", "U", "\u00D8" in
Java -- at least in the culture you're using.  

If you're looking for the actual numerical values of the characters for
comparison (in which "\u00D8" would be quite a bit higher than both "O"
and "U"), you probably want to use String.CompareOrdinal().

BTW, doing culture-insensitive string comparisons might be a good thing
to do anyway.  From the MSDN docs for String.Compare(string, string):
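
For comparison, a minimal Java sketch of an ordinal comparison next to a
locale-aware one. Java's String.compareTo compares UTF-16 code units, so it
plays the role of String.CompareOrdinal() here. The class and method names
(OrdinalVsCollator, ordinal, collated) are made up for illustration, and no
sign is claimed for the collated result because it depends on the collation
rules of the runtime.

```java
import java.text.Collator;
import java.util.Locale;

class OrdinalVsCollator {
    // Ordinal comparison: difference of the first pair of UTF-16 code units that differ.
    static int ordinal(String a, String b) {
        return a.compareTo(b);
    }

    // Locale-aware comparison using the collation rules for the given locale.
    static int collated(String a, String b, Locale locale) {
        return Collator.getInstance(locale).compare(a, b);
    }

    public static void main(String[] args) {
        String hot = "HOT";
        String hut = "HUT";
        String oslash = "H\u00D8T";

        System.out.println(ordinal(hot, oslash)); // 'O' (79) - U+00D8 (216) = -137
        System.out.println(ordinal(hut, oslash)); // 'U' (85) - U+00D8 (216) = -131
        // Sign depends on the en-US collation rules of the runtime.
        System.out.println(collated(hut, oslash, Locale.US));
    }
}
```

The ordinal results never change with the locale, which is why they cannot
stand in for a locale-sensitive sort.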

        The comparison uses the current culture to obtain
        culture-specific information such as casing rules and the
        alphabetic order of individual characters. For example, a
        culture could specify that certain combinations of characters be
        treated as a single character, or uppercase and lowercase
        characters be compared in a particular way, or that the sorting
        order of a character depends on the characters that precede or
        follow it.

For more info, see the String.Compare() docs:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemStringclassComparetopic.asp 

Joe

RE: Sort differences between .NET and Java in Lucene.Net 2.0

Posted by George Aroush <ge...@aroush.net>.

Hi Torsten,

Thanks for the explanation and the sample program.  However, if you change
your code so that "HUT" is used instead of "HOT" (per my original email),
the value returned will now be 1, instead of -1.  In Java, I get -1 which is
what I believe the right answer is.

This is why those two tests are failing and I wonder if this is a defect in
.NET or in the way the culture info is used in those two languages or if
there is more culture setting I have to do in .NET.

My thinking is, in .NET during compare, "\u00D8", is being treated as ASCII
"O" and not the Unicode character that it really is.

Regards,

-- George Aroush
 

-----Original Message-----
From: Torsten Rendelmann [mailto:torsten.rendelmann@gmx.net] 
Sent: Wednesday, December 13, 2006 2:15 AM
To: lucene-net-user@incubator.apache.org;
lucene-net-dev@incubator.apache.org
Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0

Hi George,

CLR always handles "string" as Unicode, but comparison code like
"a".CompareTo("b") will always take the current culture to compare (the ==
operator, by contrast, compares ordinally). So it is even better to use
String.Compare() instead; there you have everything at hand that influences
the result: the comparer used, the culture, case sensitivity, etc.

I tested a little bit with CLR 2.0 (but String.Compare() calls are similar
in CLR 1.1/1.0), here is the code and the results as comments:

[TestMethod]
public void TestMethod1()
{
	string one = "HOT";
	string two = "H\u00D8T";

	int res = String.Compare(one, two); // -1
	Debug.WriteLine(String.Format("String.Compare(one, two): {0}", res));
	res = String.CompareOrdinal(one, two); // -137
	Debug.WriteLine(String.Format("String.CompareOrdinal(one, two): {0}", res));
	res = String.Compare(one, two, StringComparison.InvariantCulture); // -1
	Debug.WriteLine(String.Format("String.Compare(one, two, StringComparison.InvariantCulture): {0}", res));
	res = String.Compare(one, two, true, CultureInfo.CreateSpecificCulture("en-US")); // -1
	Debug.WriteLine(String.Format("String.Compare(one, two, true, CultureInfo.CreateSpecificCulture('en-US')): {0}", res));
	res = String.Compare(one, two, false, CultureInfo.CreateSpecificCulture("en-US")); // -1
	Debug.WriteLine(String.Format("String.Compare(one, two, false, CultureInfo.CreateSpecificCulture('en-US')): {0}", res));
}

String.Compare() doc:
 result < 0		String one is less than two
 result == 0	String one is equal to two
 result > 0		String one is greater than two

Kindly, TorstenR

> -----Original Message-----
> From: George Aroush [mailto:george@aroush.net]
> Sent: Wednesday, December 13, 2006 5:46 AM
> To: lucene-net-dev@incubator.apache.org;
> lucene-net-user@incubator.apache.org
> Subject: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi folks,
> 
> One of the remaining issues with Lucene.Net 2.0 is two tests that are 
> failing, TestInternationalMultiSearcherSort and TestInternationalSort.
> 
> After few hours of debugging, I discovered that in C#, "H\u00D8T" < 
> "HUT"
> but in Java, "H\u00D8T" > "HUT" (here, "H\u00D8T" is in Unicode and is 
> actually "Ø")
> 
> The culture-info / local used are, in C# "en-US" and in Java 
> "Locale.US".
> 
> The fail point occurs because, I think, 
> System.Globalization.CompareInfo is not treating the string as 
> Unicode; "\u00D8" is being treated as ASCII "O".
> If that's the case, how do I tell .NET to use Unicode?
> 
> IF you know why .NET is behaving differently here, please let me know.
> 
> Regards,
> 
> -- George Aroush
>

RE: Sort differences between .NET and Java in Lucene.Net 2.0

Posted by Torsten Rendelmann <to...@gmx.net>.

Hi George,

CLR always handles "string" as Unicode, but comparison code like
"a".CompareTo("b") will always take the current culture to compare (the ==
operator, by contrast, compares ordinally). So it is even better to use
String.Compare() instead; there you have everything at hand that influences
the result: the comparer used, the culture, case sensitivity, etc.

I tested a little bit with CLR 2.0 (but String.Compare() calls are similar
in CLR 1.1/1.0); here is the code, with the results as comments:

[TestMethod]
public void TestMethod1()
{
	string one = "HOT";
	string two = "H\u00D8T";

	int res = String.Compare(one, two); // -1
	Debug.WriteLine(String.Format("String.Compare(one, two): {0}", res));
	res = String.CompareOrdinal(one, two); // -137
	Debug.WriteLine(String.Format("String.CompareOrdinal(one, two): {0}", res));
	res = String.Compare(one, two, StringComparison.InvariantCulture); // -1
	Debug.WriteLine(String.Format("String.Compare(one, two, StringComparison.InvariantCulture): {0}", res));
	res = String.Compare(one, two, true, CultureInfo.CreateSpecificCulture("en-US")); // -1
	Debug.WriteLine(String.Format("String.Compare(one, two, true, CultureInfo.CreateSpecificCulture('en-US')): {0}", res));
	res = String.Compare(one, two, false, CultureInfo.CreateSpecificCulture("en-US")); // -1
	Debug.WriteLine(String.Format("String.Compare(one, two, false, CultureInfo.CreateSpecificCulture('en-US')): {0}", res));
}

String.Compare() doc:
 result < 0		String one is less than two
 result == 0	String one is equal to two
 result > 0		String one is greater than two
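
For comparison with the Java side: the things that influence the result
here (culture, case sensitivity) map roughly onto the locale and strength of
java.text.Collator, and the sign convention of the result is the same. A
minimal sketch, in which the class name and the ignoreCase flag are
hypothetical and only mimic the ignoreCase argument of String.Compare():

```java
import java.text.Collator;
import java.util.Locale;

// Sketch only: CollatorKnobsSketch and its compare() helper are hypothetical.
class CollatorKnobsSketch {
    // Returns -1, 0 or 1, with the same meaning as the String.Compare() result.
    static int compare(String a, String b, Locale locale, boolean ignoreCase) {
        Collator collator = Collator.getInstance(locale);
        // SECONDARY strength ignores case but still distinguishes accents;
        // TERTIARY strength distinguishes case as well.
        collator.setStrength(ignoreCase ? Collator.SECONDARY : Collator.TERTIARY);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        // Case ignored: the two strings compare as equal, so this prints 0.
        System.out.println(compare("hot", "HOT", Locale.US, true));
        // Case respected: the result is non-zero.
        System.out.println(compare("hot", "HOT", Locale.US, false));
    }
}
```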

Kindly, TorstenR

> -----Original Message-----
> From: George Aroush [mailto:george@aroush.net] 
> Sent: Wednesday, December 13, 2006 5:46 AM
> To: lucene-net-dev@incubator.apache.org; 
> lucene-net-user@incubator.apache.org
> Subject: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi folks,
> 
> One of the remaining issues with Lucene.Net 2.0 is two tests that are
> failing, TestInternationalMultiSearcherSort and TestInternationalSort.
> 
> After few hours of debugging, I discovered that in C#, 
> "H\u00D8T" < "HUT"
> but in Java, "H\u00D8T" > "HUT" (here, "H\u00D8T" is in Unicode and is
> actually "Ø")
> 
> The culture-info / local used are, in C# "en-US" and in Java 
> "Locale.US".
> 
> The fail point occurs because, I think, 
> System.Globalization.CompareInfo is
> not treating the string as Unicode; "\u00D8" is being treated 
> as ASCII "O".
> If that's the case, how do I tell .NET to use Unicode?
> 
> IF you know why .NET is behaving differently here, please let me know.
> 
> Regards,
> 
> -- George Aroush
>