You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-dev@db.apache.org by "Knut Anders Hatlen (JIRA)" <ji...@apache.org> on 2008/12/08 18:16:46 UTC

[jira] Commented: (DERBY-3975) SELECT DISTINCT may return duplicates with territory-based collation

    [ https://issues.apache.org/jira/browse/DERBY-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654469#action_12654469 ] 

Knut Anders Hatlen commented on DERBY-3975:
-------------------------------------------

Just a guess, but I think the problem might be that o.a.d.iapi.types.CollatorSQLVarchar doesn't override SQLChar.hashCode().

The reason why the abba/baab duplicate was removed, is that SQLChar.hashCode() just adds up the char values ignoring the positions, so the two strings have the same hash code because they contain the exact same characters (though in different order). Not necessary for this issue, but it would probably be better to calculate with a formula similar to the one described here <URL:http://java.sun.com/javase/6/docs/api/java/lang/String.html#hashCode()>. Just adding the char values will give a poor distribution and the higher bits won't be used unless the string is very long.

> SELECT DISTINCT may return duplicates with territory-based collation
> --------------------------------------------------------------------
>
>                 Key: DERBY-3975
>                 URL: https://issues.apache.org/jira/browse/DERBY-3975
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.4.2.0
>            Reporter: Knut Anders Hatlen
>         Attachments: AisBCollatorProvider.java
>
>
> I defined my own locale (en_US_aisb) where the collation rules said that a=b. When I tried queries with SELECT DISTINCT, they didn't always eliminate all duplicates. Here's an example:
> ij> connect 'jdbc:derby:db;create=true;territory=en_US_aisb;collation=TERRITORY_BASED';
> ij> create table t (x varchar(10));
> 0 rows inserted/updated/deleted
> ij> insert into t values 'a','b','abba','baab','ABBA';
> 5 rows inserted/updated/deleted
> ij> select distinct * from t;
> X         
> ----------
> ABBA      
> b         
> a         
> abba      
> 4 rows selected
> ij> select distinct * from t order by x;
> X         
> ----------
> a         
> abba      
> ABBA      
> 3 rows selected
> The first query did eliminate the duplicate "abba"/"baab", but it did not eliminate the duplicate "a"/"b". When an ORDER BY clause was added (the second query), all the duplicates were eliminated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.