Posted to derby-dev@db.apache.org by "Brett Wooldridge (JIRA)" <ji...@apache.org> on 2014/06/11 09:51:01 UTC

[jira] [Updated] (DERBY-6607) Derby is using territory/collation for equality, not just ordering (incorrectly?)

     [ https://issues.apache.org/jira/browse/DERBY-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brett Wooldridge updated DERBY-6607:
------------------------------------

    Description: 
We have a database that needs case-insensitive comparison, so it was created with collation=TERRITORY_BASED:PRIMARY.  We have customers in both the United States (en_US) and in Japan (ja_JP).

We have an issue in Japan.  Japanese has three character sets: hiragana, katakana, and kanji.  Hiragana is a phonetic alphabet with 46 letters.  Katakana is a parallel phonetic alphabet with the same 46 sounds, written using different character forms, and used for foreign words (words adopted from other languages into Japanese).

Here is the word 'cake' written in katakana: ケーキ (ke- ki)
Here is the word 'cake' written in hiragana: けーき  (ke- ki)

In terms of collation (ordering), Japanese consider these to be equal.  So, in the following Java code, the call to 'compare()' would return 0:
{code:java}
Collator collator = Collator.getInstance(Locale.JAPAN);
collator.setStrength(Collator.PRIMARY);
return collator.compare("ケーキ", "けーき");
{code}
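
To see where a given JVM stops treating two strings as equal, the same comparison can be repeated at each Collator strength.  The following is only a sketch: the class name StrengthDemo and the helper compareAt are made up for illustration, and the kana are written as Unicode escapes so the file compiles regardless of source encoding.  The results for the kana pair depend on the JDK's collation rules, so none are asserted here.

```java
import java.text.Collator;
import java.util.Locale;

final class StrengthDemo {
    // Hypothetical helper: compares a and b with the locale's Collator
    // configured at the given strength.
    static int compareAt(Locale locale, int strength, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String katakana = "\u30B1\u30FC\u30AD"; // ke-ki in katakana
        String hiragana = "\u3051\u30FC\u304D"; // ke-ki in hiragana
        int[] strengths = {
            Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY, Collator.IDENTICAL
        };
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY", "IDENTICAL"};
        for (int i = 0; i < strengths.length; i++) {
            // 0 means the collator treats the pair as equal at this strength.
            System.out.println(names[i]
                + " cat/CAT=" + compareAt(Locale.JAPAN, strengths[i], "cat", "CAT")
                + " kana=" + compareAt(Locale.JAPAN, strengths[i], katakana, hiragana));
        }
    }
}
```

Running this shows at which strength, if any, the JVM separates the kana pair while still folding case.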

And therein lies the issue.  With respect to _ordering_ they are indeed equivalent; however, a Japanese reader would consider them distinct (non-equivalent) values.

When a table is declared with a UNIQUE constraint on a column, or a PRIMARY KEY column, if 'ケーキ' exists in the table, Derby will throw a unique constraint violation upon an attempt to insert 'けーき'.
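
The effect can be imitated outside Derby with a sorted set that uses the Collator as its comparator.  This is an analogy only, not Derby's index code: the class UniqueKeySketch and the method collatedSet are hypothetical names, and the point is simply that any structure which decides "duplicate" by compare() == 0 will reject a second value the collator considers equal.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class UniqueKeySketch {
    // Builds a set whose notion of "duplicate" is collator.compare(a, b) == 0,
    // loosely mirroring a unique index on a collated column.
    static Set<String> collatedSet(Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return new TreeSet<String>(collator);
    }

    public static void main(String[] args) {
        Set<String> keys = collatedSet(Locale.JAPAN, Collator.PRIMARY);
        System.out.println(keys.add("cat")); // first value is accepted
        System.out.println(keys.add("CAT")); // false if the collator ignores case
        keys.add("\u30B1\u30FC\u30AD");      // ke-ki in katakana
        // false here would correspond to the unique constraint violation above
        System.out.println(keys.add("\u3051\u30FC\u304D")); // ke-ki in hiragana
    }
}
```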

We need collation=TERRITORY_BASED:PRIMARY or TERRITORY_BASED:SECONDARY for case-insensitivity _and_ at the same time need these values to be treated as unique.

Is it "correct" for Derby to use the collation when determining value equivalence, as opposed to ordering equivalence?

At the same time, I understand that this is tricky.  Japanese has no "upper-case" and "lower-case" for hiragana, katakana, or kanji; however, Japanese text also uses "romaji" (roman characters), which are essentially ASCII and therefore case-sensitive.  Collation is merely used for ordering.  Ideally, when TERRITORY_BASED:PRIMARY/SECONDARY is used for Japanese, 'cat' and 'CAT' would be equivalent but 'ケーキ' and 'けーき' _would not be_.  Unfortunately, there is only one Collator and it will identify _both_ of these as equivalent.
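
For reference, the equality semantics we are after can be expressed without a Collator at all, by folding case and nothing else.  This is a sketch of the desired behaviour, not something Derby provides: CaseFoldedKey, key and sameKey are hypothetical names, and applying it in practice (for example via a separately maintained lower-cased column) is an assumption on our side.

```java
import java.util.Locale;

final class CaseFoldedKey {
    // Hypothetical helper: folds case only.  Kana have no case mapping, so
    // the katakana and hiragana spellings remain different keys.
    static String key(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    static boolean sameKey(String a, String b) {
        return key(a).equals(key(b));
    }

    public static void main(String[] args) {
        String katakana = "\u30B1\u30FC\u30AD"; // ke-ki in katakana
        String hiragana = "\u3051\u30FC\u304D"; // ke-ki in hiragana
        System.out.println(sameKey("cat", "CAT"));       // prints true
        System.out.println(sameKey(katakana, hiragana)); // prints false
    }
}
```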


  was:
We have a database that needs case-insensitive comparison, so it was created with collation=TERRITORY_BASED:PRIMARY.  We have customers in both the United States (en_US) and in Japan (ja_JP).

We have an issue in Japan.  Japanese has three character sets: hiragana, katakana, and kanji.  Hiragana is a phonetic alphabet with 46 letters.  Katakana is a parallel phonetic alphabet with the same 46 sounds, written using different character forms, and used for foreign words (words adopted from other languages into Japanese).

Here is the word 'cake' written in katakana: ケーキ (ke- ki)
Here is the word 'cake' written in hiragana: けーき  (ke- ki)

In terms of collation (ordering), Japanese consider these to be equal.  So, in the following Java code, the call to 'compare()' would return 0:
{code:java}
Collator collator = Collator.getInstance(Locale.JAPAN);
collator.setStrength(Collator.PRIMARY);
return collator.compare("ケーキ", "けーき");
{code}

And therein lies the issue.  With respect to _ordering_ they are indeed equivalent; however, a Japanese reader would consider them distinct (non-equivalent) values.

When a table is declared with a UNIQUE constraint on a column, or a PRIMARY KEY column, if 'ケーキ' exists in the table, Derby will throw a unique constraint violation upon an attempt to insert 'けーき'.

We need collation=TERRITORY_BASED:PRIMARY or TERRITORY_BASED:SECONDARY for case-insensitivity _and_ at the same time need these values to be treated as unique.

Is it "correct" for Derby to use the collation when determining value equivalence, as opposed to ordering equivalence?  It seems to us that collation should only be used with respect to ordering, but not with respect to DML (INSERT, UPDATE, DELETE).



> Derby is using territory/collation for equality, not just ordering (incorrectly?)
> ---------------------------------------------------------------------------------
>
>                 Key: DERBY-6607
>                 URL: https://issues.apache.org/jira/browse/DERBY-6607
>             Project: Derby
>          Issue Type: Bug
>          Components: Localization
>    Affects Versions: 10.10.2.0
>            Reporter: Brett Wooldridge
>
> We have a database that needs case-insensitive comparison, so it was created with collation=TERRITORY_BASED:PRIMARY.  We have customers in both the United States (en_US) and in Japan (ja_JP).
> We have an issue in Japan.  Japanese has three character sets: hiragana, katakana, and kanji.  Hiragana is a phonetic alphabet with 46 letters.  Katakana is a parallel phonetic alphabet with the same 46 sounds, written using different character forms, and used for foreign words (words adopted from other languages into Japanese).
> Here is the word 'cake' written in katakana: ケーキ (ke- ki)
> Here is the word 'cake' written in hiragana: けーき  (ke- ki)
> In terms of collation (ordering), Japanese consider these to be equal.  So, in the following Java code, the call to 'compare()' would return 0:
> {code:java}
> Collator collator = Collator.getInstance(Locale.JAPAN);
> collator.setStrength(Collator.PRIMARY);
> return collator.compare("ケーキ", "けーき");
> {code}
> And therein lies the issue.  With respect to _ordering_ they are indeed equivalent; however, a Japanese reader would consider them distinct (non-equivalent) values.
> When a table is declared with a UNIQUE constraint on a column, or a PRIMARY KEY column, if 'ケーキ' exists in the table, Derby will throw a unique constraint violation upon an attempt to insert 'けーき'.
> We need collation=TERRITORY_BASED:PRIMARY or TERRITORY_BASED:SECONDARY for case-insensitivity _and_ at the same time need these values to be treated as unique.
> Is it "correct" for Derby to use the collation when determining value equivalence, as opposed to ordering equivalence?
> At the same time, I understand that this is tricky.  Japanese has no "upper-case" and "lower-case" for hiragana, katakana, or kanji; however, Japanese text also uses "romaji" (roman characters), which are essentially ASCII and therefore case-sensitive.  Collation is merely used for ordering.  Ideally, when TERRITORY_BASED:PRIMARY/SECONDARY is used for Japanese, 'cat' and 'CAT' would be equivalent but 'ケーキ' and 'けーき' _would not be_.  Unfortunately, there is only one Collator and it will identify _both_ of these as equivalent.


