You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-dev@db.apache.org by Laura Stewart <sc...@gmail.com> on 2007/05/16 21:34:35 UTC

Another collation question - Derby-1478 and Derby-2377

As part of adding the new attribute collation=TERRITORY_BASED, I think
that we need to describe how Derby handles collation.

I am trying to get my head around the best way to describe collation
in Derby... for 10.3

In general terms, a collating sequence is a defined ordering for
character data that determines whether a particular character sorts
higher, lower, or the same as another character.  Each character set
will also have a default collation.

In Derby, it is my understanding that all of our string data types are
represented as Unicode sequences.  Is that correct?

We should have a complete list of the data types that are impacted by
collation.
CHAR
VARCHAR
CLOB ?

Does Derby support the national character datatypes such as NCHAR/NVARCHAR2?

FYI - there is a feeling among some in the Internet community that the
term "character set" is not appropriate.  They tout character code,
character encoding, or character repertoire.

Does Derby support specifying codes?  Is that what the attribute
territory=l_CCI (example territory=es_MX) does?

Is there a complete listing of the territories that are supported...
maybe in a Java spec?

When you create a database, can you specify that the
default character set for CHAR columns be ASCII, and the character set
used for NCHAR be UTF8?

The Derby documentation mentions code sets, but only with relationship
to import/export topics or ij sessions...

Any insite that you can provide on this would be appreciated.

-- 
Laura Stewart

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Mamta Satoor <ms...@gmail.com>.

Wow, Mike has done such a great job of covering the questions that I don't
have much to add. Just answer to one of Laura's question
> Is there a complete listing of the territories that are supported...
> maybe in a Java spec?
As Mike says, this feature (DERBY-1478) does not change the existing support
for territories in any ways. Derby 10.2 reference manual under "Setting
attributes for the database connection URL" has a sub-section called
"territory=ll_CC" and it talks about ll and CC and where the valid values
for them can be found.

Laura, thanks for working on the documentation for DERBY-1478. Let us know
if you have any further questions.

Mamta


On 5/16/07, Mike Matrigali <mi...@sbcglobal.net> wrote:
>
>
>
> Laura Stewart wrote:
> > As part of adding the new attribute collation=TERRITORY_BASED, I think
> > that we need to describe how Derby handles collation.
> >
> > I am trying to get my head around the best way to describe collation
> > in Derby... for 10.3
> >
> > In general terms, a collating sequence is a defined ordering for
> > character data that determines whether a particular character sorts
> > higher, lower, or the same as another character.  Each character set
> > will also have a default collation.
> I would also not use character set.  I would approach documenting it
> based on the behavior of datatypes rather than talk about character
> sets.  So CHAR, VARCHAR, LONG VARCHAR and CLOB comparison/ordering/like
> processing is affected.
>
> >
> > In Derby, it is my understanding that all of our string data types are
> > represented as Unicode sequences.  Is that correct?
> I believe the documentation should only speak to the datatypes rather
> than the underlying storage structure.  To understand current
> implementation all operations on character types use either String or
> java char in memory to perform operations.  JDBC defines how one inputs
> data into the datatypes and retrieves data from the datatypes.
> >
> > We should have a complete list of the data types that are impacted by
> > collation.
> > CHAR
> > VARCHAR
> > CLOB ?
> I believe it is
> CHAR
> VARCHAR
> LONG VARCHAR
> CLOB
> >
> > Does Derby support the national character datatypes such as
> > NCHAR/NVARCHAR2?
> No.
> >
> > FYI - there is a feeling among some in the Internet community that the
> > term "character set" is not appropriate.  They tout character code,
> > character encoding, or character repertoire.
> >
> > Does Derby support specifying codes?  Is that what the attribute
> > territory=l_CCI (example territory=es_MX) does?
> >
> > Is there a complete listing of the territories that are supported...
> > maybe in a Java spec?
> Hopefully mamta can expand here.  I hope that we can define our support
> in terms of the standard interfaces we are using from java to perform
> the ordering if a database has been defined to order based on it's
> territory.
>
> I don't believe 10.3 will change the territories supported, it is the
> same set as 10.2 (basically we support what java supports).  10.3 just
> allows collation to be based on territory, all other territory support
> is unchanged.
> >
> > When you create a database, can you specify that the
> > default character set for CHAR columns be ASCII, and the character set
> > used for NCHAR be UTF8?
> No there is no such thing.  We are not specifying a character set.  You
> specify a teritory, this is existing functionality in 10.2.  In 10.3 you
> specify at database creation time if you want collation of all user
> character data to be determined by the territory or not.  In the current
> implementation it does not change the storage format, but I don't think
> that should be part of the documentation.
>
> Do not get confused by what other databases may have to include in such
> a change.  Derby has always used java String/char support which is
> unicode based, so no difference is needed to operate on non-ascii
> character data.  How Derby chooses to read/write those characters to
> disk is even less important for user interface documentation and could
> be changed in the future.  We happen to currently use a modified UTF8
> scheme (modified to support very long strings), but that is never
> exposed to a user.
> >
> > The Derby documentation mentions code sets, but only with relationship
> > to import/export topics or ij sessions...
> right.  The 10.3 functionality does not change any of this, it only
> affects the ordering within the server.  Different operating systems,
> environments may operate on different codesets outside of derby - but
> once the data has gotten in (through an import, ij, jdbc) then data
> is treated same on all systems.  On exit (export, ij, jdbc) the data
> may then get transformed to a native codeset.  None of this is affected
> by the 10.3 collation changes.
> >
> > Any insite that you can provide on this would be appreciated.
> >
>
>

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Bryan Pendleton <bp...@amberpoint.com>.

> In Norwegian, the character sequence "aa" is to be treated as the single
> letter "å" if it is pronounced identically to "å". Since "a" is the
> first letter of the alphabet and "å" the last letter of the alphabet,
> this has consequences for how words are ordered alphabetically.

This was a great message, and really made things clear.

Can I suggest that this would be wonderful background information
to save somewhere, perhaps on the Wiki, as part of the body
of knowledge about how this part of Derby works?

thanks,

bryan

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Mamta Satoor <ms...@gmail.com>.

Knut, I haven't been able to spend much time on your email reponse but I
just wanted to share that Derby is not doing anything special for any
character (meaning soft hyphens or other punctuation characters). We just
rely on RuleBasedCollator provided by JVM for a given locale and we let that
RuleBasedCollator do all the comparisons (the code can be found in
org.apache.derby.iapi.types.WorkHorseForCollatorDatatypes's following
methods : 2 like methods and stringCompare method).

Mamta

On 5/18/07, Knut Anders Hatlen <Kn...@sun.com> wrote:
>
> Mike Matrigali <mi...@sbcglobal.net> writes:
>
> > Thanks, I have not written the like tests yet, and am looking for
> > examples like the following where the result under the default
> > system is different under collation vs default that can be added
> > to the junit tests, but have to admit I don't know much about
> > languages other than english.
>
> I know very little about how collation is defined in the standards, but
> I would guess the trickiest part is the character sequences that map
> into a single collation element, like ch in Spanish or aa in the
> Scandinavian languages. Since I happen to know Norwegian fairly well,
> I'll try to present what I would expect, and then perhaps someone else
> could chime in and explain how/if those expectations map into the
> standards (Unicode, SQL, +++). Hopefully, this could also give you some
> ideas on how to write some meaningful tests.
>
> In Norwegian, the character sequence "aa" is to be treated as the single
> letter "ו" if it is pronounced identically to "ו". Since "a" is the
> first letter of the alphabet and "ו" the last letter of the alphabet,
> this has consequences for how words are ordered alphabetically. However,
> not all occurrences of "aa" are pronounced as "ו". In fact, today it is
> used this way more or less exclusively in family names. You won't find
> any words in a dictionary where a double a is to be pronounced as "ו",
> only in lists of names.
>
> So if you have a word like "ekstraarbeid" (an actual word found in the
> dictionary), it should be listed before "ekstrabetaling" (another actual
> word), even though aa = ו > b, because the double a is pronounced as two
> separate a's.
>
> Similarly, in the phone book, you will find "Haase" before "Hatlen" (aa
> in Haase is a long a, hence counted as two letters), but you'll find
> "Wanvik" before "Waagan" (aa in Waagan is pronounced and alphabetized as
> ו). This has some funny consequences like that the very first name in
> the phone book for Trondheim, Norway is "Aalaei", whereas the last name
> you find in it is "Aavitsland".
>
> So, my expectation is that there is some way to have a list of words
> sorted like this:
>
> Aalaei
> ekstraarbeid
> ekstrabetaling
> Haase
> Hatlen
> Wanvik
> Waagan
> Aavitsland
>
> The way these words are sorted currently with territory based collation
> and Norwegian territory is:
>
> ekstrabetaling
> ekstraarbeid
> Hatlen
> Haase
> Wanvik
> Waagan
> Aalaei
> Aavitsland
>
> I skimmed through the Unicode Collation Algorithm at
> http://unicode.org/reports/tr10/ to find out how this were to be
> handled. A paragraph under 3.1.1 Multiple Mappings said:
>
> Any character (such as soft hyphen) that is not completely ignorable
> between two characters of a contraction will cause them to sort as
> separate characters. Thus a soft hyphen can be used to separate and
> cause distinct weighting of sequences such as Slovak ch or Danish aa
> that would normally weight as units.
>
> This sounds like what I need, and placing a soft hyphen between the a's
> that I wanted to be interpreted as two single letters, did indeed give
> me the sorting order I wanted.
>
> However, even though the sorting seems to ignore the soft hyphens
> (actually, it seems to ignore all kinds of punctuation characters),
> string matching does not ignore them, so 'H_ase' does not match
> 'Ha<soft-hyphen>ase' with the LIKE predicate. Is this supposed to be
> possible, that is, to let LIKE regard 'aa' (or 'a<some-special-char>a')
> as two separate yet consecutive letters?
>
> --
> Knut Anders
>

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.

Mike Matrigali <mi...@sbcglobal.net> writes:

> Thanks, I have not written the like tests yet, and am looking for
> examples like the following where the result under the default
> system is different under collation vs default that can be added
> to the junit tests, but have to admit I don't know much about
> languages other than english.

I know very little about how collation is defined in the standards, but
I would guess the trickiest part is the character sequences that map
into a single collation element, like ch in Spanish or aa in the
Scandinavian languages. Since I happen to know Norwegian fairly well,
I'll try to present what I would expect, and then perhaps someone else
could chime in and explain how/if those expectations map into the
standards (Unicode, SQL, +++). Hopefully, this could also give you some
ideas on how to write some meaningful tests.

In Norwegian, the character sequence "aa" is to be treated as the single
letter "å" if it is pronounced identically to "å". Since "a" is the
first letter of the alphabet and "å" the last letter of the alphabet,
this has consequences for how words are ordered alphabetically. However,
not all occurrences of "aa" are pronounced as "å". In fact, today it is
used this way more or less exclusively in family names. You won't find
any words in a dictionary where a double a is to be pronounced as "å",
only in lists of names.

So if you have a word like "ekstraarbeid" (an actual word found in the
dictionary), it should be listed before "ekstrabetaling" (another actual
word), even though aa = å > b, because the double a is pronounced as two
separate a's.

Similarly, in the phone book, you will find "Haase" before "Hatlen" (aa
in Haase is a long a, hence counted as two letters), but you'll find
"Wanvik" before "Waagan" (aa in Waagan is pronounced and alphabetized as
å). This has some funny consequences like that the very first name in
the phone book for Trondheim, Norway is "Aalaei", whereas the last name
you find in it is "Aavitsland".

So, my expectation is that there is some way to have a list of words
sorted like this:

Aalaei
ekstraarbeid
ekstrabetaling
Haase
Hatlen
Wanvik
Waagan
Aavitsland

The way these words are sorted currently with territory based collation
and Norwegian territory is:

ekstrabetaling      
ekstraarbeid        
Hatlen              
Haase               
Wanvik              
Waagan              
Aalaei              
Aavitsland          

I skimmed through the Unicode Collation Algorithm at
http://unicode.org/reports/tr10/ to find out how this were to be
handled. A paragraph under 3.1.1 Multiple Mappings said:

  Any character (such as soft hyphen) that is not completely ignorable
  between two characters of a contraction will cause them to sort as
  separate characters. Thus a soft hyphen can be used to separate and
  cause distinct weighting of sequences such as Slovak ch or Danish aa
  that would normally weight as units.

This sounds like what I need, and placing a soft hyphen between the a's
that I wanted to be interpreted as two single letters, did indeed give
me the sorting order I wanted.

However, even though the sorting seems to ignore the soft hyphens
(actually, it seems to ignore all kinds of punctuation characters),
string matching does not ignore them, so 'H_ase' does not match
'Ha<soft-hyphen>ase' with the LIKE predicate. Is this supposed to be
possible, that is, to let LIKE regard 'aa' (or 'a<some-special-char>a')
as two separate yet consecutive letters?

-- 
Knut Anders

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Thanks, I have not written the like tests yet, and am looking for 
examples like the following where the result under the default
system is different under collation vs default that can be added
to the junit tests, but have to admit I don't know much about
languages other than english.  I think I need unicode escape for non-ascii
characters so that I can get them into a java program using my
terminal.

I currently test english, polish, and norway but only have some
simple ordering examples so far, but the test is set up to
easily add other territories if anyone is interested.

Knut Anders Hatlen wrote:
> Mamta Satoor <ms...@gmail.com> writes:
> 
> 
>>Knut, only LIKE comparison can be done on LONG VARCHAR and CLOB data types.
> 
> 
> Thanks, Mamta. And LIKE comparisons are indeed affected by collation:
> 
> ij> connect 'jdbc:derby:db;create=true;collation=TERRITORY_BASED;territory=no_NO';
> ij> create table clobs (c clob(10));
> 0 rows inserted/updated/deleted
> ij> insert into clobs values 'Waagan';
> 1 row inserted/updated/deleted
> ij> select * from clobs where c like 'W_gan';
> C         
> ----------
> Waagan    
> 
> 1 row selected
> 
> Cool! :)
>

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.

Mamta Satoor <ms...@gmail.com> writes:

> Knut, only LIKE comparison can be done on LONG VARCHAR and CLOB data types.

Thanks, Mamta. And LIKE comparisons are indeed affected by collation:

ij> connect 'jdbc:derby:db;create=true;collation=TERRITORY_BASED;territory=no_NO';
ij> create table clobs (c clob(10));
0 rows inserted/updated/deleted
ij> insert into clobs values 'Waagan';
1 row inserted/updated/deleted
ij> select * from clobs where c like 'W_gan';
C         
----------
Waagan    

1 row selected

Cool! :)

-- 
Knut Anders

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Mamta Satoor <ms...@gmail.com>.

Knut, only LIKE comparison can be done on LONG VARCHAR and CLOB data types.

Mamta


On 5/16/07, Knut Anders Hatlen <Kn...@sun.com> wrote:
>
> Mike Matrigali <mi...@sbcglobal.net> writes:
>
> >> We should have a complete list of the data types that are impacted by
> >> collation.
> >> CHAR
> >> VARCHAR
> >> CLOB ?
> > I believe it is
> > CHAR
> > VARCHAR
> > LONG VARCHAR
> > CLOB
>
> How are LONG VARCHAR and CLOB affected by collation? It seems like Derby
> doesn't support comparisons involving these data types. For instance,
> when trying to order by a CLOB column, I see this error:
>
> ERROR X0X67: Columns of type 'CLOB' may not be used in CREATE INDEX,
> ORDER BY, GROUP BY, UNION, INTERSECT, EXCEPT or DISTINCT statements
> because comparisons are not supported for that type.
>
> --
> Knut Anders
>

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.

Mike Matrigali <mi...@sbcglobal.net> writes:

>> We should have a complete list of the data types that are impacted by
>> collation.
>> CHAR
>> VARCHAR
>> CLOB ?
> I believe it is
> CHAR
> VARCHAR
> LONG VARCHAR
> CLOB

How are LONG VARCHAR and CLOB affected by collation? It seems like Derby
doesn't support comparisons involving these data types. For instance,
when trying to order by a CLOB column, I see this error:

ERROR X0X67: Columns of type 'CLOB' may not be used in CREATE INDEX,
ORDER BY, GROUP BY, UNION, INTERSECT, EXCEPT or DISTINCT statements
because comparisons are not supported for that type.

-- 
Knut Anders

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Mike Matrigali <mi...@sbcglobal.net>.


Laura Stewart wrote:
> As part of adding the new attribute collation=TERRITORY_BASED, I think
> that we need to describe how Derby handles collation.
> 
> I am trying to get my head around the best way to describe collation
> in Derby... for 10.3
> 
> In general terms, a collating sequence is a defined ordering for
> character data that determines whether a particular character sorts
> higher, lower, or the same as another character.  Each character set
> will also have a default collation.
I would also not use character set.  I would approach documenting it 
based on the behavior of datatypes rather than talk about character 
sets.  So CHAR, VARCHAR, LONG VARCHAR and CLOB comparison/ordering/like 
processing is affected.

> 
> In Derby, it is my understanding that all of our string data types are
> represented as Unicode sequences.  Is that correct?
I believe the documentation should only speak to the datatypes rather
than the underlying storage structure.  To understand current 
implementation all operations on character types use either String or
java char in memory to perform operations.  JDBC defines how one inputs
data into the datatypes and retrieves data from the datatypes.
> 
> We should have a complete list of the data types that are impacted by
> collation.
> CHAR
> VARCHAR
> CLOB ?
I believe it is
CHAR
VARCHAR
LONG VARCHAR
CLOB
> 
> Does Derby support the national character datatypes such as 
> NCHAR/NVARCHAR2?
No.
> 
> FYI - there is a feeling among some in the Internet community that the
> term "character set" is not appropriate.  They tout character code,
> character encoding, or character repertoire.
> 
> Does Derby support specifying codes?  Is that what the attribute
> territory=l_CCI (example territory=es_MX) does?
> 
> Is there a complete listing of the territories that are supported...
> maybe in a Java spec?
Hopefully mamta can expand here.  I hope that we can define our support
in terms of the standard interfaces we are using from java to perform
the ordering if a database has been defined to order based on it's
territory.

I don't believe 10.3 will change the territories supported, it is the 
same set as 10.2 (basically we support what java supports).  10.3 just
allows collation to be based on territory, all other territory support
is unchanged.
> 
> When you create a database, can you specify that the
> default character set for CHAR columns be ASCII, and the character set
> used for NCHAR be UTF8?
No there is no such thing.  We are not specifying a character set.  You
specify a teritory, this is existing functionality in 10.2.  In 10.3 you
specify at database creation time if you want collation of all user 
character data to be determined by the territory or not.  In the current
implementation it does not change the storage format, but I don't think
that should be part of the documentation.

Do not get confused by what other databases may have to include in such
a change.  Derby has always used java String/char support which is 
unicode based, so no difference is needed to operate on non-ascii 
character data.  How Derby chooses to read/write those characters to
disk is even less important for user interface documentation and could
be changed in the future.  We happen to currently use a modified UTF8
scheme (modified to support very long strings), but that is never 
exposed to a user.
> 
> The Derby documentation mentions code sets, but only with relationship
> to import/export topics or ij sessions...
right.  The 10.3 functionality does not change any of this, it only 
affects the ordering within the server.  Different operating systems,
environments may operate on different codesets outside of derby - but
once the data has gotten in (through an import, ij, jdbc) then data
is treated same on all systems.  On exit (export, ij, jdbc) the data
may then get transformed to a native codeset.  None of this is affected
by the 10.3 collation changes.
> 
> Any insite that you can provide on this would be appreciated.
>

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Laura Stewart <sc...@gmail.com>.

A couple of followup issues:

So I create a database with territory=en_US and collation=TERRITORY_BASED.

We say that that collation is based on the territory (in this case
english from the United States).  How does the user find out what that
means?  Is there some standard (Java or otherwise) that we can point
them to?

If the territory attbribute is set, but not the collation attribute,
we have indicated that the collation is the default Unicode codepoint
collation (UCS_BASIC). I googled "Unicode codepoint collation" and the
primary reference is to XML When I googled UCS_BASIC the references
are to Derby. Seems that we are using a non standard term here... The
unicode web site uses "Unicode Collation Algorithm"
http://unicode.org/reports/tr10/
Section 1.9 says
"1.9 The Unicode Collation Algorithm
The Unicode Collation Algorithm (UCA) provides a specification for how
to compare two Unicode strings while remaining conformant to the
requirements of The Unicode Standard. The UCA also supplies the
Default Unicode Collation Element Table (DUCET), which is data
specifying the default collation order for all Unicode characters.
This table is designed so that it can be tailored to meet the
requirements of different languages and customizations."

Do we support UCA?  Is it customized for Derby/Java?

It would be great to be able to use a well-defined term.

Is Derby's collation "well defined" by either the SQL and the Unicode standards?


-- 
Laura Stewart

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Laura Stewart <sc...@gmail.com>.

I've posted a html document (cdevcollation.html) in Derby-2377 that is
a DRAFT attempt to capture the collation support in Derby.  Please
review and help me understand what else should be said in this concept
topic.

-- 
Laura Stewart

Re: Another collation question - Derby-1478 and Derby-2377

Posted by Daniel John Debrunner <dj...@apache.org>.

Laura Stewart wrote:
> As part of adding the new attribute collation=TERRITORY_BASED, I think
> that we need to describe how Derby handles collation.
> 
> I am trying to get my head around the best way to describe collation
> in Derby... for 10.3
> 
> In general terms, a collating sequence is a defined ordering for
> character data that determines whether a particular character sorts
> higher, lower, or the same as another character.  Each character set
> will also have a default collation.
> 
> In Derby, it is my understanding that all of our string data types are
> represented as Unicode sequences.  Is that correct?

Yes, and I think that fact should be documented, at least in the ref 
guide section for each string data type, CHAR, VARCHAR, LONG VARCHAR and 
CLOB. I think also specifically that it is Unicode 2.0 which is the 
version supported by Java 1.1. I don't think any changes have been made 
to support later versions of Unicode, and I don't know what changes, if 
any, would be required.

Dan.