You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-dev@db.apache.org by Mamta Satoor <ms...@gmail.com> on 2007/10/09 18:40:53 UTC

Territory based database: proposed change to current LIKE behavior vs = for character string types

Hi,

I am currently working on DERBY-2967. This issue involves determining the
behavior of LIKE operation for character string types in a territory based
database. I plan on implementing the findings of DERBY-2967 but wanted to
provide a high-level behavior description here to see if anyone has any
feedback on the proposed behavior.

Unicode has a concept of Contraction where a user might perceive more than
one character as a single character in a given language. One eg of this
would be 'AA' in Norwegian locale. Although, this Contraction is made of
2 Unicode characters, 'A' and 'A', a Norwegian user perceives them as
a single character. In addition, in Norwegian, the collation elements for
'AA' are identical to collation elements for 'Å'. So, the question is what
should SQL operation 'AA' like 'Å'  return? Also, is that behavior same
as SQL operation 'AA' = 'Å' ?

Say there is a table t1 with 2 rows in it.

Now taking an eg based on table t1 for LIKE
select * from t1 where 'AA' like 'Å'
The query above will return no rows because as per SQL standard, LIKE does
not treat Contractions as one single character. Instead, we work with one
Unicode character at a time in LIKE. So, for the given eg, the first Unicdoe
character in 'AA' which is 'A' does not match with 'Å' and hence no rows
would be returned.

Now taking an eg based on table t1 for =
select * from t1 where 'AA' = 'Å'
The query above would return 2 rows because as per SQL standard, in =
operation, we take the entire character string and use the locale to
generate the collation elements for that string. And then we compare the
collation elements for the 2 sides of the = operation. Since the 'AA' and
'Å' both result into identical collation elements, the where clause ends up
being TRUE and hence we return 2 rows.

Expanding further on LIKE operation, the metacharacter _ will match exactly
one Unicode character(and not Contractions). For eg select * from t1 where
'AA' like '_' will return 0 rows because _ consumes only the first A in 'AA'
and we still have one A left which is not consumed by the pattern. In order
for the WHERE clause to be true, the SQL should look like select * from t1
where 'AA' like '__'. Same rule applies for metacharacter %. It will treat
'AA' in LIKE as 2 distinct characters. For eg, select * from t1 where 'AAB'
like '%AB' will return 2 rows because % consumed only the first A (and not
the Contraction AA). Another eg select * from t1 where 'BAAC' like 'BA%C'
will return 2 rows because % consumed the 2nd A.

I hope this mail helps understand the behavior of LIKE in a territory based
database. I plan to work on implementing this behavior in next few days but
please let me know if anyone has any questions that I can help clarify.

thanks,
Mamta

Re: Territory based database: proposed change to current LIKE behavior vs = for character string types

Posted by Daniel John Debrunner <dj...@apache.org>.

Bernt M. Johnsen wrote:

> Does anyone know what MySQL, Oracle, SQL Server etc does with this? If
> the other major databases are in agreement, we could follow their
> interpretation if we find it reasonable.

MySQL has that LIKE and = can return different results.

See comments in DERBY-2967 and

http://dev.mysql.com/doc/refman/4.1/en/string-comparison-functions.html

Dan.

Re: Territory based database: proposed change to current LIKE behavior vs = for character string types

Posted by Mamta Satoor <ms...@gmail.com>.

My quick look at other products' documentation has not given me any
specifics on their behavior on this issue. I will concentrate my efforts on
coming up with a patch which is based on SQL standards. But please feel free
to have this discussion going.

Mamta


On 10/10/07, Mamta Satoor <ms...@gmail.com> wrote:
>
> Dan posted his findings on MySQL behavior on DERBY-2967. The MySQL
> Reference Manaul (MySQL reference: http://dev.mysql.com/doc/refman/4.1/en/string-comparison-functions.html
> ), towards the beginning of the page has following blurb "Per the SQL
> standard, LIKE performs matching on a per-character basis, thus it can
> produce results different from the = comparison operator: "
>
> I do not know what is Oracle's and SQL Server's behavior here. I will try
> to look at their documentation.
>
> Mamta
>
>  On 10/10/07, Bernt M. Johnsen <Be...@sun.com> wrote:
> >
> > >>>>>>>>>>>> Mamta Satoor wrote (2007-10-09 09:40:53):
> > > [...]
> > > Unicode has a concept of Contraction where a user might perceive more
> > than
> > > one character as a single character in a given language. One eg of
> > this
> > > would be 'AA' in Norwegian locale. Although, this Contraction is made
> > of
> > > 2 Unicode characters, 'A' and 'A', a Norwegian user perceives them as
> > > a single character.
> >
> > No. Noone in Norway would perceive 'aa' as a single character.
> >
> > > In addition, in Norwegian, the collation elements for 'AA' are
> > > identical to collation elements for 'Å'. So, the question is what
> > > should SQL operation 'AA' like 'Å' return? Also, is that behavior
> > > same as SQL operation 'AA' = 'Å' ?
> >
> > 'AA' and 'Å' will in Norwegian have the same collation weight, but the
> > way the SQL standard is phrased, that does not apply to LIKE which
> > does an character by character match of the strings. I admit that
> >
> > 'aa'= 'å'
> >
> > and
> >
> > 'aa' LIKE 'å'
> >
> > giving different results may seem a bit odd but I am not able to
> > interpret Sections 8.2 and 8.5 differently (On the other hand, that's
> > not the only place the SQL spec is contraintuitive for the average
> > programmer).
> >
> > On the other hand, searching and matching are handled in Unicode T10
> > sections 1.5 and 8 (SELECT is explicitely mentioned in 1.5). It seems
> > that Unicode is in agreement with how I interpret the semantics for
> > comparison operators but not in agreement with how I interpret the
> > sematics for the LIKE predicate.
> >
> > Does anyone know what MySQL, Oracle, SQL Server etc does with this? If
> > the other major databases are in agreement, we could follow their
> > interpretation if we find it reasonable.
> > --
> > Bernt Marius Johnsen, Database Technology Group,
> > Staff Engineer, Derby/Java DB
> > Sun Microsystems, Trondheim, Norway
> >
> >
>

Re: Territory based database: proposed change to current LIKE behavior vs = for character string types

Posted by Mamta Satoor <ms...@gmail.com>.

Dan posted his findings on MySQL behavior on DERBY-2967. The MySQL Reference
Manaul (MySQL reference:
http://dev.mysql.com/doc/refman/4.1/en/string-comparison-functions.html ),
towards the beginning of the page has following blurb "Per the SQL standard,
LIKE performs matching on a per-character basis, thus it can produce results
different from the = comparison operator: "

I do not know what is Oracle's and SQL Server's behavior here. I will try to
look at their documentation.

Mamta

On 10/10/07, Bernt M. Johnsen <Be...@sun.com> wrote:
>
> >>>>>>>>>>>> Mamta Satoor wrote (2007-10-09 09:40:53):
> > [...]
> > Unicode has a concept of Contraction where a user might perceive more
> than
> > one character as a single character in a given language. One eg of this
> > would be 'AA' in Norwegian locale. Although, this Contraction is made of
> > 2 Unicode characters, 'A' and 'A', a Norwegian user perceives them as
> > a single character.
>
> No. Noone in Norway would perceive 'aa' as a single character.
>
> > In addition, in Norwegian, the collation elements for 'AA' are
> > identical to collation elements for 'Å'. So, the question is what
> > should SQL operation 'AA' like 'Å' return? Also, is that behavior
> > same as SQL operation 'AA' = 'Å' ?
>
> 'AA' and 'Å' will in Norwegian have the same collation weight, but the
> way the SQL standard is phrased, that does not apply to LIKE which
> does an character by character match of the strings. I admit that
>
> 'aa'= 'å'
>
> and
>
> 'aa' LIKE 'å'
>
> giving different results may seem a bit odd but I am not able to
> interpret Sections 8.2 and 8.5 differently (On the other hand, that's
> not the only place the SQL spec is contraintuitive for the average
> programmer).
>
> On the other hand, searching and matching are handled in Unicode T10
> sections 1.5 and 8 (SELECT is explicitely mentioned in 1.5). It seems
> that Unicode is in agreement with how I interpret the semantics for
> comparison operators but not in agreement with how I interpret the
> sematics for the LIKE predicate.
>
> Does anyone know what MySQL, Oracle, SQL Server etc does with this? If
> the other major databases are in agreement, we could follow their
> interpretation if we find it reasonable.
> --
> Bernt Marius Johnsen, Database Technology Group,
> Staff Engineer, Derby/Java DB
> Sun Microsystems, Trondheim, Norway
>
>

Re: Territory based database: proposed change to current LIKE behavior vs = for character string types

Posted by "Bernt M. Johnsen" <Be...@Sun.COM>.

>>>>>>>>>>>> Mamta Satoor wrote (2007-10-09 09:40:53):
> [...]
> Unicode has a concept of Contraction where a user might perceive more than
> one character as a single character in a given language. One eg of this
> would be 'AA' in Norwegian locale. Although, this Contraction is made of
> 2 Unicode characters, 'A' and 'A', a Norwegian user perceives them as
> a single character.

No. Noone in Norway would perceive 'aa' as a single character.

> In addition, in Norwegian, the collation elements for 'AA' are
> identical to collation elements for 'Å'. So, the question is what
> should SQL operation 'AA' like 'Å' return? Also, is that behavior
> same as SQL operation 'AA' = 'Å' ?

'AA' and 'Å' will in Norwegian have the same collation weight, but the
way the SQL standard is phrased, that does not apply to LIKE which
does an character by character match of the strings. I admit that 

'aa'= 'å' 

and 

'aa' LIKE 'å' 

giving different results may seem a bit odd but I am not able to
interpret Sections 8.2 and 8.5 differently (On the other hand, that's
not the only place the SQL spec is contraintuitive for the average
programmer). 

On the other hand, searching and matching are handled in Unicode T10
sections 1.5 and 8 (SELECT is explicitely mentioned in 1.5). It seems
that Unicode is in agreement with how I interpret the semantics for
comparison operators but not in agreement with how I interpret the
sematics for the LIKE predicate.

Does anyone know what MySQL, Oracle, SQL Server etc does with this? If
the other major databases are in agreement, we could follow their
interpretation if we find it reasonable.
-- 
Bernt Marius Johnsen, Database Technology Group, 
Staff Engineer, Derby/Java DB
Sun Microsystems, Trondheim, Norway

Re: Territory based database: proposed change to current LIKE behavior vs = for character string types

Posted by Mamta Satoor <ms...@gmail.com>.

I should mention that the actual contents of the table t1 does not matter
here because my examples use character string constants in the WHERE clause
and not the columns from the table t1. The only role table t1 has is if the
2 rows from it get selected or not depending on what the WHERE clause
returns.

Mamta


On 10/9/07, Mamta Satoor <ms...@gmail.com> wrote:
>
> Hi,
>
> I am currently working on DERBY-2967. This issue involves determining the
> behavior of LIKE operation for character string types in a territory based
> database. I plan on implementing the findings of DERBY-2967 but wanted to
> provide a high-level behavior description here to see if anyone has any
> feedback on the proposed behavior.
>
> Unicode has a concept of Contraction where a user might perceive more than
> one character as a single character in a given language. One eg of this
> would be 'AA' in Norwegian locale. Although, this Contraction is made of
> 2 Unicode characters, 'A' and 'A', a Norwegian user perceives them as
> a single character. In addition, in Norwegian, the collation elements for
> 'AA' are identical to collation elements for 'Å'. So, the question is what
> should SQL operation 'AA' like 'Å'  return? Also, is that behavior same
> as SQL operation 'AA' = 'Å' ?
>
> Say there is a table t1 with 2 rows in it.
>
> Now taking an eg based on table t1 for LIKE
> select * from t1 where 'AA' like 'Å'
> The query above will return no rows because as per SQL standard, LIKE does
> not treat Contractions as one single character. Instead, we work with one
> Unicode character at a time in LIKE. So, for the given eg, the first Unicdoe
> character in 'AA' which is 'A' does not match with 'Å' and hence no rows
> would be returned.
>
> Now taking an eg based on table t1 for =
> select * from t1 where 'AA' = 'Å'
> The query above would return 2 rows because as per SQL standard, in =
> operation, we take the entire character string and use the locale to
> generate the collation elements for that string. And then we compare the
> collation elements for the 2 sides of the = operation. Since the 'AA' and
> 'Å' both result into identical collation elements, the where clause ends up
> being TRUE and hence we return 2 rows.
>
> Expanding further on LIKE operation, the metacharacter _ will match
> exactly one Unicode character(and not Contractions). For eg select * from t1
> where 'AA' like '_' will return 0 rows because _ consumes only the first A
> in 'AA' and we still have one A left which is not consumed by the pattern.
> In order for the WHERE clause to be true, the SQL should look like select *
> from t1 where 'AA' like '__'. Same rule applies for metacharacter %. It will
> treat 'AA' in LIKE as 2 distinct characters. For eg, select * from t1 where
> 'AAB' like '%AB' will return 2 rows because % consumed only the first A (and
> not the Contraction AA). Another eg select * from t1 where 'BAAC' like
> 'BA%C' will return 2 rows because % consumed the 2nd A.
>
> I hope this mail helps understand the behavior of LIKE in a territory
> based database. I plan to work on implementing this behavior in next few
> days but please let me know if anyone has any questions that I can help
> clarify.
>
> thanks,
> Mamta
>