You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-dev@db.apache.org by Rick Hillegas <Ri...@Sun.COM> on 2008/08/22 19:16:55 UTC

territory-based collations and optimizations for the LIKE operator

According to the Developer's Guide section titled "Character-based 
collation in Derby", the LIKE operator behaves differently than the = 
operator. That is, if you are using a territory-based collation, two 
strings might = one another but not be LIKE one another (and vice-versa).

In the meantime, according to the section in the Tuning Guide titled 
"Character string beginning with constant", LIKE expressions which begin 
with a constant (followed by a wildcard) can be transformed into 
indexable expressions involving the ">=" and "<" operators.

To my muddled understanding, it seems that one of the following occurs:

1) The LIKE optimizations are disabled if you are using territory-based 
collation

or

2) a LIKE expression may return different results depending on your 
indexes and/or the presence of trailing wildcards in your LIKE expression.

(1) seems like a performance issue which needs to be documented. (2) 
seems like a correctness problem. What is the behavior that we expect?

Thanks,
-Rick

Re: territory-based collations and optimizations for the LIKE operator

Posted by Mamta Satoor <ms...@gmail.com>.
Rick, let me start out by answering your 1st question about LIKE
optimization. Yes, we do disable LIKE optimization if we are using
territory based collation. That behavior is the jira entry (DERBY-1478
 disable dynamic like optimization for collated databases)

I will need to do little more reading before I can answer question 2).

Mamta

On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
> According to the Developer's Guide section titled "Character-based collation
> in Derby", the LIKE operator behaves differently than the = operator. That
> is, if you are using a territory-based collation, two strings might = one
> another but not be LIKE one another (and vice-versa).
>
> In the meantime, according to the section in the Tuning Guide titled
> "Character string beginning with constant", LIKE expressions which begin
> with a constant (followed by a wildcard) can be transformed into indexable
> expressions involving the ">=" and "<" operators.
>
> To my muddled understanding, it seems that one of the following occurs:
>
> 1) The LIKE optimizations are disabled if you are using territory-based
> collation
>
> or
>
> 2) a LIKE expression may return different results depending on your indexes
> and/or the presence of trailing wildcards in your LIKE expression.
>
> (1) seems like a performance issue which needs to be documented. (2) seems
> like a correctness problem. What is the behavior that we expect?
>
> Thanks,
> -Rick
>

Re: territory-based collations and optimizations for the LIKE operator

Posted by Rick Hillegas <Ri...@Sun.COM>.
Knut Anders Hatlen wrote:
> Daniel John Debrunner <dj...@apache.org> writes:
>
>   
>> Rick Hillegas wrote:
>>
>>     
>>>> Rick, why do you believe that Derby will return a subset of the
>>>> rows, it doesn't seem to follow from any of the other posts in this
>>>> thread?
>>>>
>>>> Dan.
>>>>         
>>> Hi Dan,
>>>
>>> Thanks for helping me puzzle through this. Consider the example we
>>> are working with, a collation in which 'z' = 'xy'. What happens with
>>> the following insert:
>>>
>>>  insert into T(A) values ( 'zcb' ), ( 'xycb' )
>>>
>>> followed by this query:
>>>
>>>  select * from T where T.A like 'xy_b'
>>>
>>> If I  understand what is being said, it seems like the following
>>> happens: the ANSI rules return both rows but Derby returns only (
>>> xycb' )
>>>       
>> I thought Derby was following the ANSI rules for LIKE. From memory,
>> LIKE operators on a character by character basis, not collation
>> elements.
>>     
>
> Yes, I remember that we discussed this and came to that conclusion. See
> ISO/IEC 9075-2:2003 (E), Section 8.5 <like predicate>, general rule 3c.
>
>   
Hi Dan and Knut,

Thanks for that pointer. I'm tracking now.

Regards,
-Rick

Re: territory-based collations and optimizations for the LIKE operator

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.
Daniel John Debrunner <dj...@apache.org> writes:

> Rick Hillegas wrote:
>
>>> Rick, why do you believe that Derby will return a subset of the
>>> rows, it doesn't seem to follow from any of the other posts in this
>>> thread?
>>>
>>> Dan.
>> Hi Dan,
>>
>> Thanks for helping me puzzle through this. Consider the example we
>> are working with, a collation in which 'z' = 'xy'. What happens with
>> the following insert:
>>
>>  insert into T(A) values ( 'zcb' ), ( 'xycb' )
>>
>> followed by this query:
>>
>>  select * from T where T.A like 'xy_b'
>>
>> If I  understand what is being said, it seems like the following
>> happens: the ANSI rules return both rows but Derby returns only (
>> xycb' )
>
> I thought Derby was following the ANSI rules for LIKE. From memory,
> LIKE operators on a character by character basis, not collation
> elements.

Yes, I remember that we discussed this and came to that conclusion. See
ISO/IEC 9075-2:2003 (E), Section 8.5 <like predicate>, general rule 3c.

-- 
Knut Anders

Re: territory-based collations and optimizations for the LIKE operator

Posted by Daniel John Debrunner <dj...@apache.org>.
Rick Hillegas wrote:

>> Rick, why do you believe that Derby will return a subset of the rows, 
>> it doesn't seem to follow from any of the other posts in this thread?
>>
>> Dan.
> Hi Dan,
> 
> Thanks for helping me puzzle through this. Consider the example we are 
> working with, a collation in which 'z' = 'xy'. What happens with the 
> following insert:
> 
>  insert into T(A) values ( 'zcb' ), ( 'xycb' )
> 
> followed by this query:
> 
>  select * from T where T.A like 'xy_b'
> 
> If I  understand what is being said, it seems like the following 
> happens: the ANSI rules return both rows but Derby returns only ( 'xycb' )

I thought Derby was following the ANSI rules for LIKE. From memory, LIKE 
operators on a character by character basis, not collation elements.

Dan.

Re: territory-based collations and optimizations for the LIKE operator

Posted by Rick Hillegas <Ri...@Sun.COM>.
Daniel John Debrunner wrote:
> Mamta Satoor wrote:
>
>>  > On 8/25/08, Rick Hillegas <Ri...@sun.com> wrote:
>
> >> 2) The LIKE operator has Derby-specific semantics. The Derby-specifics
> >> semantics return a subset of the rows which qualify under the ANSI 
> rules.
>
>>
>> For 2), we have jira entry DERBY-2793
>
> DERBY-2793 is not about incorrect rows being returned is it, it's 
> about following the rules for when LIKE can be used with different 
> collations?
>
> Rick, why do you believe that Derby will return a subset of the rows, 
> it doesn't seem to follow from any of the other posts in this thread?
>
> Dan.
Hi Dan,

Thanks for helping me puzzle through this. Consider the example we are 
working with, a collation in which 'z' = 'xy'. What happens with the 
following insert:

  insert into T(A) values ( 'zcb' ), ( 'xycb' )

followed by this query:

  select * from T where T.A like 'xy_b'

If I  understand what is being said, it seems like the following 
happens: the ANSI rules return both rows but Derby returns only ( 'xycb' )

Thanks,
-Rick

Re: territory-based collations and optimizations for the LIKE operator

Posted by Daniel John Debrunner <dj...@apache.org>.
Mamta Satoor wrote:

>  > On 8/25/08, Rick Hillegas <Ri...@sun.com> wrote:

 >> 2) The LIKE operator has Derby-specific semantics. The Derby-specifics
 >> semantics return a subset of the rows which qualify under the ANSI 
rules.

> 
> For 2), we have jira entry DERBY-2793

DERBY-2793 is not about incorrect rows being returned is it, it's about 
following the rules for when LIKE can be used with different collations?

Rick, why do you believe that Derby will return a subset of the rows, it 
doesn't seem to follow from any of the other posts in this thread?

Dan.

Re: territory-based collations and optimizations for the LIKE operator

Posted by Rick Hillegas <Ri...@Sun.COM>.
Thanks, Mamta. I'm having a hard time seeing these issues addressed by 
the descriptions of DERBY-1478 and DERBY-2793. Do you think it might 
make sense to log a new issue and link it to these old ones?

Thanks,
-Rick

Mamta Satoor wrote:
> Yes, that is correct.
>
> For 1), the disabling of optimizaiton for LIKE for a territory based
> database has jira entry DERBY-1478 for it.
>
> For 2), we have jira entry DERBY-2793
>
> Mamta
>
>
> On 8/25/08, Rick Hillegas <Ri...@sun.com> wrote:
>   
>> Thanks, Mamta. This is very helpful. Based on your responses, I think that
>> the following summarizes how LIKE behaves in databases with territory based
>> collations:
>>
>> 1) The LIKE optimizations are disabled. For example, given the following
>> query where A is an indexed string column
>>
>>  select * from T where T.A like 'foo%'
>>
>> Derby cannot  use the index on A and falls back on performing a full table
>> scan.
>>
>> 2) The LIKE operator has Derby-specific semantics. The Derby-specifics
>> semantics return a subset of the rows which qualify under the ANSI rules.
>>
>> Does that sound correct to you?
>>
>> Thanks,
>> -Rick
>>
>>
>> Mamta Satoor wrote:
>>     
>>> Rick, the behavior we expect for LIKE and = in a territory based
>>> database can be explained by the following comment that I picked up
>>> from DERBY-3166
>>>
>>>
>>>       
>> **********************************************************
>>     
>>> The following applies to territory based database
>>> For LIKE, when comparing a pattern against a value string, we do the
>>> comparison of collation elements(s) for one character at a time for
>>> non-metacharacters. This is different than what is done for =
>>> operation. For =, we compare the collation elements for the entire
>>> string on left hand with the collation elements of the entire string
>>> on the right hand side.
>>> For eg say we are working with a territory where character 'z' has
>>> same collation elements as 'xy'. For such a territory consider 2
>>> clauses in WHERE clause
>>> 1)'zcb' = 'xycb'
>>> 2)'zcb' LIKE 'xy_b'
>>> For case 1), we will return TRUE because the collation elements for
>>> the entire string 'zcb' will match the collation elements of the
>>> entire string 'xycb'.
>>> For case 2) though, we will return FALSE because collation element(s)
>>> for character 'z' does not match the collation element(s) for
>>> character 'x'. So, as can be seen, the LIKE operation is one character
>>> at a time whereas = operation is the entire string at a time.
>>> In addition, the metacharacter _ in pattern for LIKE will consume
>>> *one* character in the string value. So for an eg clause 'xycb' LIKE
>>> '_cb' will return FALSE because metacharacter _ will consume 'x' and
>>> since 'c' does not match 'y', we will return FALSE.
>>>
>>>       
>> **********************************************************
>>     
>>> Is this the information you were looking for?
>>>
>>> thanks,
>>> Mamta
>>>
>>> On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
>>>
>>>
>>>       
>>>> According to the Developer's Guide section titled "Character-based
>>>>         
>> collation
>>     
>>>> in Derby", the LIKE operator behaves differently than the = operator.
>>>>         
>> That
>>     
>>>> is, if you are using a territory-based collation, two strings might =
>>>>         
>> one
>>     
>>>> another but not be LIKE one another (and vice-versa).
>>>>
>>>> In the meantime, according to the section in the Tuning Guide titled
>>>> "Character string beginning with constant", LIKE expressions which begin
>>>> with a constant (followed by a wildcard) can be transformed into
>>>>         
>> indexable
>>     
>>>> expressions involving the ">=" and "<" operators.
>>>>
>>>> To my muddled understanding, it seems that one of the following occurs:
>>>>
>>>> 1) The LIKE optimizations are disabled if you are using territory-based
>>>> collation
>>>>
>>>> or
>>>>
>>>> 2) a LIKE expression may return different results depending on your
>>>>         
>> indexes
>>     
>>>> and/or the presence of trailing wildcards in your LIKE expression.
>>>>
>>>> (1) seems like a performance issue which needs to be documented. (2)
>>>>         
>> seems
>>     
>>>> like a correctness problem. What is the behavior that we expect?
>>>>
>>>> Thanks,
>>>> -Rick
>>>>
>>>>
>>>>
>>>>         
>>     


Re: territory-based collations and optimizations for the LIKE operator

Posted by Mamta Satoor <ms...@gmail.com>.
Yes, that is correct.

For 1), the disabling of optimizaiton for LIKE for a territory based
database has jira entry DERBY-1478 for it.

For 2), we have jira entry DERBY-2793

Mamta


On 8/25/08, Rick Hillegas <Ri...@sun.com> wrote:
> Thanks, Mamta. This is very helpful. Based on your responses, I think that
> the following summarizes how LIKE behaves in databases with territory based
> collations:
>
> 1) The LIKE optimizations are disabled. For example, given the following
> query where A is an indexed string column
>
>  select * from T where T.A like 'foo%'
>
> Derby cannot  use the index on A and falls back on performing a full table
> scan.
>
> 2) The LIKE operator has Derby-specific semantics. The Derby-specifics
> semantics return a subset of the rows which qualify under the ANSI rules.
>
> Does that sound correct to you?
>
> Thanks,
> -Rick
>
>
> Mamta Satoor wrote:
> > Rick, the behavior we expect for LIKE and = in a territory based
> > database can be explained by the following comment that I picked up
> > from DERBY-3166
> >
> >
> **********************************************************
> > The following applies to territory based database
> > For LIKE, when comparing a pattern against a value string, we do the
> > comparison of collation elements(s) for one character at a time for
> > non-metacharacters. This is different than what is done for =
> > operation. For =, we compare the collation elements for the entire
> > string on left hand with the collation elements of the entire string
> > on the right hand side.
> > For eg say we are working with a territory where character 'z' has
> > same collation elements as 'xy'. For such a territory consider 2
> > clauses in WHERE clause
> > 1)'zcb' = 'xycb'
> > 2)'zcb' LIKE 'xy_b'
> > For case 1), we will return TRUE because the collation elements for
> > the entire string 'zcb' will match the collation elements of the
> > entire string 'xycb'.
> > For case 2) though, we will return FALSE because collation element(s)
> > for character 'z' does not match the collation element(s) for
> > character 'x'. So, as can be seen, the LIKE operation is one character
> > at a time whereas = operation is the entire string at a time.
> > In addition, the metacharacter _ in pattern for LIKE will consume
> > *one* character in the string value. So for an eg clause 'xycb' LIKE
> > '_cb' will return FALSE because metacharacter _ will consume 'x' and
> > since 'c' does not match 'y', we will return FALSE.
> >
> **********************************************************
> >
> > Is this the information you were looking for?
> >
> > thanks,
> > Mamta
> >
> > On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
> >
> >
> > > According to the Developer's Guide section titled "Character-based
> collation
> > > in Derby", the LIKE operator behaves differently than the = operator.
> That
> > > is, if you are using a territory-based collation, two strings might =
> one
> > > another but not be LIKE one another (and vice-versa).
> > >
> > > In the meantime, according to the section in the Tuning Guide titled
> > > "Character string beginning with constant", LIKE expressions which begin
> > > with a constant (followed by a wildcard) can be transformed into
> indexable
> > > expressions involving the ">=" and "<" operators.
> > >
> > > To my muddled understanding, it seems that one of the following occurs:
> > >
> > > 1) The LIKE optimizations are disabled if you are using territory-based
> > > collation
> > >
> > > or
> > >
> > > 2) a LIKE expression may return different results depending on your
> indexes
> > > and/or the presence of trailing wildcards in your LIKE expression.
> > >
> > > (1) seems like a performance issue which needs to be documented. (2)
> seems
> > > like a correctness problem. What is the behavior that we expect?
> > >
> > > Thanks,
> > > -Rick
> > >
> > >
> > >
> >
>
>

Re: territory-based collations and optimizations for the LIKE operator

Posted by Rick Hillegas <Ri...@Sun.COM>.
Thanks, Mamta. This is very helpful. Based on your responses, I think 
that the following summarizes how LIKE behaves in databases with 
territory based collations:

1) The LIKE optimizations are disabled. For example, given the following 
query where A is an indexed string column

   select * from T where T.A like 'foo%'

Derby cannot  use the index on A and falls back on performing a full 
table scan.

2) The LIKE operator has Derby-specific semantics. The Derby-specifics 
semantics return a subset of the rows which qualify under the ANSI rules.

Does that sound correct to you?

Thanks,
-Rick

Mamta Satoor wrote:
> Rick, the behavior we expect for LIKE and = in a territory based
> database can be explained by the following comment that I picked up
> from DERBY-3166
>
> **********************************************************
> The following applies to territory based database
> For LIKE, when comparing a pattern against a value string, we do the
> comparison of collation elements(s) for one character at a time for
> non-metacharacters. This is different than what is done for =
> operation. For =, we compare the collation elements for the entire
> string on left hand with the collation elements of the entire string
> on the right hand side.
> For eg say we are working with a territory where character 'z' has
> same collation elements as 'xy'. For such a territory consider 2
> clauses in WHERE clause
> 1)'zcb' = 'xycb'
> 2)'zcb' LIKE 'xy_b'
> For case 1), we will return TRUE because the collation elements for
> the entire string 'zcb' will match the collation elements of the
> entire string 'xycb'.
> For case 2) though, we will return FALSE because collation element(s)
> for character 'z' does not match the collation element(s) for
> character 'x'. So, as can be seen, the LIKE operation is one character
> at a time whereas = operation is the entire string at a time.
> In addition, the metacharacter _ in pattern for LIKE will consume
> *one* character in the string value. So for an eg clause 'xycb' LIKE
> '_cb' will return FALSE because metacharacter _ will consume 'x' and
> since 'c' does not match 'y', we will return FALSE.
> **********************************************************
>
> Is this the information you were looking for?
>
> thanks,
> Mamta
>
> On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
>   
>> According to the Developer's Guide section titled "Character-based collation
>> in Derby", the LIKE operator behaves differently than the = operator. That
>> is, if you are using a territory-based collation, two strings might = one
>> another but not be LIKE one another (and vice-versa).
>>
>> In the meantime, according to the section in the Tuning Guide titled
>> "Character string beginning with constant", LIKE expressions which begin
>> with a constant (followed by a wildcard) can be transformed into indexable
>> expressions involving the ">=" and "<" operators.
>>
>> To my muddled understanding, it seems that one of the following occurs:
>>
>> 1) The LIKE optimizations are disabled if you are using territory-based
>> collation
>>
>> or
>>
>> 2) a LIKE expression may return different results depending on your indexes
>> and/or the presence of trailing wildcards in your LIKE expression.
>>
>> (1) seems like a performance issue which needs to be documented. (2) seems
>> like a correctness problem. What is the behavior that we expect?
>>
>> Thanks,
>> -Rick
>>
>>     


Re: territory-based collations and optimizations for the LIKE operator

Posted by Mamta Satoor <ms...@gmail.com>.
Rick, the behavior we expect for LIKE and = in a territory based
database can be explained by the following comment that I picked up
from DERBY-3166

**********************************************************
The following applies to territory based database
For LIKE, when comparing a pattern against a value string, we do the
comparison of collation elements(s) for one character at a time for
non-metacharacters. This is different than what is done for =
operation. For =, we compare the collation elements for the entire
string on left hand with the collation elements of the entire string
on the right hand side.
For eg say we are working with a territory where character 'z' has
same collation elements as 'xy'. For such a territory consider 2
clauses in WHERE clause
1)'zcb' = 'xycb'
2)'zcb' LIKE 'xy_b'
For case 1), we will return TRUE because the collation elements for
the entire string 'zcb' will match the collation elements of the
entire string 'xycb'.
For case 2) though, we will return FALSE because collation element(s)
for character 'z' does not match the collation element(s) for
character 'x'. So, as can be seen, the LIKE operation is one character
at a time whereas = operation is the entire string at a time.
In addition, the metacharacter _ in pattern for LIKE will consume
*one* character in the string value. So for an eg clause 'xycb' LIKE
'_cb' will return FALSE because metacharacter _ will consume 'x' and
since 'c' does not match 'y', we will return FALSE.
**********************************************************

Is this the information you were looking for?

thanks,
Mamta

On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
> According to the Developer's Guide section titled "Character-based collation
> in Derby", the LIKE operator behaves differently than the = operator. That
> is, if you are using a territory-based collation, two strings might = one
> another but not be LIKE one another (and vice-versa).
>
> In the meantime, according to the section in the Tuning Guide titled
> "Character string beginning with constant", LIKE expressions which begin
> with a constant (followed by a wildcard) can be transformed into indexable
> expressions involving the ">=" and "<" operators.
>
> To my muddled understanding, it seems that one of the following occurs:
>
> 1) The LIKE optimizations are disabled if you are using territory-based
> collation
>
> or
>
> 2) a LIKE expression may return different results depending on your indexes
> and/or the presence of trailing wildcards in your LIKE expression.
>
> (1) seems like a performance issue which needs to be documented. (2) seems
> like a correctness problem. What is the behavior that we expect?
>
> Thanks,
> -Rick
>