You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-dev@db.apache.org by Rick Hillegas <Ri...@Sun.COM> on 2008/08/22 19:16:55 UTC
territory-based collations and optimizations for the LIKE operator
According to the Developer's Guide section titled "Character-based
collation in Derby", the LIKE operator behaves differently than the =
operator. That is, if you are using a territory-based collation, two
strings might = one another but not be LIKE one another (and vice-versa).
In the meantime, according to the section in the Tuning Guide titled
"Character string beginning with constant", LIKE expressions which begin
with a constant (followed by a wildcard) can be transformed into
indexable expressions involving the ">=" and "<" operators.
To my muddled understanding, it seems that one of the following occurs:
1) The LIKE optimizations are disabled if you are using territory-based
collation
or
2) a LIKE expression may return different results depending on your
indexes and/or the presence of trailing wildcards in your LIKE expression.
(1) seems like a performance issue which needs to be documented. (2)
seems like a correctness problem. What is the behavior that we expect?
Thanks,
-Rick
Re: territory-based collations and optimizations for the LIKE operator
Posted by Mamta Satoor <ms...@gmail.com>.
Rick, let me start out by answering your 1st question about LIKE
optimization. Yes, we do disable LIKE optimization if we are using
territory based collation. That behavior is the jira entry (DERBY-1478
disable dynamic like optimization for collated databases)
I will need to do little more reading before I can answer question 2).
Mamta
On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
> According to the Developer's Guide section titled "Character-based collation
> in Derby", the LIKE operator behaves differently than the = operator. That
> is, if you are using a territory-based collation, two strings might = one
> another but not be LIKE one another (and vice-versa).
>
> In the meantime, according to the section in the Tuning Guide titled
> "Character string beginning with constant", LIKE expressions which begin
> with a constant (followed by a wildcard) can be transformed into indexable
> expressions involving the ">=" and "<" operators.
>
> To my muddled understanding, it seems that one of the following occurs:
>
> 1) The LIKE optimizations are disabled if you are using territory-based
> collation
>
> or
>
> 2) a LIKE expression may return different results depending on your indexes
> and/or the presence of trailing wildcards in your LIKE expression.
>
> (1) seems like a performance issue which needs to be documented. (2) seems
> like a correctness problem. What is the behavior that we expect?
>
> Thanks,
> -Rick
>
Re: territory-based collations and optimizations for the LIKE operator
Posted by Rick Hillegas <Ri...@Sun.COM>.
Knut Anders Hatlen wrote:
> Daniel John Debrunner <dj...@apache.org> writes:
>
>
>> Rick Hillegas wrote:
>>
>>
>>>> Rick, why do you believe that Derby will return a subset of the
>>>> rows, it doesn't seem to follow from any of the other posts in this
>>>> thread?
>>>>
>>>> Dan.
>>>>
>>> Hi Dan,
>>>
>>> Thanks for helping me puzzle through this. Consider the example we
>>> are working with, a collation in which 'z' = 'xy'. What happens with
>>> the following insert:
>>>
>>> insert into T(A) values ( 'zcb' ), ( 'xycb' )
>>>
>>> followed by this query:
>>>
>>> select * from T where T.A like 'xy_b'
>>>
>>> If I understand what is being said, it seems like the following
>>> happens: the ANSI rules return both rows but Derby returns only (
>>> xycb' )
>>>
>> I thought Derby was following the ANSI rules for LIKE. From memory,
>> LIKE operators on a character by character basis, not collation
>> elements.
>>
>
> Yes, I remember that we discussed this and came to that conclusion. See
> ISO/IEC 9075-2:2003 (E), Section 8.5 <like predicate>, general rule 3c.
>
>
Hi Dan and Knut,
Thanks for that pointer. I'm tracking now.
Regards,
-Rick
Re: territory-based collations and optimizations for the LIKE operator
Posted by Knut Anders Hatlen <Kn...@Sun.COM>.
Daniel John Debrunner <dj...@apache.org> writes:
> Rick Hillegas wrote:
>
>>> Rick, why do you believe that Derby will return a subset of the
>>> rows, it doesn't seem to follow from any of the other posts in this
>>> thread?
>>>
>>> Dan.
>> Hi Dan,
>>
>> Thanks for helping me puzzle through this. Consider the example we
>> are working with, a collation in which 'z' = 'xy'. What happens with
>> the following insert:
>>
>> insert into T(A) values ( 'zcb' ), ( 'xycb' )
>>
>> followed by this query:
>>
>> select * from T where T.A like 'xy_b'
>>
>> If I understand what is being said, it seems like the following
>> happens: the ANSI rules return both rows but Derby returns only (
>> xycb' )
>
> I thought Derby was following the ANSI rules for LIKE. From memory,
> LIKE operators on a character by character basis, not collation
> elements.
Yes, I remember that we discussed this and came to that conclusion. See
ISO/IEC 9075-2:2003 (E), Section 8.5 <like predicate>, general rule 3c.
--
Knut Anders
Re: territory-based collations and optimizations for the LIKE operator
Posted by Daniel John Debrunner <dj...@apache.org>.
Rick Hillegas wrote:
>> Rick, why do you believe that Derby will return a subset of the rows,
>> it doesn't seem to follow from any of the other posts in this thread?
>>
>> Dan.
> Hi Dan,
>
> Thanks for helping me puzzle through this. Consider the example we are
> working with, a collation in which 'z' = 'xy'. What happens with the
> following insert:
>
> insert into T(A) values ( 'zcb' ), ( 'xycb' )
>
> followed by this query:
>
> select * from T where T.A like 'xy_b'
>
> If I understand what is being said, it seems like the following
> happens: the ANSI rules return both rows but Derby returns only ( 'xycb' )
I thought Derby was following the ANSI rules for LIKE. From memory, LIKE
operators on a character by character basis, not collation elements.
Dan.
Re: territory-based collations and optimizations for the LIKE operator
Posted by Rick Hillegas <Ri...@Sun.COM>.
Daniel John Debrunner wrote:
> Mamta Satoor wrote:
>
>> > On 8/25/08, Rick Hillegas <Ri...@sun.com> wrote:
>
> >> 2) The LIKE operator has Derby-specific semantics. The Derby-specifics
> >> semantics return a subset of the rows which qualify under the ANSI
> rules.
>
>>
>> For 2), we have jira entry DERBY-2793
>
> DERBY-2793 is not about incorrect rows being returned is it, it's
> about following the rules for when LIKE can be used with different
> collations?
>
> Rick, why do you believe that Derby will return a subset of the rows,
> it doesn't seem to follow from any of the other posts in this thread?
>
> Dan.
Hi Dan,
Thanks for helping me puzzle through this. Consider the example we are
working with, a collation in which 'z' = 'xy'. What happens with the
following insert:
insert into T(A) values ( 'zcb' ), ( 'xycb' )
followed by this query:
select * from T where T.A like 'xy_b'
If I understand what is being said, it seems like the following
happens: the ANSI rules return both rows but Derby returns only ( 'xycb' )
Thanks,
-Rick
Re: territory-based collations and optimizations for the LIKE operator
Posted by Daniel John Debrunner <dj...@apache.org>.
Mamta Satoor wrote:
> > On 8/25/08, Rick Hillegas <Ri...@sun.com> wrote:
>> 2) The LIKE operator has Derby-specific semantics. The Derby-specifics
>> semantics return a subset of the rows which qualify under the ANSI
rules.
>
> For 2), we have jira entry DERBY-2793
DERBY-2793 is not about incorrect rows being returned is it, it's about
following the rules for when LIKE can be used with different collations?
Rick, why do you believe that Derby will return a subset of the rows, it
doesn't seem to follow from any of the other posts in this thread?
Dan.
Re: territory-based collations and optimizations for the LIKE operator
Posted by Rick Hillegas <Ri...@Sun.COM>.
Thanks, Mamta. I'm having a hard time seeing these issues addressed by
the descriptions of DERBY-1478 and DERBY-2793. Do you think it might
make sense to log a new issue and link it to these old ones?
Thanks,
-Rick
Mamta Satoor wrote:
> Yes, that is correct.
>
> For 1), the disabling of optimizaiton for LIKE for a territory based
> database has jira entry DERBY-1478 for it.
>
> For 2), we have jira entry DERBY-2793
>
> Mamta
>
>
> On 8/25/08, Rick Hillegas <Ri...@sun.com> wrote:
>
>> Thanks, Mamta. This is very helpful. Based on your responses, I think that
>> the following summarizes how LIKE behaves in databases with territory based
>> collations:
>>
>> 1) The LIKE optimizations are disabled. For example, given the following
>> query where A is an indexed string column
>>
>> select * from T where T.A like 'foo%'
>>
>> Derby cannot use the index on A and falls back on performing a full table
>> scan.
>>
>> 2) The LIKE operator has Derby-specific semantics. The Derby-specifics
>> semantics return a subset of the rows which qualify under the ANSI rules.
>>
>> Does that sound correct to you?
>>
>> Thanks,
>> -Rick
>>
>>
>> Mamta Satoor wrote:
>>
>>> Rick, the behavior we expect for LIKE and = in a territory based
>>> database can be explained by the following comment that I picked up
>>> from DERBY-3166
>>>
>>>
>>>
>> **********************************************************
>>
>>> The following applies to territory based database
>>> For LIKE, when comparing a pattern against a value string, we do the
>>> comparison of collation elements(s) for one character at a time for
>>> non-metacharacters. This is different than what is done for =
>>> operation. For =, we compare the collation elements for the entire
>>> string on left hand with the collation elements of the entire string
>>> on the right hand side.
>>> For eg say we are working with a territory where character 'z' has
>>> same collation elements as 'xy'. For such a territory consider 2
>>> clauses in WHERE clause
>>> 1)'zcb' = 'xycb'
>>> 2)'zcb' LIKE 'xy_b'
>>> For case 1), we will return TRUE because the collation elements for
>>> the entire string 'zcb' will match the collation elements of the
>>> entire string 'xycb'.
>>> For case 2) though, we will return FALSE because collation element(s)
>>> for character 'z' does not match the collation element(s) for
>>> character 'x'. So, as can be seen, the LIKE operation is one character
>>> at a time whereas = operation is the entire string at a time.
>>> In addition, the metacharacter _ in pattern for LIKE will consume
>>> *one* character in the string value. So for an eg clause 'xycb' LIKE
>>> '_cb' will return FALSE because metacharacter _ will consume 'x' and
>>> since 'c' does not match 'y', we will return FALSE.
>>>
>>>
>> **********************************************************
>>
>>> Is this the information you were looking for?
>>>
>>> thanks,
>>> Mamta
>>>
>>> On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
>>>
>>>
>>>
>>>> According to the Developer's Guide section titled "Character-based
>>>>
>> collation
>>
>>>> in Derby", the LIKE operator behaves differently than the = operator.
>>>>
>> That
>>
>>>> is, if you are using a territory-based collation, two strings might =
>>>>
>> one
>>
>>>> another but not be LIKE one another (and vice-versa).
>>>>
>>>> In the meantime, according to the section in the Tuning Guide titled
>>>> "Character string beginning with constant", LIKE expressions which begin
>>>> with a constant (followed by a wildcard) can be transformed into
>>>>
>> indexable
>>
>>>> expressions involving the ">=" and "<" operators.
>>>>
>>>> To my muddled understanding, it seems that one of the following occurs:
>>>>
>>>> 1) The LIKE optimizations are disabled if you are using territory-based
>>>> collation
>>>>
>>>> or
>>>>
>>>> 2) a LIKE expression may return different results depending on your
>>>>
>> indexes
>>
>>>> and/or the presence of trailing wildcards in your LIKE expression.
>>>>
>>>> (1) seems like a performance issue which needs to be documented. (2)
>>>>
>> seems
>>
>>>> like a correctness problem. What is the behavior that we expect?
>>>>
>>>> Thanks,
>>>> -Rick
>>>>
>>>>
>>>>
>>>>
>>
Re: territory-based collations and optimizations for the LIKE operator
Posted by Mamta Satoor <ms...@gmail.com>.
Yes, that is correct.
For 1), the disabling of optimizaiton for LIKE for a territory based
database has jira entry DERBY-1478 for it.
For 2), we have jira entry DERBY-2793
Mamta
On 8/25/08, Rick Hillegas <Ri...@sun.com> wrote:
> Thanks, Mamta. This is very helpful. Based on your responses, I think that
> the following summarizes how LIKE behaves in databases with territory based
> collations:
>
> 1) The LIKE optimizations are disabled. For example, given the following
> query where A is an indexed string column
>
> select * from T where T.A like 'foo%'
>
> Derby cannot use the index on A and falls back on performing a full table
> scan.
>
> 2) The LIKE operator has Derby-specific semantics. The Derby-specifics
> semantics return a subset of the rows which qualify under the ANSI rules.
>
> Does that sound correct to you?
>
> Thanks,
> -Rick
>
>
> Mamta Satoor wrote:
> > Rick, the behavior we expect for LIKE and = in a territory based
> > database can be explained by the following comment that I picked up
> > from DERBY-3166
> >
> >
> **********************************************************
> > The following applies to territory based database
> > For LIKE, when comparing a pattern against a value string, we do the
> > comparison of collation elements(s) for one character at a time for
> > non-metacharacters. This is different than what is done for =
> > operation. For =, we compare the collation elements for the entire
> > string on left hand with the collation elements of the entire string
> > on the right hand side.
> > For eg say we are working with a territory where character 'z' has
> > same collation elements as 'xy'. For such a territory consider 2
> > clauses in WHERE clause
> > 1)'zcb' = 'xycb'
> > 2)'zcb' LIKE 'xy_b'
> > For case 1), we will return TRUE because the collation elements for
> > the entire string 'zcb' will match the collation elements of the
> > entire string 'xycb'.
> > For case 2) though, we will return FALSE because collation element(s)
> > for character 'z' does not match the collation element(s) for
> > character 'x'. So, as can be seen, the LIKE operation is one character
> > at a time whereas = operation is the entire string at a time.
> > In addition, the metacharacter _ in pattern for LIKE will consume
> > *one* character in the string value. So for an eg clause 'xycb' LIKE
> > '_cb' will return FALSE because metacharacter _ will consume 'x' and
> > since 'c' does not match 'y', we will return FALSE.
> >
> **********************************************************
> >
> > Is this the information you were looking for?
> >
> > thanks,
> > Mamta
> >
> > On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
> >
> >
> > > According to the Developer's Guide section titled "Character-based
> collation
> > > in Derby", the LIKE operator behaves differently than the = operator.
> That
> > > is, if you are using a territory-based collation, two strings might =
> one
> > > another but not be LIKE one another (and vice-versa).
> > >
> > > In the meantime, according to the section in the Tuning Guide titled
> > > "Character string beginning with constant", LIKE expressions which begin
> > > with a constant (followed by a wildcard) can be transformed into
> indexable
> > > expressions involving the ">=" and "<" operators.
> > >
> > > To my muddled understanding, it seems that one of the following occurs:
> > >
> > > 1) The LIKE optimizations are disabled if you are using territory-based
> > > collation
> > >
> > > or
> > >
> > > 2) a LIKE expression may return different results depending on your
> indexes
> > > and/or the presence of trailing wildcards in your LIKE expression.
> > >
> > > (1) seems like a performance issue which needs to be documented. (2)
> seems
> > > like a correctness problem. What is the behavior that we expect?
> > >
> > > Thanks,
> > > -Rick
> > >
> > >
> > >
> >
>
>
Re: territory-based collations and optimizations for the LIKE operator
Posted by Rick Hillegas <Ri...@Sun.COM>.
Thanks, Mamta. This is very helpful. Based on your responses, I think
that the following summarizes how LIKE behaves in databases with
territory based collations:
1) The LIKE optimizations are disabled. For example, given the following
query where A is an indexed string column
select * from T where T.A like 'foo%'
Derby cannot use the index on A and falls back on performing a full
table scan.
2) The LIKE operator has Derby-specific semantics. The Derby-specifics
semantics return a subset of the rows which qualify under the ANSI rules.
Does that sound correct to you?
Thanks,
-Rick
Mamta Satoor wrote:
> Rick, the behavior we expect for LIKE and = in a territory based
> database can be explained by the following comment that I picked up
> from DERBY-3166
>
> **********************************************************
> The following applies to territory based database
> For LIKE, when comparing a pattern against a value string, we do the
> comparison of collation elements(s) for one character at a time for
> non-metacharacters. This is different than what is done for =
> operation. For =, we compare the collation elements for the entire
> string on left hand with the collation elements of the entire string
> on the right hand side.
> For eg say we are working with a territory where character 'z' has
> same collation elements as 'xy'. For such a territory consider 2
> clauses in WHERE clause
> 1)'zcb' = 'xycb'
> 2)'zcb' LIKE 'xy_b'
> For case 1), we will return TRUE because the collation elements for
> the entire string 'zcb' will match the collation elements of the
> entire string 'xycb'.
> For case 2) though, we will return FALSE because collation element(s)
> for character 'z' does not match the collation element(s) for
> character 'x'. So, as can be seen, the LIKE operation is one character
> at a time whereas = operation is the entire string at a time.
> In addition, the metacharacter _ in pattern for LIKE will consume
> *one* character in the string value. So for an eg clause 'xycb' LIKE
> '_cb' will return FALSE because metacharacter _ will consume 'x' and
> since 'c' does not match 'y', we will return FALSE.
> **********************************************************
>
> Is this the information you were looking for?
>
> thanks,
> Mamta
>
> On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
>
>> According to the Developer's Guide section titled "Character-based collation
>> in Derby", the LIKE operator behaves differently than the = operator. That
>> is, if you are using a territory-based collation, two strings might = one
>> another but not be LIKE one another (and vice-versa).
>>
>> In the meantime, according to the section in the Tuning Guide titled
>> "Character string beginning with constant", LIKE expressions which begin
>> with a constant (followed by a wildcard) can be transformed into indexable
>> expressions involving the ">=" and "<" operators.
>>
>> To my muddled understanding, it seems that one of the following occurs:
>>
>> 1) The LIKE optimizations are disabled if you are using territory-based
>> collation
>>
>> or
>>
>> 2) a LIKE expression may return different results depending on your indexes
>> and/or the presence of trailing wildcards in your LIKE expression.
>>
>> (1) seems like a performance issue which needs to be documented. (2) seems
>> like a correctness problem. What is the behavior that we expect?
>>
>> Thanks,
>> -Rick
>>
>>
Re: territory-based collations and optimizations for the LIKE operator
Posted by Mamta Satoor <ms...@gmail.com>.
Rick, the behavior we expect for LIKE and = in a territory based
database can be explained by the following comment that I picked up
from DERBY-3166
**********************************************************
The following applies to territory based database
For LIKE, when comparing a pattern against a value string, we do the
comparison of collation elements(s) for one character at a time for
non-metacharacters. This is different than what is done for =
operation. For =, we compare the collation elements for the entire
string on left hand with the collation elements of the entire string
on the right hand side.
For eg say we are working with a territory where character 'z' has
same collation elements as 'xy'. For such a territory consider 2
clauses in WHERE clause
1)'zcb' = 'xycb'
2)'zcb' LIKE 'xy_b'
For case 1), we will return TRUE because the collation elements for
the entire string 'zcb' will match the collation elements of the
entire string 'xycb'.
For case 2) though, we will return FALSE because collation element(s)
for character 'z' does not match the collation element(s) for
character 'x'. So, as can be seen, the LIKE operation is one character
at a time whereas = operation is the entire string at a time.
In addition, the metacharacter _ in pattern for LIKE will consume
*one* character in the string value. So for an eg clause 'xycb' LIKE
'_cb' will return FALSE because metacharacter _ will consume 'x' and
since 'c' does not match 'y', we will return FALSE.
**********************************************************
Is this the information you were looking for?
thanks,
Mamta
On 8/22/08, Rick Hillegas <Ri...@sun.com> wrote:
> According to the Developer's Guide section titled "Character-based collation
> in Derby", the LIKE operator behaves differently than the = operator. That
> is, if you are using a territory-based collation, two strings might = one
> another but not be LIKE one another (and vice-versa).
>
> In the meantime, according to the section in the Tuning Guide titled
> "Character string beginning with constant", LIKE expressions which begin
> with a constant (followed by a wildcard) can be transformed into indexable
> expressions involving the ">=" and "<" operators.
>
> To my muddled understanding, it seems that one of the following occurs:
>
> 1) The LIKE optimizations are disabled if you are using territory-based
> collation
>
> or
>
> 2) a LIKE expression may return different results depending on your indexes
> and/or the presence of trailing wildcards in your LIKE expression.
>
> (1) seems like a performance issue which needs to be documented. (2) seems
> like a correctness problem. What is the behavior that we expect?
>
> Thanks,
> -Rick
>