You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-dev@db.apache.org by "Rick Hillegas (JIRA)" <ji...@apache.org> on 2008/08/26 20:12:44 UTC

[jira] Created: (DERBY-3854) Implement LIKE transformations and optimizations for databases using territory-based collations

Implement LIKE transformations and optimizations for databases using territory-based collations
-----------------------------------------------------------------------------------------------

                 Key: DERBY-3854
                 URL: https://issues.apache.org/jira/browse/DERBY-3854
             Project: Derby
          Issue Type: Improvement
    Affects Versions: 10.4.1.3, 10.3.3.0, 10.5.0.0
            Reporter: Rick Hillegas


The LIKE transformations and optimizations are disabled when using a database with a territory-based collation. See the following email thread: http://www.nabble.com/territory-based-collations-and-optimizations-for-the-LIKE-operator-td19111725.html#a19111725 That thread, in turn, refers to DERBY-1478. It would be nice if we did not have to perform full table scans for LIKE queries in databases with territory-based collations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (DERBY-3854) Implement LIKE transformations and optimizations for databases using territory-based collations

Posted by "Daniel John Debrunner (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DERBY-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627048#action_12627048 ] 

Daniel John Debrunner commented on DERBY-3854:
----------------------------------------------

> (1) all the characters before the wildcard map to exactly one collation element,

Need to add 'each' at the end, right, just to be clear?

Though I maybe having trouble understanding what this is trying to say exactly. All single characters will map to a single collation element by themselves, so I think it's really trying to say that none of the sequence of characters in the prefix combine into a single collation element.???

> col < 'abcde\uFFFF'

\uFFFF is incorrect since it is not a valid character and the numeric codepoint of the character is not what the index is using for ordering. You really need

  col < 'abcdf'

where 'f' is the character determined from a collation element value than is greater than the one for 'e'


> Implement LIKE transformations and optimizations for databases using territory-based collations
> -----------------------------------------------------------------------------------------------
>
>                 Key: DERBY-3854
>                 URL: https://issues.apache.org/jira/browse/DERBY-3854
>             Project: Derby
>          Issue Type: Improvement
>    Affects Versions: 10.3.3.0, 10.4.1.3, 10.5.0.0
>            Reporter: Rick Hillegas
>
> The LIKE transformations and optimizations are disabled when using a database with a territory-based collation. See the following email thread: http://www.nabble.com/territory-based-collations-and-optimizations-for-the-LIKE-operator-td19111725.html#a19111725 That thread, in turn, refers to DERBY-1478. It would be nice if we did not have to perform full table scans for LIKE queries in databases with territory-based collations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (DERBY-3854) Implement LIKE transformations and optimizations for databases using territory-based collations

Posted by "Kathey Marsden (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/DERBY-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kathey Marsden updated DERBY-3854:
----------------------------------

    Component/s: SQL

> Implement LIKE transformations and optimizations for databases using territory-based collations
> -----------------------------------------------------------------------------------------------
>
>                 Key: DERBY-3854
>                 URL: https://issues.apache.org/jira/browse/DERBY-3854
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 10.3.3.0, 10.4.1.3, 10.5.0.0
>            Reporter: Rick Hillegas
>
> The LIKE transformations and optimizations are disabled when using a database with a territory-based collation. See the following email thread: http://www.nabble.com/territory-based-collations-and-optimizations-for-the-LIKE-operator-td19111725.html#a19111725 That thread, in turn, refers to DERBY-1478. It would be nice if we did not have to perform full table scans for LIKE queries in databases with territory-based collations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (DERBY-3854) Implement LIKE transformations and optimizations for databases using territory-based collations

Posted by "Knut Anders Hatlen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DERBY-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627464#action_12627464 ] 

Knut Anders Hatlen commented on DERBY-3854:
-------------------------------------------

>> (1) all the characters before the wildcard map to exactly one
>> collation element,
>
> Need to add 'each' at the end, right, just to be clear?

Right.

> Though I maybe having trouble understanding what this is trying to
> say exactly. All single characters will map to a single collation
> element by themselves, so I think it's really trying to say that
> none of the sequence of characters in the prefix combine into a
> single collation element.???

Yes, I think I messed up the terminology here. Actually, a single
character can map to a sequence of collation elements, for instance
the character 'ä' maps to two collation elements in German locale
(mentioned in the class javadoc for CollationElementIterator). But I
think you're right that we only need special handling for the
multi-character sequences that combine into a different sequence of
collation elements than what you get by concatenating the collation
elements for each single character in the sequence.

So to use the terminology from RuleBasedCollator, the problematic
sequences are a subset of those text-arguments whose length is greater
than 1.

Take for instance this fragment of the return value from getRules()
with German (de_DE) collation:

   r,R<s, S & SS,ß<t,T& TH, Þ &TH, þ <u,U

In the rules above, we have two multi-character text-arguments: SS and
TH. None of them are problematic, however, since the collation
elements for 'SS' are the same as the collation elements for 'S'
concatenated with the collation elements for another 'S', and the
collation elements for 'TH' are the same as for 'T' followed by 'H'.

Here's another fragment from Norwegian (no_NO) collation:

  õ, Õ < å, Å, aa , aA , Aa , AA  & V < w, W

All of the two-character sequences above (aa, aA, Aa, AA) are
problematic because each of them results in a single collation element
and therefore doesn't match the collation elements when each character
is considered on its own.

>> col < 'abcde\uFFFF'
>
> \uFFFF is incorrect since it is not a valid character and the
> numeric codepoint of the character is not what the index is using
> for ordering. You really need
>
>   col < 'abcdf'
>
> where 'f' is the character determined from a collation element value
> than is greater than the one for 'e'

Right. I don't think it's a problem that \uFFFF is an invalid
character, since that's the character that's used by the current LIKE
optimization. (Perhaps it's because it is an invalid character we can
get away with it when we have collation=UCS_BASIC? I would have
expected that we needed an infinitely long string containing \uFFFF
characters, not just a single character, to get a proper upper limit.)

But you're right that if the RuleBasedCollator has explicitly defined
an ordering for \uFFFF we cannot use it as an upper limit. I'm not
sure what the best way to find that upper limit is. For the characters
mentioned in the rules we get from getRules(), it should be trivial,
as we can get the full ordering when we parse the rules. But a large
number of the characters aren't mentioned in the rules. For instance,
the rules for de_DE don't contain any of the characters with accents,
as their ordering is determined by the decomposition mode of the
collator, nor do they contain letters from non-Latin alphabets.

> Implement LIKE transformations and optimizations for databases using territory-based collations
> -----------------------------------------------------------------------------------------------
>
>                 Key: DERBY-3854
>                 URL: https://issues.apache.org/jira/browse/DERBY-3854
>             Project: Derby
>          Issue Type: Improvement
>    Affects Versions: 10.3.3.0, 10.4.1.3, 10.5.0.0
>            Reporter: Rick Hillegas
>
> The LIKE transformations and optimizations are disabled when using a database with a territory-based collation. See the following email thread: http://www.nabble.com/territory-based-collations-and-optimizations-for-the-LIKE-operator-td19111725.html#a19111725 That thread, in turn, refers to DERBY-1478. It would be nice if we did not have to perform full table scans for LIKE queries in databases with territory-based collations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (DERBY-3854) Implement LIKE transformations and optimizations for databases using territory-based collations

Posted by "Knut Anders Hatlen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DERBY-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626986#action_12626986 ] 

Knut Anders Hatlen commented on DERBY-3854:
-------------------------------------------

Don't know if it'll work or if I use the correct terminology, but here
are some thoughts on how we could optimize LIKE when using collation:

The collators we use are always of type RuleBasedCollator, from which
we can receive the rules by calling getRules(). By looking at the
rules, we may be able to perform some optimizations. One example:

  col LIKE 'abcde%'

If we know, by looking at the rules, that (1) all the characters
before the wildcard map to exactly one collation element, and (2) that
none of them are ignorable, and (3) that the last characters up to the
wildcard cannot be the start of a sequence of characters mapping to
one collation element, we can perform the same optimization as we can
when we're not using collation. That is, we can add the predicate
'abcde' <= col < 'abcde\uFFFF' to limit the scan.

Whereas databases using collation=USC_BASIC can always use the prefix
up to the first wildcard character for optimizations like this, I
think databases with territory-based collation can only use the prefix
up to the first occurrence of one of

  a) A wildcard character

  b) An ignorable character

  c) A character that maps to a sequence of collation elements

  d) A sequence of characters mapping to one collation element

  e) A sequence of characters which could be the start of a sequence
  that maps to one collation element, directly followed by a wildcard

So with territory=no_NO, where a < z < aa, we'd optimize the
expression

  col LIKE 'data%'

by adding the predicate 'dat' <= col < 'dat\uFFFF' because the 'a'
right before the '%' could be the start of the sequence 'aa' which
maps to a single collation element (rule (e) above). (Or if we want to
be really sophisticated, I think we can use the predicate 'data' <=
col < 'dataa\uFFFF' in this case, but that would require a more
thorough analysis of the collator's rules.)

> Implement LIKE transformations and optimizations for databases using territory-based collations
> -----------------------------------------------------------------------------------------------
>
>                 Key: DERBY-3854
>                 URL: https://issues.apache.org/jira/browse/DERBY-3854
>             Project: Derby
>          Issue Type: Improvement
>    Affects Versions: 10.3.3.0, 10.4.1.3, 10.5.0.0
>            Reporter: Rick Hillegas
>
> The LIKE transformations and optimizations are disabled when using a database with a territory-based collation. See the following email thread: http://www.nabble.com/territory-based-collations-and-optimizations-for-the-LIKE-operator-td19111725.html#a19111725 That thread, in turn, refers to DERBY-1478. It would be nice if we did not have to perform full table scans for LIKE queries in databases with territory-based collations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (DERBY-3854) Implement LIKE transformations and optimizations for databases using territory-based collations

Posted by "Rick Hillegas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/DERBY-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627040#action_12627040 ] 

Rick Hillegas commented on DERBY-3854:
--------------------------------------

Hi Knut,

This looks like a promising approach. It certainly seems that we ought to be able to find a longest, well-behaved leading prefix of the LIKE constant--and then build bounds from that prefix. Your rules look good to me. That is, I haven't been able to come up with a problem case. That's a bit shy of proof, of course. And I am a bit muddled by the interaction of the Collator levers, strength and decomposition.

> Implement LIKE transformations and optimizations for databases using territory-based collations
> -----------------------------------------------------------------------------------------------
>
>                 Key: DERBY-3854
>                 URL: https://issues.apache.org/jira/browse/DERBY-3854
>             Project: Derby
>          Issue Type: Improvement
>    Affects Versions: 10.3.3.0, 10.4.1.3, 10.5.0.0
>            Reporter: Rick Hillegas
>
> The LIKE transformations and optimizations are disabled when using a database with a territory-based collation. See the following email thread: http://www.nabble.com/territory-based-collations-and-optimizations-for-the-LIKE-operator-td19111725.html#a19111725 That thread, in turn, refers to DERBY-1478. It would be nice if we did not have to perform full table scans for LIKE queries in databases with territory-based collations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.