You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Hyrum K Wright <hy...@hyrumwright.org> on 2011/05/16 03:13:50 UTC

SQLite and the LIKE operator

Several places in wc_db we use the following pattern to select all
nodes with a common tree ancestor:
 WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3 ESCAPE '#')

While this works, there was some concern about whether or not SQLite
was using the proper indicies when executing this query.  By examining
the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
statements, I believe it does use the indicies as intended.

However, I stumbled across an alternate implementation which I believe
has some merit.  Instead of the above clause, we could use:
 WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2

This also avoids a table scan by making use of the indicies, but has
the advantage of not having to compute a separate parameter for the
LIKE clause in C.  It returns the same results, and has the benefit of
being a bit more clear to SQLite what we're trying to accomplish.  I'm
tempted to switch our code to using this new format, but wanted some
comments first.  I have not yet run extensive timing or other analysis
on the performance.

Thoughts?

-Hyrum

Re: SQLite and the LIKE operator

Posted by Julian Foad <ju...@wandisco.com>.
Branko Čibej wrote:
> On 16.05.2011 11:17, Hyrum K Wright wrote:
> > On Mon, May 16, 2011 at 8:28 AM, Bert Huijben <be...@qqmail.nl> wrote:
> >>> 2011/5/16 Branko Čibej <br...@e-reka.si>:
> >>>> On 16.05.2011 03:13, Hyrum K Wright wrote:
> >>>>> Several places in wc_db we use the following pattern to select all
> >>>>> nodes with a common tree ancestor:
> >>>>>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3
> >>> ESCAPE '#')
> >>>>> While this works, there was some concern about whether or not SQLite
> >>>>> was using the proper indicies when executing this query.  By examining
> >>>>> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
> >>>>> statements, I believe it does use the indicies as intended.
> >>>>>
> >>>>> However, I stumbled across an alternate implementation which I believe
> >>>>> has some merit.  Instead of the above clause, we could use:
> >>>>>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2
> >> This also needs a table scan, as SQLite can't look through the substr to find that it can use the index for the result.
> > The SQLite query analyzer states that this executes a SEARCH, not a
> > SCAN, which indicates the use of the index.
> 
> It should be able to use the index to do prefix matching, yes. It's not
> inconceivable that "like foo%' would trigger a prefix match, too -- but
> figuring that out is likely a bit more work that guessing right with a
> prefix substring.

See <http://www.sqlite.org/optoverview.html#like_opt>.

- Julian



Re: SQLite and the LIKE operator

Posted by Branko Čibej <br...@e-reka.si>.
On 16.05.2011 11:17, Hyrum K Wright wrote:
> On Mon, May 16, 2011 at 8:28 AM, Bert Huijben <be...@qqmail.nl> wrote:
>>
>>> -----Original Message-----
>>> From: Hyrum K Wright [mailto:hyrum@hyrumwright.org]
>>> Sent: maandag 16 mei 2011 9:39
>>> To: Branko Čibej
>>> Cc: dev@subversion.apache.org
>>> Subject: Re: SQLite and the LIKE operator
>>>
>>> 2011/5/16 Branko Čibej <br...@e-reka.si>:
>>>> On 16.05.2011 03:13, Hyrum K Wright wrote:
>>>>> Several places in wc_db we use the following pattern to select all
>>>>> nodes with a common tree ancestor:
>>>>>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3
>>> ESCAPE '#')
>>>>> While this works, there was some concern about whether or not SQLite
>>>>> was using the proper indicies when executing this query.  By examining
>>>>> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
>>>>> statements, I believe it does use the indicies as intended.
>>>>>
>>>>> However, I stumbled across an alternate implementation which I believe
>>>>> has some merit.  Instead of the above clause, we could use:
>>>>>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2
>> This also needs a table scan, as SQLite can't look through the substr to find that it can use the index for the result.
> The SQLite query analyzer states that this executes a SEARCH, not a
> SCAN, which indicates the use of the index.

It should be able to use the index to do prefix matching, yes. It's not
inconceivable that "like foo%' would trigger a prefix match, too -- but
figuring that out is likely a bit more work that guessing right with a
prefix substring.

-- Brane


RE: SQLite and the LIKE operator

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Hyrum K Wright [mailto:hyrum@hyrumwright.org]
> Sent: maandag 16 mei 2011 11:18
> To: Bert Huijben
> Cc: Branko Čibej; dev@subversion.apache.org
> Subject: Re: SQLite and the LIKE operator
> 
> On Mon, May 16, 2011 at 8:28 AM, Bert Huijben <be...@qqmail.nl> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Hyrum K Wright [mailto:hyrum@hyrumwright.org]
> >> Sent: maandag 16 mei 2011 9:39
> >> To: Branko Čibej
> >> Cc: dev@subversion.apache.org
> >> Subject: Re: SQLite and the LIKE operator
> >>
> >> 2011/5/16 Branko Čibej <br...@e-reka.si>:
> >> > On 16.05.2011 03:13, Hyrum K Wright wrote:
> >> >> Several places in wc_db we use the following pattern to select all
> >> >> nodes with a common tree ancestor:
> >> >>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3
> >> ESCAPE '#')
> >> >>
> >> >> While this works, there was some concern about whether or not
> SQLite
> >> >> was using the proper indicies when executing this query.  By examining
> >> >> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
> >> >> statements, I believe it does use the indicies as intended.
> >> >>
> >> >> However, I stumbled across an alternate implementation which I
> believe
> >> >> has some merit.  Instead of the above clause, we could use:
> >> >>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2
> >
> > This also needs a table scan, as SQLite can't look through the substr to find
> that it can use the index for the result.
> 
> The SQLite query analyzer states that this executes a SEARCH, not a
> SCAN, which indicates the use of the index.
> 
> > My guess is that
> > WHERE wc_id = ?1 AND ((local_relpath = ?2) OR (local_relpath > ?2 || '/'
> AND local_relpath < ?2 || '0'))
> > is most likely the most efficient form we can get in SQLite as the constant
> string directly map to an index, but we should really create a few tests to
> verify these guesses.
> 
> I haven't done any tests, either, but I'm interested in an expression
> which doesn't require us to construct an additional parameter to the
> SQL query in C.
> 
> > I'm going to the Elego office now. See you there? :)

Okay, the numbers are in... I wrote a test on a DB with more than 1 million nodes.
1024 directories, with 1024 files. 1024 files added to the root dir and before I started there was the normal greek tree, so that is in the same db

-- STMT_SELECT_LIKE_ESCAPED
 SELECT local_relpath FROM NODES
 WHERE wc_id=?1 AND op_depth=0
   AND ((local_relpath=?2)
        OR (local_relpath LIKE ?3 ESCAPE '#'))

-- STMT_SELECT_LIKE_NO_ESCAPE
 SELECT local_relpath FROM NODES
 WHERE wc_id=?1 AND op_depth=0
   AND ((local_relpath=?2)
        OR (local_relpath LIKE ?3))

-- STMT_SELECT_SUBSTR
SELECT local_relpath FROM NODES
 WHERE wc_id=?1 AND op_depth=0
   AND ((local_relpath=?2)
        OR (SUBSTR(local_relpath, 1, LENGTH(?2)+1) = ?2 || '/'))

-- STMT_SELECT_BETWEEN
SELECT local_relpath FROM NODES
 WHERE wc_id=?1 AND op_depth=0
   AND ((local_relpath=?2)
        OR (local_relpath > ?2 || '/' AND local_relpath < ?2 || '0'))

-- STMT_SELECT_BETWEEN2
SELECT local_relpath FROM NODES
 WHERE wc_id=?1 AND op_depth=0
   AND ((local_relpath=?2)
        OR (local_relpath > ?3 AND local_relpath < ?4))

All the queries returned the same 1025 answers when I looked for all nodes at and below "300".

I ran the tests 100 times to remove some jitter and the results are (in microseconds, on my Windows 7 notebook, SQLite 3.7.5):
STMT_SELECT_LIKE_ESCAPED:  743252
STMT_SELECT_LIKE_NO_ESCAPE: 683169
STMT_SELECT_SUBSTR: 885900
STMT_SELECT_BETWEEN: 555341
STMT_SELECT_BETWEEN2: 557631

So the select with > and < is the fastest query style. (Not surprising as SQLite in some cases optimizes like to this).

Generating the compare keys in SQLite is slightly faster than using static c strings. So creating them in cwould be even slower.

I added the rough patch I used to create these results to this mail. (I don't intend to commit it in this style)

	Bert
> 
> Yes. :)
> 
> -Hyrum

Re: SQLite and the LIKE operator

Posted by Hyrum K Wright <hy...@hyrumwright.org>.
On Mon, May 16, 2011 at 8:28 AM, Bert Huijben <be...@qqmail.nl> wrote:
>
>
>> -----Original Message-----
>> From: Hyrum K Wright [mailto:hyrum@hyrumwright.org]
>> Sent: maandag 16 mei 2011 9:39
>> To: Branko Čibej
>> Cc: dev@subversion.apache.org
>> Subject: Re: SQLite and the LIKE operator
>>
>> 2011/5/16 Branko Čibej <br...@e-reka.si>:
>> > On 16.05.2011 03:13, Hyrum K Wright wrote:
>> >> Several places in wc_db we use the following pattern to select all
>> >> nodes with a common tree ancestor:
>> >>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3
>> ESCAPE '#')
>> >>
>> >> While this works, there was some concern about whether or not SQLite
>> >> was using the proper indicies when executing this query.  By examining
>> >> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
>> >> statements, I believe it does use the indicies as intended.
>> >>
>> >> However, I stumbled across an alternate implementation which I believe
>> >> has some merit.  Instead of the above clause, we could use:
>> >>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2
>
> This also needs a table scan, as SQLite can't look through the substr to find that it can use the index for the result.

The SQLite query analyzer states that this executes a SEARCH, not a
SCAN, which indicates the use of the index.

> My guess is that
> WHERE wc_id = ?1 AND ((local_relpath = ?2) OR (local_relpath > ?2 || '/' AND local_relpath < ?2 || '0'))
> is most likely the most efficient form we can get in SQLite as the constant string directly map to an index, but we should really create a few tests to verify these guesses.

I haven't done any tests, either, but I'm interested in an expression
which doesn't require us to construct an additional parameter to the
SQL query in C.

> I'm going to the Elego office now. See you there? :)

Yes. :)

-Hyrum

RE: SQLite and the LIKE operator

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Hyrum K Wright [mailto:hyrum@hyrumwright.org]
> Sent: maandag 16 mei 2011 9:39
> To: Branko Čibej
> Cc: dev@subversion.apache.org
> Subject: Re: SQLite and the LIKE operator
> 
> 2011/5/16 Branko Čibej <br...@e-reka.si>:
> > On 16.05.2011 03:13, Hyrum K Wright wrote:
> >> Several places in wc_db we use the following pattern to select all
> >> nodes with a common tree ancestor:
> >>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3
> ESCAPE '#')
> >>
> >> While this works, there was some concern about whether or not SQLite
> >> was using the proper indicies when executing this query.  By examining
> >> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
> >> statements, I believe it does use the indicies as intended.
> >>
> >> However, I stumbled across an alternate implementation which I believe
> >> has some merit.  Instead of the above clause, we could use:
> >>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2

This also needs a table scan, as SQLite can't look through the substr to find that it can use the index for the result.

My guess is that
WHERE wc_id = ?1 AND ((local_relpath = ?2) OR (local_relpath > ?2 || '/' AND local_relpath < ?2 || '0'))
is most likely the most efficient form we can get in SQLite as the constant string directly map to an index, but we should really create a few tests to verify these guesses.

I'm going to the Elego office now. See you there? :)

	Bert


RE: SQLite and the LIKE operator

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Branko Čibej [mailto:brane@xbc.nu] On Behalf Of Branko Cibej
> Sent: maandag 16 mei 2011 9:46
> To: dev@subversion.apache.org
> Subject: Re: SQLite and the LIKE operator
> 
> On 16.05.2011 09:38, Hyrum K Wright wrote:
> > 2011/5/16 Branko Čibej <br...@e-reka.si>:
> >> On 16.05.2011 03:13, Hyrum K Wright wrote:
> >>> Several places in wc_db we use the following pattern to select all
> >>> nodes with a common tree ancestor:
> >>>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3
> ESCAPE '#')
> >>>
> >>> While this works, there was some concern about whether or not SQLite
> >>> was using the proper indicies when executing this query.  By examining
> >>> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
> >>> statements, I believe it does use the indicies as intended.
> >>>
> >>> However, I stumbled across an alternate implementation which I believe
> >>> has some merit.  Instead of the above clause, we could use:
> >>>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2
> >>>
> >>> This also avoids a table scan by making use of the indicies, but has
> >>> the advantage of not having to compute a separate parameter for the
> >>> LIKE clause in C.  It returns the same results, and has the benefit of
> >>> being a bit more clear to SQLite what we're trying to accomplish.  I'm
> >>> tempted to switch our code to using this new format, but wanted some
> >>> comments first.  I have not yet run extensive timing or other analysis
> >>> on the performance.
> >>>
> >>> Thoughts?
> >>>
> >>> -Hyrum
> >> Can't be right. I'm assuming the first query works correctly iff:
> >>    ?2 = foo
> >>    ?3 = foo/%
> >>
> >> and returns 'foo' and all its subtree.
> >>
> >> The second query can't return the same results; if ?2=foo, it'll match
> >> foobar, which is not foo's child; if ?2=foo/, it won't return foo.
> > That's what I get for writing mail at 3am.
> >
> > I believe the following would fix this:
> >  WHERE wc_id = ?1 AND (local_relpath = ?2 OR substr(local_relpath, 1,
> > length(?2 + 1)) = ?2 || '/')
> >
> > -Hyrum
> 
> That query used to be:
> 
>     local_relpath=?2 OR local_relpath LIKE ?2 || '/%'
> 
> but, for obvious reasons, that was a potential bug since literal % were
> not escaped. Your latest proposal is broken, but I'm sure you'll find
> the bug eventually. :)
> 
> Whether substr can be faster than LIKE -- I have no idea.

I think you were talking about the increment problem, but there is another problem. Local_relpath "" would give '' || '/', which is the invalid relpath '/'.

	Bert


Re: SQLite and the LIKE operator

Posted by Branko Čibej <br...@e-reka.si>.
On 16.05.2011 09:38, Hyrum K Wright wrote:
> 2011/5/16 Branko Čibej <br...@e-reka.si>:
>> On 16.05.2011 03:13, Hyrum K Wright wrote:
>>> Several places in wc_db we use the following pattern to select all
>>> nodes with a common tree ancestor:
>>>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3 ESCAPE '#')
>>>
>>> While this works, there was some concern about whether or not SQLite
>>> was using the proper indicies when executing this query.  By examining
>>> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
>>> statements, I believe it does use the indicies as intended.
>>>
>>> However, I stumbled across an alternate implementation which I believe
>>> has some merit.  Instead of the above clause, we could use:
>>>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2
>>>
>>> This also avoids a table scan by making use of the indicies, but has
>>> the advantage of not having to compute a separate parameter for the
>>> LIKE clause in C.  It returns the same results, and has the benefit of
>>> being a bit more clear to SQLite what we're trying to accomplish.  I'm
>>> tempted to switch our code to using this new format, but wanted some
>>> comments first.  I have not yet run extensive timing or other analysis
>>> on the performance.
>>>
>>> Thoughts?
>>>
>>> -Hyrum
>> Can't be right. I'm assuming the first query works correctly iff:
>>    ?2 = foo
>>    ?3 = foo/%
>>
>> and returns 'foo' and all its subtree.
>>
>> The second query can't return the same results; if ?2=foo, it'll match
>> foobar, which is not foo's child; if ?2=foo/, it won't return foo.
> That's what I get for writing mail at 3am.
>
> I believe the following would fix this:
>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR substr(local_relpath, 1,
> length(?2 + 1)) = ?2 || '/')
>
> -Hyrum

That query used to be:

    local_relpath=?2 OR local_relpath LIKE ?2 || '/%'

but, for obvious reasons, that was a potential bug since literal % were
not escaped. Your latest proposal is broken, but I'm sure you'll find
the bug eventually. :)

Whether substr can be faster than LIKE -- I have no idea.

-- Brane

Re: SQLite and the LIKE operator

Posted by Hyrum K Wright <hy...@hyrumwright.org>.
2011/5/16 Branko Čibej <br...@e-reka.si>:
> On 16.05.2011 03:13, Hyrum K Wright wrote:
>> Several places in wc_db we use the following pattern to select all
>> nodes with a common tree ancestor:
>>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3 ESCAPE '#')
>>
>> While this works, there was some concern about whether or not SQLite
>> was using the proper indicies when executing this query.  By examining
>> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
>> statements, I believe it does use the indicies as intended.
>>
>> However, I stumbled across an alternate implementation which I believe
>> has some merit.  Instead of the above clause, we could use:
>>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2
>>
>> This also avoids a table scan by making use of the indicies, but has
>> the advantage of not having to compute a separate parameter for the
>> LIKE clause in C.  It returns the same results, and has the benefit of
>> being a bit more clear to SQLite what we're trying to accomplish.  I'm
>> tempted to switch our code to using this new format, but wanted some
>> comments first.  I have not yet run extensive timing or other analysis
>> on the performance.
>>
>> Thoughts?
>>
>> -Hyrum
>
> Can't be right. I'm assuming the first query works correctly iff:
>    ?2 = foo
>    ?3 = foo/%
>
> and returns 'foo' and all its subtree.
>
> The second query can't return the same results; if ?2=foo, it'll match
> foobar, which is not foo's child; if ?2=foo/, it won't return foo.

That's what I get for writing mail at 3am.

I believe the following would fix this:
 WHERE wc_id = ?1 AND (local_relpath = ?2 OR substr(local_relpath, 1,
length(?2 + 1)) = ?2 || '/')

-Hyrum

Re: SQLite and the LIKE operator

Posted by Branko Čibej <br...@e-reka.si>.
On 16.05.2011 03:13, Hyrum K Wright wrote:
> Several places in wc_db we use the following pattern to select all
> nodes with a common tree ancestor:
>  WHERE wc_id = ?1 AND (local_relpath = ?2 OR local_relpath LIKE ?3 ESCAPE '#')
>
> While this works, there was some concern about whether or not SQLite
> was using the proper indicies when executing this query.  By examining
> the output for 'EXPLAIN QUERY PLAN' on some of the relevant SELECT
> statements, I believe it does use the indicies as intended.
>
> However, I stumbled across an alternate implementation which I believe
> has some merit.  Instead of the above clause, we could use:
>  WHERE wc_id = ?1 AND substr(local_relpath, 1, length(?2)) = ?2
>
> This also avoids a table scan by making use of the indicies, but has
> the advantage of not having to compute a separate parameter for the
> LIKE clause in C.  It returns the same results, and has the benefit of
> being a bit more clear to SQLite what we're trying to accomplish.  I'm
> tempted to switch our code to using this new format, but wanted some
> comments first.  I have not yet run extensive timing or other analysis
> on the performance.
>
> Thoughts?
>
> -Hyrum

Can't be right. I'm assuming the first query works correctly iff:
    ?2 = foo
    ?3 = foo/%

and returns 'foo' and all its subtree.

The second query can't return the same results; if ?2=foo, it'll match
foobar, which is not foo's child; if ?2=foo/, it won't return foo.

-- Brane