You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-user@db.apache.org by Geoff hendrey <ge...@yahoo.com> on 2009/05/02 22:22:21 UTC

The result offset and fetch first clauses

Congratulations on implementing "The result offset and fetch first clauses". I'd be even more excited if I hadn't just finished implementing this at the application-layer on top of derby 10.4, but hey, I'm happy to throw away my code now that derby has this internalized.

I have a couple questions before I throw away my code:

1) can I use this feature even if I am ordering on multiple columns?

2) what are the performance implications for users of the embedded driver? In particular, with the embedded driver I am hoping that this feature allows portions of a result set to be retrieved without the overhead of retrieving the entire result set. For example, if I have a million rows in a product catalog, and a user of my web app wants to sort by product name and jump to a particular portion of the result set, I was hoping this would be efficient in your implementation.

If (2) is not efficient, how does it compare to the efficiency of the following approach:

Get the result set. Use a loop to increment integer n by PAGE_SIZE, and inside the loop use ResultSet.absolute(n) combined with stmt.setFetchSize(1) to retrieve a "marker" row that signifies the begining of each "page" of the result set. I use the primary keys of these "markers" as page boundaries so that my web application can provide links to a set of pages evenly distributes throughout the result set.

 -geoff
“XML? Too much like HTML. It'll never work on the Web!” 
-anonymous

Re: The result offset and fetch first clauses

Posted by Alan Burlison <Al...@sun.com>.

Knut Anders Hatlen wrote:

>> My understanding is that for queries that use >=, =< etc Derby can
>> already use an index scan if the column being compared has an index on
>> it - in my case it does.  So by switching to RESULT/OFFSET I'd lose
>> that benefit, correct?
> 
> Yes, with >= an index scan can go directly to the first row
> requested. With an offset clause the index scan would need to start at
> the beginning of the index and work its way forward, counting rows. This
> is because the index scans don't have an interface that allows you to
> say "position on the Nth row". One part of the problem is that the index
> may contain deleted rows, newly inserted rows that have not yet been
> committed, and deleted rows that have not been committed, so for OFFSET
> to work correctly, the scan needs to not count the deleted rows, and
> possibly wait for uncommitted changes to be committed depending on the
> isolation level.

That all makes sense - thanks for the explanation.

> Another aspect is that using >= on the index is probably more likely to
> yield the correct results in the scenario you're describing, as OFFSET
> doesn't guarantee that you start right after the last result on the
> previous page if the table has been modified (inserts/deletes) in
> between.

Yes, not a major issue in our case as the tables don't change that 
often, but being both correct and fast seems like a win-win situation.

>>> Often, this feature is used together with ORDER BY which would entail
>>> some sorting of the result set and then all the rows would have to be
>>> read anyway. Again, for some simple queries, sort avoidance is used by
>>> the optimizer, so optimization is still possible for for such queries.
>> What if the ORDER BY clause only uses indexed columns?  Presumably
>> Derby can just return the rows in index order in that case, and no
>> sort is required?
> 
> Yes, the optimizer will try to avoid sorts if possible, also if the
> query contains an OFFSET clause.

In my case there is an ORDER BY, but it is always on the indexed column, 
so it sound like I'm in the zone there too.

> Note that Derby doesn't currently walk indexes backwards, so a sort is
> only avoided if the index is in the same order as the requested ordering
> of the result. CREATE INDEX allows you to create both ascending and
> descending indexes.
> http://db.apache.org/derby/docs/10.5/ref/rrefsqlj20937.html

For backwards scrolling the query uses <= and DESC but there's only an 
ASC index so I'm probably losing out there.  However backwards scrolling 
  happens less frequently, so that's a reasonable compromise between 
index overhead and performance.

Thanks for the info, very helpful :-)

-- 
Alan Burlison
--

Re: The result offset and fetch first clauses

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.

Alan Burlison <Al...@Sun.COM> writes:

> Dag H. Wanvik wrote:
>
>> I am afraid that with embedded driver, you will only save a little CPU
>> (by avoiding some JDBC calls) since under the hood, the code siphons
>> off the rows till it hits the offset, so if you have a large offset,
>> you will still incur reading of those rows (modulo page caching). In
>> client/server driver context the savings are larger, of course, in
>> that fewer rows are sent over the wire. For simple queries that can
>> use an index, the optimizer could make use of the offset information
>> to avoid reading the entire row when skipping rows before offset, just
>> counting rows in the index to get to the first qualifying row, but
>> this optimization is not yet implemented.
>
> My understanding is that for queries that use >=, =< etc Derby can
> already use an index scan if the column being compared has an index on
> it - in my case it does.  So by switching to RESULT/OFFSET I'd lose
> that benefit, correct?

Yes, with >= an index scan can go directly to the first row
requested. With an offset clause the index scan would need to start at
the beginning of the index and work its way forward, counting rows. This
is because the index scans don't have an interface that allows you to
say "position on the Nth row". One part of the problem is that the index
may contain deleted rows, newly inserted rows that have not yet been
committed, and deleted rows that have not been committed, so for OFFSET
to work correctly, the scan needs to not count the deleted rows, and
possibly wait for uncommitted changes to be committed depending on the
isolation level.

Another aspect is that using >= on the index is probably more likely to
yield the correct results in the scenario you're describing, as OFFSET
doesn't guarantee that you start right after the last result on the
previous page if the table has been modified (inserts/deletes) in
between.

>> Often, this feature is used together with ORDER BY which would entail
>> some sorting of the result set and then all the rows would have to be
>> read anyway. Again, for some simple queries, sort avoidance is used by
>> the optimizer, so optimization is still possible for for such queries.
>
> What if the ORDER BY clause only uses indexed columns?  Presumably
> Derby can just return the rows in index order in that case, and no
> sort is required?

Yes, the optimizer will try to avoid sorts if possible, also if the
query contains an OFFSET clause.

Note that Derby doesn't currently walk indexes backwards, so a sort is
only avoided if the index is in the same order as the requested ordering
of the result. CREATE INDEX allows you to create both ascending and
descending indexes.
http://db.apache.org/derby/docs/10.5/ref/rrefsqlj20937.html

-- 
Knut Anders

Re: The result offset and fetch first clauses

Posted by Alan Burlison <Al...@sun.com>.

Dag H. Wanvik wrote:

> I am afraid that with embedded driver, you will only save a little CPU
> (by avoiding some JDBC calls) since under the hood, the code siphons
> off the rows till it hits the offset, so if you have a large offset,
> you will still incur reading of those rows (modulo page caching). In
> client/server driver context the savings are larger, of course, in
> that fewer rows are sent over the wire. For simple queries that can
> use an index, the optimizer could make use of the offset information
> to avoid reading the entire row when skipping rows before offset, just
> counting rows in the index to get to the first qualifying row, but
> this optimization is not yet implemented.

My understanding is that for queries that use >=, =< etc Derby can 
already use an index scan if the column being compared has an index on 
it - in my case it does.  So by switching to RESULT/OFFSET I'd lose that 
benefit, correct?

> Often, this feature is used together with ORDER BY which would entail
> some sorting of the result set and then all the rows would have to be
> read anyway. Again, for some simple queries, sort avoidance is used by
> the optimizer, so optimization is still possible for for such queries.

What if the ORDER BY clause only uses indexed columns?  Presumably Derby 
can just return the rows in index order in that case, and no sort is 
required?

-- 
Alan Burlison
--

Re: The result offset and fetch first clauses

Posted by "Dag H. Wanvik" <Da...@Sun.COM>.

Geoff hendrey <ge...@yahoo.com> writes:

> Could we discuss this further on this list, and decide if an issue
> should be opened for optimization?

I think it it fine to open an issue for optimizing this.

Dag

Re: The result offset and fetch first clauses

Posted by Geoff hendrey <ge...@yahoo.com>.

Thanks for the feedback.

I guess it *is* too soon for me to throw away my code :-)

I do think it is important for Derby to optimize the OFFSET/NEXT implementation. "Paging" is inherently something you do when you have large result sets. If OFFSET has to read process all of the result set up to the offset, then it really defeats the purpose and doesn't seem like anything more than a syntactic shortcut.

Could we discuss this further on this list, and decide if an issue should be opened for optimization?

 -geoff
“XML? Too much like HTML. It'll never work on the Web!” 
-anonymous 

________________________________
From: Dag H. Wanvik <Da...@Sun.COM>
To: Derby Discussion <de...@db.apache.org>
Sent: Monday, May 4, 2009 5:19:15 AM
Subject: Re: The result offset and fetch first clauses

Geoff hendrey <ge...@yahoo.com> writes:

> 1) can I use this feature even if I am ordering on multiple columns?

As Tiago said, the feature works on all result sets, including
updatable ones.

>
> 2) what are the performance implications for users of the embedded
> driver? In particular, with the embedded driver I am hoping that
> this feature allows portions of a result set to be retrieved without
> the overhead of retrieving the entire result set. For example, if I

I am afraid that with embedded driver, you will only save a little CPU
(by avoiding some JDBC calls) since under the hood, the code siphons
off the rows till it hits the offset, so if you have a large offset,
you will still incur reading of those rows (modulo page caching). In
client/server driver context the savings are larger, of course, in
that fewer rows are sent over the wire. For simple queries that can
use an index, the optimizer could make use of the offset information
to avoid reading the entire row when skipping rows before offset, just
counting rows in the index to get to the first qualifying row, but
this optimization is not yet implemented.

Often, this feature is used together with ORDER BY which would entail
some sorting of the result set and then all the rows would have to be
read anyway. Again, for some simple queries, sort avoidance is used by
the optimizer, so optimization is still possible for for such queries.

If you think this optimization is an important capability feel free to
file an improvement issue for it.

Any rows after the FETCH n rows are not processed for simple
queries. For complex queries they might be read to perform sorting.

> If (2) is not efficient, how does it compare to the efficiency of the following approach:
>
> Get the result set. Use a loop to increment integer n by PAGE_SIZE,
> and inside the loop use ResultSet.absolute(n) combined with
> stmt.setFetchSize(1) to retrieve a "marker" row that signifies the
> begining of each "page" of the result set. I use the primary keys of
> these "markers" as page boundaries so that my web application can
> provide links to a set of pages evenly distributes throughout the
> result set.

This sounds like a resonable approach to me, when any concurrent
insert/deletes concerns are addressed, or are irrelevant.

Dag

Re: The result offset and fetch first clauses

Posted by "Dag H. Wanvik" <Da...@Sun.COM>.

Geoff hendrey <ge...@yahoo.com> writes:

> 1) can I use this feature even if I am ordering on multiple columns?

As Tiago said, the feature works on all result sets, including
updatable ones.

>
> 2) what are the performance implications for users of the embedded
> driver? In particular, with the embedded driver I am hoping that
> this feature allows portions of a result set to be retrieved without
> the overhead of retrieving the entire result set. For example, if I

I am afraid that with embedded driver, you will only save a little CPU
(by avoiding some JDBC calls) since under the hood, the code siphons
off the rows till it hits the offset, so if you have a large offset,
you will still incur reading of those rows (modulo page caching). In
client/server driver context the savings are larger, of course, in
that fewer rows are sent over the wire. For simple queries that can
use an index, the optimizer could make use of the offset information
to avoid reading the entire row when skipping rows before offset, just
counting rows in the index to get to the first qualifying row, but
this optimization is not yet implemented.

Often, this feature is used together with ORDER BY which would entail
some sorting of the result set and then all the rows would have to be
read anyway. Again, for some simple queries, sort avoidance is used by
the optimizer, so optimization is still possible for for such queries.

If you think this optimization is an important capability feel free to
file an improvement issue for it.

Any rows after the FETCH n rows are not processed for simple
queries. For complex queries they might be read to perform sorting.

> If (2) is not efficient, how does it compare to the efficiency of the following approach:
>
> Get the result set. Use a loop to increment integer n by PAGE_SIZE,
> and inside the loop use ResultSet.absolute(n) combined with
> stmt.setFetchSize(1) to retrieve a "marker" row that signifies the
> begining of each "page" of the result set. I use the primary keys of
> these "markers" as page boundaries so that my web application can
> provide links to a set of pages evenly distributes throughout the
> result set.

This sounds like a resonable approach to me, when any concurrent
insert/deletes concerns are addressed, or are irrelevant.

Dag

Re: The result offset and fetch first clauses

Posted by Geoff hendrey <ge...@yahoo.com>.

Hi Alan,

Yes, sounds like what we are doing is nearly identical. The reason for traversing the result set once, is to create a set of page boundaries that can support a web application that needs to build a set of links, like an index into a user directory (like the way LinkedIn has alphabetical links for all your connections). A few definitions for the query below:

orderColumn is the name of the column on which we want to order. 
startAfterVal is a value for the orderColumn representing a "page boundary"
The JDBC driver is set to return a maximum number of results equal to the desired page size.

for ascending ordering:
...WHERE (orderColumn = startAfterVal AND PK > startAfterPK) OR orderColumn > startAfterVal ORDER BY orderColumn ASC, PK ASK

for descending queries:
...WHERE (orderColumn = startAfterVal AND PK < startAfterPK) OR
orderColumn < startAfterVal ORDER BY orderColumn DESC, PK DESC

The portion of the query that says "(orderCOlumn = startAfterVal AND PK [<|>] startAfterPK)" insures consistent scroll ordering in the following case:

imagine you have a very large user directory table with thousands of rows with a LASTNAME column equal to "SMITH". Then you excecute the query above, using LASTNAME as the orderColumn. The afformentioned portion of the query insures that as you page forward and backward you are not getting random SMITH rows, but rather the same SMITH rows in identical order. Very important for any application like a phone book or user directory.

 -geoff
“XML? Too much like HTML. It'll never work on the Web!” 
-anonymous 

________________________________
From: Alan Burlison <Al...@sun.com>
To: Derby Discussion <de...@db.apache.org>
Sent: Sunday, May 3, 2009 2:45:48 AM
Subject: Re: The result offset and fetch first clauses

Geoff hendrey wrote:

> Get the result set. Use a loop to increment integer n by PAGE_SIZE,
> and inside the loop use ResultSet.absolute(n) combined with
> stmt.setFetchSize(1) to retrieve a "marker" row that signifies the
> begining of each "page" of the result set. I use the primary keys of
> these "markers" as page boundaries so that my web application can
> provide links to a set of pages evenly distributes throughout the
> result set.

I use something similar, except instead of traversing the entire result set and storing keys for each 'page' I retain the keys of the first and last rows in the current 'page'.  For subsequent fetches I use '> lastKey ... order by ... asc' to scroll forwards and '< firstKey ... order by ... desc' to scroll backwards.

I too would be interested to know how that approach compares to the new offset/fetch clauses.

-- Alan Burlison
--

Re: The result offset and fetch first clauses

Posted by Alan Burlison <Al...@sun.com>.

Geoff hendrey wrote:

> Get the result set. Use a loop to increment integer n by PAGE_SIZE,
> and inside the loop use ResultSet.absolute(n) combined with
> stmt.setFetchSize(1) to retrieve a "marker" row that signifies the
> begining of each "page" of the result set. I use the primary keys of
> these "markers" as page boundaries so that my web application can
> provide links to a set of pages evenly distributes throughout the
> result set.

I use something similar, except instead of traversing the entire result 
set and storing keys for each 'page' I retain the keys of the first and 
last rows in the current 'page'.  For subsequent fetches I use '> 
lastKey ... order by ... asc' to scroll forwards and '< firstKey ... 
order by ... desc' to scroll backwards.

I too would be interested to know how that approach compares to the new 
offset/fetch clauses.

-- 
Alan Burlison
--

Re: The result offset and fetch first clauses

Posted by Tiago Espinha <ti...@espinhas.net>.

Hello Geoff,

Right now I am able to answer 1). You can indeed use this feature even if
you are sorting by several columns. The OFFSET and FETCH FIRST filtering is
applied over the sorted result set, so yes, you can sort as you would do
otherwise and these clauses will simply limit the portion of the result set
that is retrieved.

Hope it helps,
Tiago Espinha

On Sat, May 2, 2009 at 9:22 PM, Geoff hendrey <ge...@yahoo.com>wrote:

> Congratulations on implementing "The result offset and fetch first
> clauses". I'd be even more excited if I hadn't just finished implementing
> this at the application-layer on top of derby 10.4, but hey, I'm happy to
> throw away my code now that derby has this internalized.
>
> I have a couple questions before I throw away my code:
>
> 1) can I use this feature even if I am ordering on multiple columns?
>
> 2) what are the performance implications for users of the embedded driver?
> In particular, with the embedded driver I am hoping that this feature allows
> portions of a result set to be retrieved without the overhead of retrieving
> the entire result set. For example, if I have a million rows in a product
> catalog, and a user of my web app wants to sort by product name and jump to
> a particular portion of the result set, I was hoping this would be efficient
> in your implementation.
>
> If (2) is not efficient, how does it compare to the efficiency of the
> following approach:
>
> Get the result set. Use a loop to increment integer n by PAGE_SIZE, and
> inside the loop use ResultSet.absolute(n) combined with stmt.setFetchSize(1)
> to retrieve a "marker" row that signifies the begining of each "page" of the
> result set. I use the primary keys of these "markers" as page boundaries so
> that my web application can provide links to a set of pages evenly
> distributes throughout the result set.
>
> -geoff
> “XML? Too much like HTML. It'll never work on the Web!”
> -anonymous
>
>