You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@couchdb.apache.org by Dirkjan Ochtman <di...@ochtman.nl> on 2009/11/24 20:33:43 UTC

Improving pagination for views

This weekend I tried to implement pagination for one of the modules of
a Couch-based site I maintain. Here's a rambling account of things
that I don't like and ideas about how maybe it could be improved.

In short: it kind of sucks. I usually prefer to get a list of pages so
people can jump around a little bit more, and because it provides
feedback about the size of an item list (I was working with a photo
gallery in this case). On MySQL-based databases, I'd usually issue a
single count query, which could be quite fast, to count the matching
items, then divide that by page size, use the standard LIMIT/OFFSET
approach to pagination.

On Couch, as far as I can see, pre-issuing a count query means
actually retrieving all the row data, which doesn't seem to scale
nicely.

Apparently, the alternative solution is to page sequentially, meaning
I retrieve one extra document and pass its document id so that I can
start the next page from there. That gets me a single next page link
(works okayish so far). Then, if I also want to provide a previous
page link, I'd have to either pass current startkey to the next page
(ugly URLs, more state than I'd like; also, switching on existence of
parameter instead of just parameter value + default value to check for
the first page), or retrieve twice the page size + 1.

Conclusion: I'd like to be able to specify to a view, take a little
bit more time but give me some more metadata to work with.

Currently I get total rows for the view and offset from the first
index to the current startkey. In the CouchDB book's pagination story,
that means the offset changes on every page, and using limit means I
can't really use rows.length to calculate the offset of the end of
this result set. It would be swell if I could just get the index
offset for the startkey and the endkey, ignoring skip/limit and
whatever else. Since it seems retrieval of those index offsets would
not be too expensive (Jan say O(large m + n)), it seems like a nice
idea to provide e.g. a meta=true option on views that will do the
extra work to retrieve those indexes for me and make pagination a
whole lot easier.

Sorry for the rambling, I don't think I can express it any clearer (I
tried really hard in IRC, and it still took a few hours to get some
people to understand all this). Does this sound at all sensible?

Cheers,

Dirkjan

Re: Improving pagination for views

Posted by Paul Davis <pa...@gmail.com>.
On Sat, Nov 28, 2009 at 7:38 PM, Adam Kocoloski <ko...@apache.org> wrote:
> On Nov 24, 2009, at 2:59 PM, Paul Davis wrote:
>
>> Or using the standard row count reduce:
>>
>> function(keys, values, rereduce) {
>>  if(rereduce) return sum(values);
>>  return values.length;
>> }
>>
>> It occurs to me that we should add a _row_count Erlang reduce function
>> as its slightly different than _sum.
>
> Hi Paul, it occurred to someone else, too: "_count" is already a built-in, defined as above.  Best,
>
> Adam

Well played.

Re: Improving pagination for views

Posted by Adam Kocoloski <ko...@apache.org>.
On Nov 24, 2009, at 2:59 PM, Paul Davis wrote:

> Or using the standard row count reduce:
> 
> function(keys, values, rereduce) {
>  if(rereduce) return sum(values);
>  return values.length;
> }
> 
> It occurs to me that we should add a _row_count Erlang reduce function
> as its slightly different than _sum.

Hi Paul, it occurred to someone else, too: "_count" is already a built-in, defined as above.  Best,

Adam

Re: Improving pagination for views

Posted by Paul Davis <pa...@gmail.com>.
>> Paging backwards works by using the current first key in the result
>> set and using reverse=true.
>
> Just a quick question regarding this "reverse" thing: will you get the
> records in reverse order?

Yup.

Re: Improving pagination for views

Posted by Vlad GURDIGA <gu...@gmail.com>.
On Tue, Nov 24, 2009 at 9:59 PM, Paul Davis <pa...@gmail.com> wrote:
> On Tue, Nov 24, 2009 at 2:33 PM, Dirkjan Ochtman <di...@ochtman.nl> wrote:
>> This weekend I tried to implement pagination for one of the modules of
>> a Couch-based site I maintain. Here's a rambling account of things
>> that I don't like and ideas about how maybe it could be improved.
>>
>> In short: it kind of sucks. I usually prefer to get a list of pages so
>> people can jump around a little bit more, and because it provides
>> feedback about the size of an item list (I was working with a photo
>> gallery in this case). On MySQL-based databases, I'd usually issue a
>> single count query, which could be quite fast, to count the matching
>> items, then divide that by page size, use the standard LIMIT/OFFSET
>> approach to pagination.
>>
>> On Couch, as far as I can see, pre-issuing a count query means
>> actually retrieving all the row data, which doesn't seem to scale
>> nicely.
>
> Or using the standard row count reduce:
>
> function(keys, values, rereduce) {
>  if(rereduce) return sum(values);
>  return values.length;
> }
>
> It occurs to me that we should add a _row_count Erlang reduce function
> as its slightly different than _sum.
>
>>
>> Apparently, the alternative solution is to page sequentially, meaning
>> I retrieve one extra document and pass its document id so that I can
>> start the next page from there. That gets me a single next page link
>> (works okayish so far). Then, if I also want to provide a previous
>> page link, I'd have to either pass current startkey to the next page
>> (ugly URLs, more state than I'd like; also, switching on existence of
>> parameter instead of just parameter value + default value to check for
>> the first page), or retrieve twice the page size + 1.
>>
>
> Paging backwards works by using the current first key in the result
> set and using reverse=true.

Just a quick question regarding this "reverse" thing: will you get the
records in reverse order?

>
>> Conclusion: I'd like to be able to specify to a view, take a little
>> bit more time but give me some more metadata to work with.
>>
>> Currently I get total rows for the view and offset from the first
>> index to the current startkey. In the CouchDB book's pagination story,
>> that means the offset changes on every page, and using limit means I
>> can't really use rows.length to calculate the offset of the end of
>> this result set. It would be swell if I could just get the index
>> offset for the startkey and the endkey, ignoring skip/limit and
>> whatever else. Since it seems retrieval of those index offsets would
>> not be too expensive (Jan say O(large m + n)), it seems like a nice
>> idea to provide e.g. a meta=true option on views that will do the
>> extra work to retrieve those indexes for me and make pagination a
>> whole lot easier.
>
> I'm not sure what you mean by expensive. The current method for
> start/end keys is to do a tree descent to the start key and then do a
> linear seek along the leaves to the end key (but maybe stopping
> prematurely if we process Limit rows after we ignore Skip rows).
> Getting indexes for start and end keys could very well be done in two
> tree descents though. Or you could use a reduce query to get the row
> count.
>
> The underlying issue is that the implementation of skip isn't overly
> efficient because it requires scanning the index linearly to discard
> rows. Similar to why OFFSET in SQL can start to take time when you
> specify offsets of > 1M records depending on the query. Strictly
> speaking this is only due to the implementation. I wrote an optimized
> skip once to see if I understood how the btree worked. It basically
> worked like a fast forward by running up and down the tree instead of
> seeking linearly. I never got around to testing it thoroughly from a
> performance aspect as it was just to prove to myself I understood how
> the couch_btree:stream* functions worked. And its not implementable
> with grouped reduce queries since we don't persist the reduce results
> in a btree.
>
>> Sorry for the rambling, I don't think I can express it any clearer (I
>> tried really hard in IRC, and it still took a few hours to get some
>> people to understand all this). Does this sound at all sensible?
>>
>> Cheers,
>>
>> Dirkjan
>>
>
> HTH,
> Paul Davis
>

Re: Improving pagination for views

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Nov 24, 2009 at 2:33 PM, Dirkjan Ochtman <di...@ochtman.nl> wrote:
> This weekend I tried to implement pagination for one of the modules of
> a Couch-based site I maintain. Here's a rambling account of things
> that I don't like and ideas about how maybe it could be improved.
>
> In short: it kind of sucks. I usually prefer to get a list of pages so
> people can jump around a little bit more, and because it provides
> feedback about the size of an item list (I was working with a photo
> gallery in this case). On MySQL-based databases, I'd usually issue a
> single count query, which could be quite fast, to count the matching
> items, then divide that by page size, use the standard LIMIT/OFFSET
> approach to pagination.
>
> On Couch, as far as I can see, pre-issuing a count query means
> actually retrieving all the row data, which doesn't seem to scale
> nicely.

Or using the standard row count reduce:

function(keys, values, rereduce) {
  if(rereduce) return sum(values);
  return values.length;
}

It occurs to me that we should add a _row_count Erlang reduce function
as its slightly different than _sum.

>
> Apparently, the alternative solution is to page sequentially, meaning
> I retrieve one extra document and pass its document id so that I can
> start the next page from there. That gets me a single next page link
> (works okayish so far). Then, if I also want to provide a previous
> page link, I'd have to either pass current startkey to the next page
> (ugly URLs, more state than I'd like; also, switching on existence of
> parameter instead of just parameter value + default value to check for
> the first page), or retrieve twice the page size + 1.
>

Paging backwards works by using the current first key in the result
set and using reverse=true.

> Conclusion: I'd like to be able to specify to a view, take a little
> bit more time but give me some more metadata to work with.
>
> Currently I get total rows for the view and offset from the first
> index to the current startkey. In the CouchDB book's pagination story,
> that means the offset changes on every page, and using limit means I
> can't really use rows.length to calculate the offset of the end of
> this result set. It would be swell if I could just get the index
> offset for the startkey and the endkey, ignoring skip/limit and
> whatever else. Since it seems retrieval of those index offsets would
> not be too expensive (Jan say O(large m + n)), it seems like a nice
> idea to provide e.g. a meta=true option on views that will do the
> extra work to retrieve those indexes for me and make pagination a
> whole lot easier.

I'm not sure what you mean by expensive. The current method for
start/end keys is to do a tree descent to the start key and then do a
linear seek along the leaves to the end key (but maybe stopping
prematurely if we process Limit rows after we ignore Skip rows).
Getting indexes for start and end keys could very well be done in two
tree descents though. Or you could use a reduce query to get the row
count.

The underlying issue is that the implementation of skip isn't overly
efficient because it requires scanning the index linearly to discard
rows. Similar to why OFFSET in SQL can start to take time when you
specify offsets of > 1M records depending on the query. Strictly
speaking this is only due to the implementation. I wrote an optimized
skip once to see if I understood how the btree worked. It basically
worked like a fast forward by running up and down the tree instead of
seeking linearly. I never got around to testing it thoroughly from a
performance aspect as it was just to prove to myself I understood how
the couch_btree:stream* functions worked. And its not implementable
with grouped reduce queries since we don't persist the reduce results
in a btree.

> Sorry for the rambling, I don't think I can express it any clearer (I
> tried really hard in IRC, and it still took a few hours to get some
> people to understand all this). Does this sound at all sensible?
>
> Cheers,
>
> Dirkjan
>

HTH,
Paul Davis

Re: Improving pagination for views

Posted by Brian Candler <B....@pobox.com>.
On Tue, Nov 24, 2009 at 08:33:43PM +0100, Dirkjan Ochtman wrote:
> On Couch, as far as I can see, pre-issuing a count query means
> actually retrieving all the row data, which doesn't seem to scale
> nicely.

Not true: you have a count reduce function. Because each node of the Btree
contains the already-calculated reduce value, this is very efficient. At
best, it just reads the value out of the root node. At worst, for a key
range, it will combine (re-reduce) count values for the subtrees within the
key range, and count a few loose values at each end of the range.

Where it's inefficient is when retrieving: if I understand it correctly,
using offset=50000 *will* require walking through 50,000 rows in the index.

That's unless each Btree node has a calculated 'number of rows' value which
would allow chunks of tree to be skipped over quickly. I haven't heard
anyone say that it does, but I could be wrong.

For now, the best advice seems to be use 'next' and 'prev' links, using
descending=true to get the values for the previous page.