Posted to user@couchdb.apache.org by Nicholas Retallack <ni...@gmail.com> on 2008/11/17 04:54:48 UTC

multiple keys and sorting

I ran into this problem today where I wanted to find documents by one
criterion but sort and paginate them by another.  I'll simplify my example to
make it easier to explain.

Let's say we have documents of the form {"dude":"joe", "date":"2008-11-11"},
and there are hundreds of them.  Now, let's say we have a set of dudes
["joe", "bob", "fred"] and we want to find all the documents among the three
of them.  That's easy enough in 3 queries.  But what if we want to sort by
date and paginate it?  Can't really do it.  So, how to work around it?

We could do 3 queries anyway, each with the maximum number of documents on
the page, then sort them ourselves and throw away the extras.  This gets
messy fast though, when you try to keep track of where to start the second
page in each query.
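
To make the bookkeeping concrete, here is a rough sketch of the merge step. It assumes each per-dude result list is already sorted by date, and it uses in-memory arrays as a stand-in for real query results. The name mergePage and the offsets array are made up for illustration.

```javascript
// Hypothetical helper: merge per-dude result lists (each already sorted by
// date) and cut one page, recording how far into each list we have read.
function mergePage(lists, offsets, pageSize) {
  var next = offsets.slice(); // copy so the caller's offsets are untouched
  var page = [];
  while (page.length < pageSize) {
    var best = -1;
    for (var i = 0; i < lists.length; i++) {
      if (next[i] >= lists[i].length) continue; // this list is exhausted
      if (best === -1 ||
          lists[i][next[i]].date < lists[best][next[best]].date) {
        best = i; // earliest date seen so far
      }
    }
    if (best === -1) break; // every list is exhausted
    page.push(lists[best][next[best]]);
    next[best]++;
  }
  return { page: page, offsets: next };
}

// Sample data standing in for three query results.
var lists = [
  [{ dude: "joe", date: "2008-11-01" }, { dude: "joe", date: "2008-11-05" }],
  [{ dude: "bob", date: "2008-11-02" }],
  [{ dude: "fred", date: "2008-11-03" }, { dude: "fred", date: "2008-11-04" }]
];
var first = mergePage(lists, [0, 0, 0], 3);
var second = mergePage(lists, first.offsets, 3);
```

The offsets returned with one page are what you would have to carry over to start the next page in each query, which is the messy part.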

We could turn the relation around and make dude_groups a field in our
document, and have a view keyed by this and the date.  This works okay if
there is a fixed set of dude groups you'll be looking for.  However, if you
decide to change who's in one of the dude groups, you have a massive update
on your hands.
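
For reference, a minimal sketch of what such a view's map function might look like. It assumes dude_groups is stored as an array on each document, and it stubs out emit() so the snippet runs outside CouchDB. The group names are invented.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Assumption: each document carries a "dude_groups" array.
// Emits one row per group, keyed by [group, date].
function mapByGroupAndDate(doc) {
  (doc.dude_groups || []).forEach(function (group) {
    emit([group, doc.date], null);
  });
}

mapByGroupAndDate({
  dude: "joe",
  date: "2008-11-11",
  dude_groups: ["team-a", "team-b"] // hypothetical group names
});
```

A single range query from ["team-a"] to ["team-a", {}] would then return that group's rows in date order, at the cost of rewriting every affected document when group membership changes.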

What would you guys suggest?

Re: multiple keys and sorting

Posted by Simon Metson <si...@googlemail.com>.

> These second level caches could be query-able just like views are, but one
> level removed.  We could even specify them with permanent javascript methods
> so you could query on them instead of the underlying views.  What do you
> guys think?

This is exactly the kind of thing I'd like to use in my application.  
We will have billions of records and being able to create "views of  
views" would be absolutely ace. We have a sort of three tiered  
hierarchy of records, and at some point a tier in the hierarchy is  
completed, hence the result of the view for that bit of the collection  
won't change much. I could store the values of some map/reduces in the  
higher tier but I'm pretty sure that it will end up getting  
inconsistent.
I'd be happy to put some effort into "views of views", but first I
need to learn Erlang (or get one of our other developers to ;)).
Cheers
Simon

Re: multiple keys and sorting

Posted by Nicholas Retallack <ni...@gmail.com>.

I ended up just doing all the queries every time, sorting the results, and
clipping them.  That means I fetched the full result set on every page
request.  It was fast enough.

An easy improvement would be to do the multiple queries, sort them, and then
cache that somewhere similar to a view in case the user asks for another
page.  It would be great if that cache also knew when it should be
invalidated -- that is, when the underlying views changed.  Also, it would
be even cooler if it calculated how often this cached set was needed versus
others, so one-offs didn't stick around too long.
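
A very rough sketch of the invalidation part, to show what I mean. It assumes the caller can supply some sequence number that changes whenever the underlying data changes (CouchDB reports an update_seq in its database info). makeCache and everything inside it are hypothetical names.

```javascript
// Hypothetical cache: an entry is reused only while the sequence number it
// was computed at still matches the current one.
function makeCache() {
  var entries = {};
  return {
    get: function (key, currentSeq, compute) {
      var hit = entries[key];
      if (hit && hit.seq === currentSeq) {
        return hit.rows; // nothing changed since we cached this
      }
      var rows = compute(); // run the queries, merge and sort
      entries[key] = { seq: currentSeq, rows: rows };
      return rows;
    }
  };
}

var cache = makeCache();
var computeCount = 0;
function expensiveMerge() {
  computeCount++;
  return ["row1", "row2"];
}
cache.get("joe,bob,fred", 10, expensiveMerge); // computes
cache.get("joe,bob,fred", 10, expensiveMerge); // served from cache
cache.get("joe,bob,fred", 11, expensiveMerge); // data changed, recomputes
```

This ignores the usage-frequency idea entirely; it only covers knowing when a cached set is stale.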

These second level caches could be query-able just like views are, but one
level removed.  We could even specify them with permanent javascript methods
so you could query on them instead of the underlying views.  What do you
guys think?

On Mon, Nov 17, 2008 at 7:26 AM, Chris Anderson <jc...@apache.org> wrote:

> On Sun, Nov 16, 2008 at 7:54 PM, Nicholas Retallack
> <ni...@gmail.com> wrote:
> > We could do 3 queries anyway, each with the maximum number of documents on
> > the page, then sort them ourselves and throw away the extras.  This gets
> > messy fast though, when you try to keep track of where to start the second
> > page in each query.
> >
> > What would you guys suggest?
> >
>
> I think the smart money would be in working this way, with keys like
> ["joe", "2008-11-1"] and running 3 queries. You're right about the
> tough part being paginating the subsequent pages, because you don't
> know where to start with each one until they've been merged and
> sorted.
>
> Unless you can come up with a collation key that does the work
> magically ("dude_groups") any kind of multi-range extension to couch
> would have the same problems your application is facing. This is the
> type of thing that belongs in a library, I think.
>
> An interesting place to put the library might be as an _external
> action, if you really want to keep it out of the client, but external
> _actions have the dubious distinction of using server resources to do
> what could be equally well done on the client. (Also they are not yet
> in CouchDB trunk.)
>
> Chris
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: multiple keys and sorting

Posted by Chris Anderson <jc...@apache.org>.

On Sun, Nov 16, 2008 at 7:54 PM, Nicholas Retallack
<ni...@gmail.com> wrote:
> We could do 3 queries anyway, each with the maximum number of documents on
> the page, then sort them ourselves and throw away the extras.  This gets
> messy fast though, when you try to keep track of where to start the second
> page in each query.
>
> What would you guys suggest?
>

I think the smart money would be in working this way, with keys like
["joe", "2008-11-1"] and running 3 queries. You're right about the
tough part being paginating the subsequent pages, because you don't
know where to start with each one until they've been merged and
sorted.
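
As a sketch of the view side, assuming documents shaped like the example in the original post: the map function emits [dude, date], and each of the 3 queries is a key range over one dude. The emit() stub and the rangeFor helper are illustrative names, not part of CouchDB.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Key each document by [dude, date] so one dude's rows sort by date.
function mapByDudeAndDate(doc) {
  emit([doc.dude, doc.date], null);
}

// Hypothetical helper: startkey/endkey values covering one dude's rows.
// In view collation an object sorts after any string, so [dude, {}]
// sits past every [dude, "<date>"] key.
function rangeFor(dude) {
  return {
    startkey: JSON.stringify([dude]),
    endkey: JSON.stringify([dude, {}])
  };
}

mapByDudeAndDate({ dude: "joe", date: "2008-11-11" });
var joeRange = rangeFor("joe");
```

Each range comes back sorted by date, so the client only has to merge the three lists.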

Unless you can come up with a collation key that does the work
magically ("dude_groups") any kind of multi-range extension to couch
would have the same problems your application is facing. This is the
type of thing that belongs in a library, I think.

An interesting place to put the library might be as an _external
action, if you really want to keep it out of the client, but external
_actions have the dubious distinction of using server resources to do
what could be equally well done on the client. (Also they are not yet
in CouchDB trunk.)

Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: multiple keys and sorting

Posted by Adam Groves <ad...@gmail.com>.

Hi Nicholas,

I just ran into the sort by date thing last night. My solution was:

function(doc) {
  // Strip every non-digit so the timestamp becomes a sortable digit string.
  var time = doc.created_at.replace(/\D/g, "");
  emit([doc.document_id, time], doc);
}

to convert my time stamp (also in the form of a string like yours)
into a string of digits which can be sorted.
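
The digit-stripping step can be tried on its own. The timestamp format below is only an assumption about what created_at might hold.

```javascript
// Hypothetical timestamps; only the digits survive the replace().
var earlier = "2008-11-11 09:30:00".replace(/\D/g, "");
var later = "2008-11-17 04:54:48".replace(/\D/g, "");

// Both results are strings of equal length, so plain string comparison
// orders them chronologically.
var inOrder = earlier < later;
```

Note that replace() returns a string, so the ordering relies on every timestamp having the same number of digits.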

Regarding pagination, see my blog post:
http://addywaddy.posterous.com/couchdb-and-pagination

Cheers!

Adam
