Posted to user@couchdb.apache.org by Jens Alfke <je...@mooseyard.com> on 2010/02/26 17:27:13 UTC

Consistent version of a view across multiple queries?

If an app wants to iterate over a large view, it seems better to page the output by issuing multiple queries, using the startkey= and limit= parameters. However, this seems to introduce race conditions if another client is meanwhile altering the database. I might see half of the documents before the change and half after. For example, I might see a document show up twice with two different key values.
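A minimal sketch of the race described above, using an in-memory dictionary as a stand-in for the view rather than a real CouchDB server. The helper names `query_view` and `paged_scan` and the sample documents are hypothetical; only the `startkey` and `limit` parameters come from the question. It assumes unique keys, so each page can begin at the key of the first row that was not yet returned.

```python
def query_view(index, startkey=None, limit=None):
    """Mimic a view query: rows sorted by key, filtered by startkey, capped by limit.

    `index` maps doc_id -> emitted key (an assumed, simplified view model).
    """
    rows = sorted((key, doc_id) for doc_id, key in index.items())
    if startkey is not None:
        rows = [row for row in rows if row[0] >= startkey]
    return rows[:limit]


def paged_scan(index, page_size, between_pages=None):
    """Read the whole view one page at a time.

    Each request asks for page_size + 1 rows; the extra row only supplies
    the startkey of the next page. `between_pages` is called once, after
    the first page, to simulate another client writing to the database.
    """
    seen = []
    startkey = None
    while True:
        rows = query_view(index, startkey=startkey, limit=page_size + 1)
        seen.extend(rows[:page_size])
        if len(rows) <= page_size:
            return seen
        startkey = rows[page_size][0]
        if between_pages is not None:
            between_pages(index)
            between_pages = None


def move_doc1(index):
    # Another client changes doc1 so that it is emitted under a later key.
    index["doc1"] = "e"


index = {"doc1": "a", "doc2": "b", "doc3": "c", "doc4": "d"}
rows = paged_scan(index, page_size=2, between_pages=move_doc1)
doc_ids = [doc_id for _, doc_id in rows]
print(doc_ids)
```

In this run `doc1` is returned on the first page under its old key and again on the last page under its new key, which is the duplicate the question describes.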

Is there any way to avoid this inconsistency? In a SQL database I'd use a transaction for this, to lock out any database updates in between my series of SELECTs. But CouchDB's architecture doesn't support that.

It seems like what I want is to specify some kind of clock (timestamp / revision #) in my view queries, so they all run over the exact same view b-tree. This seems straightforward at the level of the CouchDB file-format, since it's append-only and the previous view b-tree still exists in the file. But is this exposed in the API at all?

—Jens

Re: Consistent version of a view across multiple queries?

Posted by Chris Anderson <jc...@apache.org>.

On Sat, Feb 27, 2010 at 9:06 AM, Jens Alfke <je...@mooseyard.com> wrote:
>
> On Feb 26, 2010, at 10:56 AM, Chris Anderson wrote:
>
>> No, but it should be. I've been thinking about this for a while.
>
> Cool :) My immediate idea is to return a _rev key in a view result, like a document, whose value changes each time the view is rebuilt. In a query you could optionally add something like "&rev=" to specify which revision to use.
>

We should definitely be discussing this on dev@

In a nutshell what we've discussed before is basing the view etag on
the last seq-id of the database which changed anything in the index.
We already track this at a view-group (design document) level but
don't expose it. To do it for a single view in a group, we'd have to
do some additional coding.
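To make the idea concrete, here is a rough sketch of what a client could do if each view response carried such a version token: remember the token from the first page and restart the scan when a later page reports a different one. Everything here is an assumption for illustration. `FakeView` is an in-memory stand-in, its `seq` field plays the role of the not-yet-exposed token, and `consistent_scan` is a hypothetical helper, not a CouchDB API.

```python
class FakeView:
    """In-memory stand-in for a view index (not a CouchDB client)."""

    def __init__(self, rows):
        self.rows = dict(rows)  # doc_id -> emitted key
        self.seq = 0            # hypothetical version token for the index

    def update(self, doc_id, key):
        self.rows[doc_id] = key
        self.seq += 1           # any change to the index bumps the token

    def query(self, startkey=None, limit=None):
        rows = sorted((key, doc_id) for doc_id, key in self.rows.items())
        if startkey is not None:
            rows = [row for row in rows if row[0] >= startkey]
        return self.seq, rows[:limit]


def consistent_scan(view, page_size, max_attempts=5, between_pages=None):
    """Page through the view, retrying from scratch if the token changes.

    Assumes unique keys. `between_pages` is called once, after the first
    page, to simulate a concurrent writer.
    """
    for _ in range(max_attempts):
        token, result, startkey = None, [], None
        while True:
            seq, rows = view.query(startkey=startkey, limit=page_size + 1)
            if token is None:
                token = seq
            elif seq != token:
                break  # index changed mid-scan: discard this attempt
            result.extend(rows[:page_size])
            if len(rows) <= page_size:
                return result
            startkey = rows[page_size][0]
            if between_pages is not None:
                hook, between_pages = between_pages, None
                hook(view)
    raise RuntimeError("view kept changing; gave up")


view = FakeView({"doc1": "a", "doc2": "b", "doc3": "c", "doc4": "d"})
rows = consistent_scan(view, page_size=2,
                       between_pages=lambda v: v.update("doc1", "e"))
print(rows)
```

This only detects the inconsistency and retries; it does not pin the scan to an old version of the index, which is what the original question asks for.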

> Of course now you have to store a mapping from revision numbers to the location of the view's tree in the db file. A quick and dirty way to do this might be to optimize for only recently-obsoleted view results, and just chain the results in a linked list. So the internal data for each view b-tree would contain its _rev value, and the position in the file of the previous generation tree. [I don't know the details of the file format, though, so this might not make sense.]
>
>> Main complication is that the old seq might not be available if a view compaction completes in between queries.
>
> Yeah, eventually you always run into that :/ Maybe compaction could optionally preserve the last couple of generations of a view? Or just specific generations that have been actively used in queries in the last N minutes?

All this sounds doable at a cost of complexity. But we are getting
toward the time to think about "post-1.0" optimization etc, so it's
worth researching.

Chris

>
> —Jens



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Consistent version of a view across multiple queries?

Posted by Jens Alfke <je...@mooseyard.com>.

On Feb 26, 2010, at 10:56 AM, Chris Anderson wrote:

> No, but it should be. I've been thinking about this for a while.

Cool :) My immediate idea is to return a _rev key in a view result, like a document, whose value changes each time the view is rebuilt. In a query you could optionally add something like "&rev=" to specify which revision to use.

Of course now you have to store a mapping from revision numbers to the location of the view's tree in the db file. A quick and dirty way to do this might be to optimize for only recently-obsoleted view results, and just chain the results in a linked list. So the internal data for each view b-tree would contain its _rev value, and the position in the file of the previous generation tree. [I don't know the details of the file format, though, so this might not make sense.]

> Main complication is that the old seq might not be available if a view compaction completes in between queries.

Yeah, eventually you always run into that :/ Maybe compaction could optionally preserve the last couple of generations of a view? Or just specific generations that have been actively used in queries in the last N minutes?
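A rough sketch of the linked-list idea, purely as a data-structure illustration. It says nothing about the real CouchDB file format: `Generation`, `root_offset`, `find_generation` and `compact` are hypothetical names, and the offsets are made-up numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Generation:
    rev: int                      # hypothetical view revision number
    root_offset: int              # hypothetical file position of this b-tree root
    prev: Optional["Generation"]  # the generation this one superseded


def find_generation(head, rev):
    """Walk the chain from newest to oldest looking for `rev`."""
    node = head
    while node is not None:
        if node.rev == rev:
            return node
        node = node.prev
    return None


def compact(head, keep):
    """Return a new chain holding only the newest `keep` generations.

    A real compaction would also rewrite the trees and so change their
    offsets; they are carried over unchanged here to keep the sketch short.
    """
    kept = []
    node = head
    while node is not None and len(kept) < keep:
        kept.append(node)
        node = node.prev
    new_head = None
    for gen in reversed(kept):
        new_head = Generation(gen.rev, gen.root_offset, new_head)
    return new_head


# Three generations of the same view, newest last.
g1 = Generation(rev=1, root_offset=100, prev=None)
g2 = Generation(rev=2, root_offset=200, prev=g1)
g3 = Generation(rev=3, root_offset=300, prev=g2)

compacted = compact(g3, keep=2)
```

A query carrying "&rev=2" would call `find_generation` and read from the returned root. After `compact(g3, keep=2)`, a lookup for revision 1 returns `None`, which is the point at which the server would have to report that the requested generation is gone.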

—Jens

Re: Consistent version of a view across multiple queries?

Posted by Chris Anderson <jc...@gmail.com>.


On Feb 26, 2010, at 8:27 AM, Jens Alfke <je...@mooseyard.com> wrote:

> If an app wants to iterate over a large view, it seems better to  
> page the output by issuing multiple queries, using the startkey= and  
> limit= parameters. However, this seems to introduce race conditions  
> if another client is meanwhile altering the database. I might see  
> half of the documents before the change and half after. For example,  
> I might see a document show up twice with two different key values.
>
> Is there any way to avoid this inconsistency? In a SQL database I'd  
> use a transaction for this, to lock out any database updates in  
> between my series of SELECTs. But CouchDB's architecture doesn't  
> support that.
>
> It seems like what I want is to specify some kind of clock  
> (timestamp / revision #) in my view queries, so they all run over  
> the exact same view b-tree. This seems straightforward at the level  
> of the CouchDB file-format, since it's append-only and the previous  
> view b-tree still exists in the file. But is this exposed in the API  
> at all?

No, but it should be. I've been thinking about this for a while. Main  
complication is that the old seq might not be available if a view  
compaction completes in between queries.

Chris

>
> —Jens

Re: Consistent version of a view across multiple queries?

Posted by Robert Newson <ro...@gmail.com>.

You can query with stale=ok and the view won't change (as long as no
other call happens without stale=ok). You'll have to call without
stale=ok sometimes, though, so you'll still need to take care. Does
that help?
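As an illustration, this is roughly how the paged requests could be built with stale=ok on every page. It only constructs URLs and does not contact a server. The host, database, design document and view names are placeholders, and `view_url` is a hypothetical helper; `startkey`, `limit` and `stale=ok` are the parameters mentioned in this thread. Keys are JSON-encoded before being URL-encoded, since view keys are JSON values.

```python
import json
from urllib.parse import urlencode


def view_url(base, db, ddoc, view, startkey=None, limit=None, stale_ok=False):
    """Build a view query URL (placeholder names; no request is sent)."""
    params = {}
    if startkey is not None:
        params["startkey"] = json.dumps(startkey)  # keys are JSON values
    if limit is not None:
        params["limit"] = limit
    if stale_ok:
        params["stale"] = "ok"  # read the index as it is, without updating it
    path = f"{base}/{db}/_design/{ddoc}/_view/{view}"
    return f"{path}?{urlencode(params)}" if params else path


# The first request omits stale=ok so the index is brought up to date once;
# the following pages add stale=ok so they do not trigger another update.
first = view_url("http://localhost:5984", "mydb", "app", "by_key", limit=3)
later = view_url("http://localhost:5984", "mydb", "app", "by_key",
                 startkey="c", limit=3, stale_ok=True)
print(first)
print(later)
```

As noted above, this only holds while no other client queries the same view without stale=ok in the meantime.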

B.

On Fri, Feb 26, 2010 at 11:27 AM, Jens Alfke <je...@mooseyard.com> wrote:
> If an app wants to iterate over a large view, it seems better to page the output by issuing multiple queries, using the startkey= and limit= parameters. However, this seems to introduce race conditions if another client is meanwhile altering the database. I might see half of the documents before the change and half after. For example, I might see a document show up twice with two different key values.
>
> Is there any way to avoid this inconsistency? In a SQL database I'd use a transaction for this, to lock out any database updates in between my series of SELECTs. But CouchDB's architecture doesn't support that.
>
> It seems like what I want is to specify some kind of clock (timestamp / revision #) in my view queries, so they all run over the exact same view b-tree. This seems straightforward at the level of the CouchDB file-format, since it's append-only and the previous view b-tree still exists in the file. But is this exposed in the API at all?
>
> —Jens