Posted to dev@couchdb.apache.org by David Van Couvering <da...@vancouvering.com> on 2009/03/19 06:50:24 UTC

Bulk updates and eventual consistency

Hi, all.  I'm working on updating the Wiki to describe the new behavior of
bulk updates.

I read the very (very) long thread about Damien's change to the
transactional semantics around _bulk_docs, and I understand the situation
pretty well (I think).  But there's one part of the discussion that I wanted
to make sure I had correct.

My understanding is that one motivation for bulk update is that you may have
referential dependencies between docs.  If there are no conflicts, then
you can be assured those references will be consistent on the database where
you do the bulk update (with all-or-nothing), *but* they may not immediately
be consistent on replicas.  This is because a bulk update is not replicated
all-or-nothing; instead each document is replicated independently, in an
unspecified order.  So there will be a temporary state of affairs where the
references between documents may be inconsistent, but eventually they do
become consistent (for that particular bulk update).
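To make that concrete, here is a toy model of doc-at-a-time replication (the
document names and dict-based "databases" are made up for illustration; this
is not CouchDB code):

```python
# Toy model: a bulk update lands atomically on the source database,
# but replication copies each document independently, in an
# unspecified order, so a reader of the replica can observe an
# intermediate, referentially inconsistent state.

source = {}
target = {}

# All-or-nothing bulk update: both docs land together on the source.
source.update({
    "order:1": {"customer_ref": "customer:7"},
    "customer:7": {"name": "Ada"},
})

# Suppose replication happens to copy "order:1" to the replica first:
target["order:1"] = source["order:1"]

# At this instant a reader of the replica sees a dangling reference.
assert target["order:1"]["customer_ref"] not in target  # temporarily inconsistent

# Once the remaining doc arrives, the replica is consistent again.
target["customer:7"] = source["customer:7"]
assert target["order:1"]["customer_ref"] in target      # eventually consistent
```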

*But* if you *do* have conflicts in a bulk update, then it is quite possible
that the choice of winners for the conflict will cause a referential
inconsistency between documents.  In this case, the inconsistency will *not*
automatically become eventually consistent, but will require intervention by
the application to resolve the documents to a consistent state.
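A toy model of this case (the `pick_winner` rule below is an illustrative
stand-in for CouchDB's deterministic per-document winner selection --
roughly, the revision with the longer edit history wins, with ties broken by
comparing revision ids; all document names and revision ids are made up):

```python
# Conflict resolution is per document: the winner of each conflicted
# doc is chosen independently, with no knowledge of which bulk update
# a revision came from.

def pick_winner(revisions):
    # Stand-in for a deterministic per-document rule: longest edit
    # history wins, ties broken by comparing revision ids.
    return max(revisions, key=lambda r: (r["depth"], r["rev"]))

# Two sessions each bulk-updated the same pair of documents. Each
# session's pair is internally consistent ("expected" matches "value").
conflicts = {
    "parent": [
        {"depth": 2, "rev": "x9", "expected": "A"},  # session A's revision
        {"depth": 2, "rev": "c1", "expected": "B"},  # session B's revision
    ],
    "child": [
        {"depth": 2, "rev": "b2", "value": "A"},     # session A's revision
        {"depth": 2, "rev": "f4", "value": "B"},     # session B's revision
    ],
}

winners = {doc_id: pick_winner(revs) for doc_id, revs in conflicts.items()}

# The winner of "parent" came from session A, but the winner of "child"
# came from session B: the surviving pair is referentially inconsistent,
# and it stays that way until the application repairs it.
assert winners["parent"]["expected"] == "A"  # rev "x9" beats "c1"
assert winners["child"]["value"] == "B"      # rev "f4" beats "b2"
assert winners["parent"]["expected"] != winners["child"]["value"]
```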

This can happen even when you are not using replication at all, if you have
two simultaneous sessions updating the same document.

In the previous implementation, bulk_update rolled back if there were any
"local" conflicts, so you were guaranteed referential consistency between
docs on the database instance where you applied the bulk update.  However,
you could still end up in a pickle if replication caused a conflict -- then
you are back in the same place, with referential inconsistency that has to
be manually resolved.

Do I have that right?

I am uncomfortable asking the next question, as I feel I am opening up
a can of worms, but I am missing what problem was solved by allowing
all-or-nothing to succeed on conflicts.  It seems like in both models you
have eventual consistency and interim states where documents are
inconsistent, but at least with the old approach you were guaranteed
consistency on the database instance where you did the bulk update.  That
seems like it could be pretty handy, particularly for deployments where you
are not doing replication.

My apologies if this was already answered in that very long thread, but
perhaps someone can summarize for me...

Thanks,

David

-- 
David W. Van Couvering

I am looking for a senior position working on server-side Java systems.
 Feel free to contact me if you know of any opportunities.

http://www.linkedin.com/in/davidvc
http://davidvancouvering.blogspot.com
http://twitter.com/dcouvering

Re: Bulk updates and eventual consistency

Posted by Antony Blakey <an...@gmail.com>.
I am attempting to keep this:
http://github.com/AntonyBlakey/couchdb/tree/transactional_bulk_docs
reasonably up-to-date with trunk. It provides transactional _bulk_docs
if you add "fail_on_conflict": true to the top-level JSON body of the
request. It fails with a 419 if there are any conflicts (and makes no db
changes). The conflict data is threaded back to the HTTP response point,
with the intention of returning it as the response body, but I've not
done that yet. Patches welcome.

The mod is designed to be easy to maintain with respect to trunk.
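Based on that description, a request body to the patched _bulk_docs might
look roughly like this (the documents and revision ids are made up; the only
addition over a stock _bulk_docs body is the "fail_on_conflict" flag):

```json
{
  "fail_on_conflict": true,
  "docs": [
    { "_id": "order:1",    "_rev": "1-abc", "status": "paid" },
    { "_id": "customer:7", "_rev": "2-def", "name": "Ada" }
  ]
}
```

If any document in "docs" conflicts, the whole request fails with a 419 and
none of the changes are applied.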

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

He who would make his own liberty secure, must guard even his enemy  
from repression.
   -- Thomas Paine



Re: Bulk updates and eventual consistency

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Apr 1, 2009 at 5:23 AM, Hagen Overdick <si...@gmail.com> wrote:
>>
>> IMO this is a questionable decision, but I'm in the minority.
>
>
>  Guess, after much thought about this, I am joining the minority.
>
> I base my argumentation on this excellent paper:
> http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf
>
> In essence, Pat Helland recommends to identify _entities_ which represent
> the maximum scope of _local_ serializability. Thanks to the bulk update
> mechanism, this used to be a whole couchdb, with the changes given, an
> entity maps to a single document now.

Correct. CouchDB is a key/value store. A database is just a namespace
for keys, and the boundary of map/reduce operations.

> So, what's an entity for CouchDB? I very much prefer a whole db

It was perhaps a mistake, in terms of managing expectations, to expose
the bulk-transactions API. My impression is that the reason behind it was
that it made testing some low-level file behavior more convenient in the
short term.

To provide an alternate viewpoint on this question, I remember using
CouchDB _before_ bulk-docs became transactional, and being
disappointed that what used to be an easy way to get data into CouchDB
was now failing even if just one of my documents had a conflict. In
the old days, bulk-docs worked a lot like it does in 0.9, and I found
this more relaxing for my web spidering use case.

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Bulk updates and eventual consistency

Posted by Hagen Overdick <si...@gmail.com>.
>
> IMO this is a questionable decision, but I'm in the minority.


I guess, after much thought about this, I am joining the minority.

I base my argument on this excellent paper:
http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf

In essence, Pat Helland recommends identifying _entities_ which represent
the maximum scope of _local_ serializability. Thanks to the bulk update
mechanism, this used to be a whole CouchDB database; with the given changes,
an entity now maps to a single document.

The reason given here is sharding a single database, a concept I would
reject, because it breaks the idea of a database as an entity in the first
place. By the way, the reasoning that led to the removal of
bulk_transactions can be applied to single updates as well: there is just no
guarantee there won't be a conflicting update somewhere in the distributed
environment. Also, I don't really see how you would provide all_or_nothing
semantics over a sharded database.

So, what's an entity for CouchDB? I very much prefer a whole db, because I
can have partial updates (which is exactly what the old bulk_transaction
provided). I don't want to use this for referential integrity, but for local
serializability of updates. If you remove that, you will either force people
into bad design (keeping everything in a single document and eventually
asking for partial updates) or force them to replicate this functionality
outside of CouchDB, leading to ugly crutches.


Just my 2 Eurocents
Hagen
-- 
Dissertations are a successful walk through a minefield -- summarizing them
is not. - Roy Fielding

Re: Bulk updates and eventual consistency

Posted by David Van Couvering <da...@gmail.com>.
OK, thanks, that is clear.

It's sort of guaranteeing a "binary compatibility" between single node and
multi-node solutions, where you don't paint yourself into a corner when just
working in a single node.

David

On Thu, Mar 19, 2009 at 12:03 AM, Antony Blakey <an...@gmail.com> wrote:

>
> On 19/03/2009, at 4:20 PM, David Van Couvering wrote:
>
>  My apologies if this was already answered in that very long thread, but
>> perhaps someone can summarize for me...
>>
>
> It is intended that the difference between single-node and multi-node
> cluster operation not be exposed to clients, to ensure that there are no
> single-node-only applications which don't scale to clustered operation. This
> means that "deployments where you are not doing replication" isn't a
> relevant distinction as far as the CouchDB model is concerned.
>
> IMO this is a questionable decision, but I'm in the minority.
>
> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> The difference between ordinary and extraordinary is that little extra.
>
>
>


-- 
David W. Van Couvering

I am looking for a senior position working on server-side Java systems.
 Feel free to contact me if you know of any opportunities.

http://www.linkedin.com/in/davidvc
http://davidvancouvering.blogspot.com
http://twitter.com/dcouvering

Re: Bulk updates and eventual consistency

Posted by Antony Blakey <an...@gmail.com>.
On 19/03/2009, at 4:20 PM, David Van Couvering wrote:

> My apologies if this was already answered in that very long thread,  
> but
> perhaps someone can summarize for me...

It is intended that the difference between single-node and multi-node  
cluster operation not be exposed to clients, to ensure that there are  
no single-node-only applications which don't scale to clustered  
operation. This means that "deployments where you are not doing  
replication" isn't a relevant distinction as far as the CouchDB model  
is concerned.

IMO this is a questionable decision, but I'm in the minority.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

The difference between ordinary and extraordinary is that little extra.