Posted to dev@couchdb.apache.org by Yuval Kogman <no...@woobling.org> on 2009/05/21 13:00:45 UTC

reiterating transactions vs. replication

Hello,

In 0.9 CouchDB removed the transactional bulk docs feature in favour
of simplifying sharding/replication.

The priorities behind this decision as I understood them are:

    1. ensure that applications developed on a single server don't
suffer from a degradation of guarantees if deployed using sharding

    2. avoid the issues involved in supporting transactional bulk
operations across shards and replicas


I apologize if this proposal has already been dismissed before. I did
search the mailing list archives, but mostly found references to a
discussion on IRC about why this stuff should not be done. I blame Jan
for encouraging me to post ;-)



So anyway, I think that we can have both features without needing to
implement something like Paxos, and without silently breaking apps
when they move from a single-machine setup to a sharded one.


The basic idea is to keep the conflicts-are-data approach and the
current user-visible replication and conflict resolution, but to allow
the user to ask for stricter conflict checking.

The API I propose is for the bulk docs operation to have an optional
'atomic' flag. When this flag is set, CouchDB would atomically verify
that all documents can be committed without conflict (with respect to
the supplied _rev of each), and if any one document conflicts, mark
all of them as conflicting.
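
For concreteness, a rough sketch of such a request (the 'atomic' flag
is only what's proposed here, not an existing 0.9 API; the Python
'requests' usage, URLs and document IDs are just for illustration):

    import requests

    payload = {
        "atomic": True,   # the proposed flag; hypothetical, not in CouchDB 0.9
        "docs": [
            {"_id": "account-a", "_rev": "1-abc", "balance": 90},
            {"_id": "account-b", "_rev": "4-def", "balance": 110},
        ],
    }
    resp = requests.post("http://localhost:5984/db/_bulk_docs", json=payload)

    # Proposed semantics: either every document is committed against its
    # supplied _rev, or all of them are marked as conflicting.
    print(resp.status_code, resp.json())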

Transaction recovery, conflict resolution, etc. are still the
responsibility of the application, but this provides an atomic
guarantee that a transaction will fail as a whole if it tries to write
inconsistent data to the database, a guarantee that cannot be made
from a client library (there are race conditions).
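
To illustrate the race (a sketch only; fetch_rev and put_doc are
hypothetical stand-ins for the usual per-document HTTP calls):

    def client_side_bulk_update(db_url, docs, fetch_rev, put_doc):
        # step 1: check that every document is still at the rev we expect
        for doc in docs:
            if fetch_rev(db_url, doc["_id"]) != doc["_rev"]:
                raise ValueError("conflict on %s" % doc["_id"])
        # another writer can update any of these documents *here*,
        # between the check and the writes...
        for doc in docs:
            # ...so some writes may now fail while others succeed:
            # no atomicity, no isolation
            put_doc(db_url, doc)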



Now the hard parts:


1. Replication

The way I understand it, replication currently works on a per-document
basis. If 'atomic' was specified in a bulk operation, I propose that
all the revisions created in that bulk operation be kept linked. When
these linked revisions are replicated, the same conflict resolution
must be applied (the replication of the document itself is executed as
a bulk operation with atomic=true, replicating all associated
documents as well).

The caveat is that even if you always use bulk docs with the atomic
flag, if you switch replicas you could lose the D out of ACID:
documents which are marked as non-conflicting in your replica might be
conflicting in the replica you switch to, in which case transactions
that have already been committed appear to be rolled back from the
application's point of view.

This problem obviously already exists in the current implementation,
but when 'atomic' is specified it could potentially happen a lot more
often.


2. Data sharding

This one is trickier. There are two solutions, both of which I think
are acceptable, and either or both of which could be used:


The easy way is to ignore this problem. Well, not really: the client
must ensure that all the documents affected by a single transaction
are in the same shard, by using a partitioning scheme that allows
this.
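
For example (a minimal sketch under assumed conventions: the partition
key is a prefix of the document ID, and the shard count is fixed; both
are illustrative choices, not part of CouchDB):

    import hashlib

    NUM_SHARDS = 8

    def shard_for(doc_id):
        # "user42/profile" and "user42/friends" share the partition key
        # "user42", so a transaction touching only user42's documents
        # stays on one shard and can use atomic=true
        partition_key = doc_id.split("/", 1)[0]
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        return digest[0] % NUM_SHARDS

    assert shard_for("user42/profile") == shard_for("user42/friends")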

If a bulk_docs operation with atomic set to true would affect multiple
shards, that is an error (the data could still be written as a
conflict, of course).

If you want to enable the 'atomic' flag you'll need to be careful
about how you use sharding. You can still use the flag for some of the
transactions, just not for ones whose documents span shards. I think
this is a flexible and pragmatic solution.

This means that if you choose to opt in to fully atomic bulk doc
operations your app might not be deployable unmodified to a sharded
setup, but it's never unsafe (no data inconsistencies).

In my opinion this satisfies the requirement for no degradation of
guarantees. It might not Just Work, but at the end of the day you
can't have your cake and eat it too.




The second way is harder but potentially still interesting. I've
included it mostly for the sake of discussion.

The core idea is to provide low level primitives on top of which a
client or proxy can implement a multi phase commit protocol.

The number of nodes involved in the transaction depends on the data
in the transaction (it doesn't need to coordinate all the nodes in the
cluster).

Basically this would break down bulk doc calls into several steps.
First all the data is inserted into the backend, but it's marked as
conflicting so that it's not accidentally visible.

This operation returns an identifier for the bulk doc operation
(essentially a ticket for a prepared transaction).

Once the data is available on all the shards it must be made live
atomically. A two-phase commit starts by acquiring locks on all the
transaction tickets and trying to apply them (the 'promise' phase),
and then finalizing that application atomically (the 'accept' phase).

To keep things simple the two phases should be scoped to a single
keep-alive connection. If the connection drops, the locks should be
released.
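
As a sketch of what the coordinator (the client or a sharding proxy)
would do on top of these primitives; prepare, promise, accept and
abort here are the hypothetical per-shard operations described above,
not an existing API:

    def commit_transaction(docs_by_shard, prepare, promise, accept, abort):
        # phase 0: write the docs to each shard as not-yet-visible
        # (conflicting) data, getting back a ticket per shard
        tickets = {shard: prepare(shard, docs)
                   for shard, docs in docs_by_shard.items()}
        try:
            # 'promise' phase: lock every ticket and verify it still applies
            for shard, ticket in tickets.items():
                promise(shard, ticket)
            # 'accept' phase: atomically make the prepared revisions live
            for shard, ticket in tickets.items():
                accept(shard, ticket)
            return True
        except Exception:
            # any failure (or a dropped connection) releases the locks and
            # leaves the data behind as ordinary conflicts
            for shard, ticket in tickets.items():
                abort(shard, ticket)
            return False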

Obviously Paxos ensues, but here's the catch:

 - The synchronization can be implemented first as a 3rd party
component; it doesn't need to affect CouchDB's core

 - The atomic primitives are also useful for writing safe conflict
resolution tools that handle conflicts that span multiple documents.

So even if no one ends up implementing real Multi-Paxos, CouchDB
still benefits from having reusable synchronization primitives. (If
this is interesting to anyone, see below [1])




I'd like to stress that this is all possible to do on top of the
current 0.9 semantics. The default behavior in 0.9 does not change at
all. You have to opt in to this more complex behavior.

The problem with 0.9 is that there is no way to ensure atomicity and
isolation from a client library; it must be done on the server, so by
removing the ability to do this at all, CouchDB is essentially no
longer transactional. It's protected from internal data corruption,
and it's protected from data loss (unlike, say, MyISAM, which will
happily overwrite your correct data), but it's still a potentially
lossy model since conflicting revisions are not correlated. This means
that you cannot have a transactional graph model; it's one or the
other.



Finally, you should know that I have no personal stake in this. I
don't rely on CouchDB (yet), but I think it's somewhat relevant for a
project I work on, and that the requirements for this project are not
that far-fetched. I'm the author of an object graph storage engine for
Perl called KiokuDB. It serializes every object to a document unless
told otherwise, but the data is still a highly interconnected graph. As
a user of this system I care a lot about transactions, but not at all
about sharding (this might not hold for all the users of KiokuDB).
Basically I already have what I need from KiokuDB; there are numerous
backends for this system that satisfy me (BerkeleyDB, a transactional
plain file backend, DBI (PostgreSQL, SQLite, MySQL)), and a number
that don't fit my personal needs due to lack of transactions
(SimpleDB, MongoDB, and CouchDB since 0.9).

If things stay this way, then CouchDB is simply not intended for
users like me (though I'll certainly still maintain
KiokuDB::Backend::CouchDB).

However, I do *want* to use CouchDB. I think that under many scenarios
it has clear advantages compared to the other backends (mostly the
fact that it's so easy, but view support is also nice). I think it'd
be a shame if what was preventing me turned out to be a fix that was
low hanging fruit to which no one objected.


Regards,
Yuval



[1] In Paxos terms the CouchDB shards would play the Acceptor role and
the client (be it the actual client or a sharding proxy, whatever
delegates and consolidates the views) performs the Learner role.
Only the Learner is considered authoritative with respect to the final
status of a transaction.

Hard crashes of a shard during the 'accept' phase may produce
inconsistent results if more than one Learner is used to proxy write
operations. Focusing on this scenario is a *HUGE* pessimization. High
availability of the Learner role can still be achieved using
BerkeleyDB style master failover (voting).

This transactional sharding proxy could of course also guarantee
redundancy of shards.

My point is that Paxos can be implemented as a 3rd party component if
anyone actually wants/needs it, by providing comparatively simple
primitives.

Re: reiterating transactions vs. replication

Posted by Yuval Kogman <no...@woobling.org>.
2009/5/25 Antony Blakey <an...@gmail.com>:

> There is another solution to this problem that I posted to this list about
> 16 March 2009. In short:

Personally, at that point I would reassess my usage of CouchDB
altogether; it just doesn't seem appropriate anymore. Whether or not
CouchDB should be changed to fit this requirement is obviously the
topic of discussion, but in the short term I would probably give up
and try something else. This really reminds me of nested set or
polymorphism hacks in SQL.

Re: reiterating transactions vs. replication

Posted by Antony Blakey <an...@gmail.com>.
On 25/05/2009, at 3:01 PM, Scott Shumaker wrote:

> 'A' maintains an ordered list of 'B' elements, where the order is
> user-defined - imagine allowing a user to re-order photos in a
> slideshow.  You want to store the photos separately from the
> slideshow, because they can be used in multiple slideshows.  Whenever
> you add or remove a photo, you need to update the order list as well.
>
> I've seen some people suggest some sort of gross solution where you
> try to store floating point order id's inside the B elements and
> change that to wedge an item in between another items (averaging the
> two new siblings' orders), but this is incredibly brittle and breaks
> down with enough re-ordering.

There is another solution to this problem that I posted to this list  
about 16 March 2009. In short:

The general solution is to treat (in abstract) the ordering of items
as the in-order traversal of a binary tree. The brief form of the
algorithm is to record in each item the path from the top as two-bit
codes, e.g. 10 = termination, 01 = left, 11 = right. You then map that
bit sequence, padded with 0, to an encoded form that preserves the
ordering. Avoiding unbounded length requires a balanced tree, which
requires transactional support. It has the benefit of a low number of
documents touched per update (in an amortized sense).

By using 01 = left, 10 = termination, 11 = right, the length of the  
bit string becomes implicit (self-terminating) i.e. every pair  
contains a 1, and thus a function to compute an intermediate value  
given two *byte* sequences is easy. In practice you need to know the  
two adjacent values in order to avoid collisions, but you don't need  
to write to those documents.

This isn't brittle and won't break down, but as I say the insertion  
keys are unbounded.

----------------------------------------------------------------

I maintain a CouchDB fork with transactional bulk doc commits, but
they are really only useful to avoid requiring a merge interface for
local operations. CouchDB replication still generates conflicts. If,
however, you use a single write master, then transactional bulk doc
commits can eliminate all conflicts.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

It's amazing that the one side of the conversation that survived is "I  
don't know art, but I know what I like". The reply from the artist was  
"Madam, so does a cow".
   -- Carl Kirkendall



Re: reiterating transactions vs. replication

Posted by Paul Davis <pa...@gmail.com>.
On Mon, May 25, 2009 at 1:44 AM, Scott Shumaker <ss...@gmail.com> wrote:
> Chris - you said:
> "Distributed bulk transactions would make for chaotic behavior, as
> someone's mostly unrelated change on a remote node could eventually
> replicate to me (months later) and knock an entire line of work that
> I've done into a conflict state."
>
> Months later?  If you're really out of sync with a remote node for
> several months, you should expect to get lots of conflicts if people
> are editing your shared documents, with or without distributed bulk
> transactions.
>

I'm still mulling over Chris's concerns. My gut-brain tells me there's
a way around this but the idea hasn't floated up to the same brain
that controls actual thoughts.

> And if you really have a system designed to be offline for several
> months, you're solving very different problems (albeit interesting
> ones) than traditional data 'replication' as most people think about
> it (when one entity controls the servers, even if they are
> geographically distributed) - and getting into DVCS-like territory.
>

Heh. I think even in DVCS territory there'd still probably be a murder
if a dev didn't push for a couple months. Which is part of the idea
floating around. I.e., perhaps having a method for rejecting replication
that is too far out of whack (think of git's "You can't push because
it's not a fast-forward"). (This is all extreme hand waving, so no one
take that too seriously.)

> If that's the case, it does explain why there are some decisions in
> CouchDB I find strange, like the fact that authentication runs during
> replication.  You're using replication for a completely different
> purpose than I need it for - I need it for redundancy and read
> scaling, you're using it to synchronize disparate data sources.  Very
> different problems that call for very different solutions.
>


CouchDB has always had a focus on being able to be run in a completely
decentralized manner. As such there are features in CouchDB that
support this model. That said, it's also a goal of supporting as wide
a range of models as possible that fit into the general scheme of
things. For instance, if replication validation is something you want
to have configurable then you can create a ticket and dev@ thread
discussing the issue. I don't detect any issues in having that
disable-able but we should discuss that on a dedicated thread for
future-me's sanity.

HTH,
Paul Davis


> On Sun, May 24, 2009 at 10:31 PM, Scott Shumaker <ss...@gmail.com> wrote:
>> Inter-document dependencies come up pretty quickly when you start
>> trying to represent complex data structures in CouchDB.  There are
>> still a few cases we've encountered where there isn't a great way to
>> avoid needing transactions.  A few examples:
>>
>> 1)
>> 'A' maintains an ordered list of 'B' elements, where the order is
>> user-defined - imagine allowing a user to re-order photos in a
>> slideshow.  You want to store the photos separately from the
>> slideshow, because they can be used in multiple slideshows.  Whenever
>> you add or remove a photo, you need to update the order list as well.
>>
>> I've seen some people suggest some sort of gross solution where you
>> try to store floating point order id's inside the B elements and
>> change that to wedge an item in between another items (averaging the
>> two new siblings' orders), but this is incredibly brittle and breaks
>> down with enough re-ordering.
>>
>> Another unpleasant approach is to create separate 'order objects' in
>> couchdb (representing the order of an item within a folder), storing
>> an internal version number (timestamp) inside the 'order object' - so
>> you never change the order node, you just create a new node.  Then,
>> you only use use the 'latest' version of this order node (either on
>> the client side or with a reduce).  To make this work, you need to
>> ensure that your 'internal version numbers' are monotonically
>> increasing.  This isn't a problem for some applications, and can be
>> solved in general with a specialized 'number server'.
>>
>> 2)
>> Representing graph/linked-list datastructures.
>>
>> If you delete a node from a linked list, you need to update two nodes
>> - the previous node and the node itself.  You can try the second
>> suggestion in the previous item to make this work (effectively, store
>> the link relationship as separate objects and generate new link
>> objects with incrementing version numbers)
>>
>> I'm sure there are other cases - these two just have been a thorn in
>> our side.  But for a lot of non-trivial data applications,
>> transactions still end up being very useful.
>>
>> On Fri, May 22, 2009 at 2:45 PM, Nathan Stott <nr...@gmail.com> wrote:
>>> As a user, when I chose couchdb for my most recent project, I chose it
>>> because I didn't care about transactions.  I would've used RDBMS if that
>>> were important.
>>> I chose it because couch solved the problems I needed solved very well.
>>>
>>> I don't think transactions should be a big dev focus.
>>>
>>> On Fri, May 22, 2009 at 4:30 PM, Chris Anderson <jc...@apache.org> wrote:
>>>
>>>> On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman <no...@woobling.org>
>>>> wrote:
>>>> > 2009/5/21 Adam Kocoloski <ko...@apache.org>:
>>>> >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
>>>> >> rehash all the discussion from back in February (see the thread
>>>> beginning at
>>>> >>
>>>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e
>>>>  for
>>>> >> a particularly detailed discussion), but I do want to comment on one
>>>> aspect.
>>>> >>
>>>> >> Updating the replicator to be smart about atomic bulk transactions is
>>>> doable
>>>> >> (although a major undertaking), but when you throw DB compaction and
>>>> >> revision stemming into the mix things get really hairy.  Recall that
>>>> CouchDB
>>>> >> revisions are used for concurrency control, not for maintaining history.
>>>> >>  Consider the following sequence of events:
>>>> >>
>>>> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>>>> >> 2) Update foo -> foo/2
>>>> >> Compact the DB (foo/1 is deleted)
>>>> >> Start replicating to a mirror
>>>> >> Replication crashes before it reaches foo/2
>>>> >
>>>> > By crash you mean an error due to a conflict between foo/2 and foo/1'
>>>> > (the mirror's version of foo), right?
>>>> >
>>>> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>>>>  I
>>>> >> think this means we'd need to modify the compaction algorithm to keep
>>>> >> revisions of documents if a) the revision was part of an atomic
>>>> _bulk_docs,
>>>> >> and b) any of the documents in that transaction are still at the
>>>> revision
>>>> >> generated by the transaction.  Same thing goes for revision stemming --
>>>> we
>>>> >> can never drop revisions if they were part of an atomic upload and at
>>>> least
>>>> >> one of the document revs in the upload is still current.
>>>> >
>>>> > Yep. Personally I see this is a tradeoff, not a limitation per se. If
>>>> > you specify 'atomic' then you must pay more in terms of data size,
>>>> > performance, etc.
>>>>
>>>> The problem as I see it is that someone else's bulk transaction will
>>>> have to sit around in my database, until I edit all the docs in it.
>>>> Hopefully I won't get any distributed conflicts on other old versions
>>>> of docs in the group, because this would make edits that I've done
>>>> locally to other documents in the bulk group somehow less valid.
>>>>
>>>> Distributed bulk transactions would make for chaotic behavior, as
>>>> someone's mostly unrelated change on a remote node could eventually
>>>> replicate to me (months later) and knock an entire line of work that
>>>> I've done into a conflict state.
>>>>
>>>> If you want atomicity, put it in a single document.
>>>>
>>>> Chris
>>>>
>>>> >
>>>> > In 0.8 you would have theoretically had to pay by default, but didn't
>>>> > because replication broke transactions.
>>>> >
>>>> > The basic algorithm is still the same, but the garbage collected unit
>>>> > is changed (instead of garbage collecting document revisions it
>>>> > garbage collects revision sets, with the current case being a set with
>>>> > one member. The rules still apply (if this object is wholly shadowed
>>>> > by non conflicting changes then it can be disposed of)). IIRC the
>>>> > algorithm is a copying garbage collector, so this is pretty easy to do
>>>> > (you walk a DAG instead of a linked list).
>>>> >
>>>> > Under the proposed model you'd choose which operations are
>>>> > transactional and will have to pay for those.
>>>> >
>>>> >
>>>> > Anyway, thanks for your link as well, I was reading through a rather
>>>> > boring thread and didn't see this one, so I guess I did miss out. It
>>>> > seemed to imply the discussion was done only on IRC.
>>>> >
>>>> > Anyway, here goes...
>>>> >
>>>> > The fundamental problem is that any consistent data model needs at the
>>>> > very least to have atomic primitives and ordered message passing (with
>>>> > transactional message handlers) at the per-partition level, or
>>>> > atomicity and consistency is restricted to a single document.
>>>> >
>>>> > What concerns me is Damien's post
>>>> > (
>>>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e
>>>> ):
>>>> >
>>>> >> No, CouchDB replication doesn't support replicating the transactions.
>>>> >> Never has, never will. That's more like transaction log replication
>>>> >> that's in traditional dbs, a different beast.
>>>> >>
>>>> >> For the new bulk transaction model, I'm only proposing supporting
>>>> >> eventual consistency. All changes are safe to disk, but the db may not
>>>> >> be in a consistent state right away.
>>>> >
>>>> > From what I know this assumption is wrong. Eventual consistency still
>>>> > needs atomic primitives, it's not about whether or not you have
>>>> > transactions, it's about what data they affect (eventual consistency
>>>> > involves breaking them down).
>>>> >
>>>> > Anyway, "never will" sounds pretty binding, but for the sake of argument:
>>>> >
>>>> > By using only insertions and idempotent updates for the bulk of the
>>>> > data changes and a message queue whose handlers use atomic updates to
>>>> > integrate this data one can implement a truly atomic distributed
>>>> > model, or an eventual consistency, but without this updates need to be
>>>> > restricted to exactly one document.
>>>> >
>>>> > Eventual consistency is still possible using either locks or by
>>>> > breaking down what would have been large distributed transactions into
>>>> > smaller ones, but the key is that the code that will make things
>>>> > actually consistent must still have ACID guarantees (and be dispatched
>>>> > in order).
>>>> >
>>>> > The 0.9 model CouchDB is effectively MyISAM without data loss, but
>>>> > just because the data is around doesn't mean it's possible to know
>>>> > what to do with it (loss of context), or even fix it safely (the
>>>> > conflict resolution code is susceptible to conflicts too).
>>>> >
>>>> > Unfortunately for eventual consistency to actually work the breaking
>>>> > down of operations must be done on application level, the database
>>>> > can't decide which data can be deferred and which data cannot.
>>>> >
>>>> > All immutable data and all new data can obviously be added to the
>>>> > database outside of a transaction, but eventually a transaction
>>>> > linking this data must be part of an atomic mutation.
>>>> >
>>>> > The only way to support this without atomic operations on a unit
>>>> > larger than a document, is to have a "master" document for every
>>>> > transitive closure the graph structure requiring consistency, which in
>>>> > effect only actually relates to immutable snapshot documents (e.g.
>>>> > where the ID is a hash of the data). If these closures overlap then a
>>>> > single "master" for the whole graph will be needed.
>>>> >
>>>> >
>>>> > To illustrate, let's make up a social networking example. Let's say
>>>> > you are adding a friend on this social network, and that this
>>>> > operation involves 3 updates, one to add a link from your profile to
>>>> > your friend's ID, another for the inverse, and a third update to
>>>> > send a "hello" message to the friend, updating their inbox.
>>>> > The first update lives in one partition, and the second and third
>>>> > updates are on a second one.
>>>> >
>>>> > The back pointers in your new friends must be updated. In a fully
>>>> > transactional model this would lock the friend's document and yours at
>>>> > the same time, in an eventual consistency model this would queue a
>>>> > message for the friend's partition, and a message handler on the
>>>> > friend's partition would update this atomically "eventually". It's
>>>> > fine for the link to be out of date for a while, but eventually it
>>>> > needs to be fixed (e.g. if you want to remove the friend, message
>>>> > them, etc).
>>>> >
>>>> > In couchdb 0.9 one of the writes will get a "conflict" error back, and
>>>> > they could refetch the updated version and try the edit again. The
>>>> > problem is that if they wrote the third update to another
>>>> > document on the same node making assumptions about the same data, that
>>>> > write may have succeeded, leaving the data inconsistent. Under an
>>>> > eventual consistency model you still use transactions to do these
>>>> > updates, you just must design your model to break them down into
>>>> > smaller units.
>>>> >
>>>> > The reason a graph structure is more susceptible to inconsistency is
>>>> > that while in a relational model many data linkage operations can be
>>>> > done with a single insert/update (e.g. `insert into edges (node1_id,
>>>> > node2_id)`), in a document-based database this type of operation
>>>> > involves modifying all the affected documents. The chance of
>>>> > inconsistency is increased because contention is higher and there is
>>>> > more data that must be synchronized.
>>>> >
>>>> > However, in another post Damien said:
>>>> >
>>>> >> Which is why in general you want to avoid inter-document dependencies,
>>>> >> or be relaxed in how you deal with them.
>>>> >
>>>> > So I think I best shut up after this without some decision maker
>>>> > telling me not to, if my use case is not covered by the intended
>>>> > design then that's that, but I do think this thread sort of covers
>>>> > this:
>>>> >
>>>> >> As far as distributed transactions go, I'd be thrilled if we could
>>>> >> implement it and also support the rest of couchdb, like views and bi-
>>>> >> directional replication. Please start up a discussion here in dev@
>>>> >> about it and see if you can work out a design.
>>>> >
>>>> > Without going too pie-in-the-sky.
>>>> >
>>>> > Cheers,
>>>> > Yuval
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Chris Anderson
>>>> http://jchrisa.net
>>>> http://couch.io
>>>>
>>>
>>
>

Re: reiterating transactions vs. replication

Posted by Yuval Kogman <no...@woobling.org>.
2009/5/25 Chris Anderson <jc...@apache.org>:

> Right, but at least in those cases I can diff the document to figure
> out what the conflict is. The bulk transactions you describe could
> become conflicting and before I could save the doc I'm working on, I'd
> have to figure which other doc was causing the conflict.

The idea is that you can ask for a conflict earlier if that's what's
going to help. If you need more context then arguably there should be
a way to ask for it to not be thrown away.

If you have a conflict in the aforementioned sorted collections and/or
graph nodes you need to know the state of other documents to merge
correctly.

I still think this is out of scope anyway. For a proper merge you'd
want the common ancestor documents as well, not just the symmetric
difference. This is best done by implementing a versioning model on
top of couchdb. But to implement this model consistently you arguably
still need atomic primitives.

> I'm not sure
> why not to just call the larger unit of data a single document, if
> that's how you want to use it.

So basically instead of using multiple documents all of my data would
go in one document? Why didn't I think of that ;-)

Re: reiterating transactions vs. replication

Posted by Chris Anderson <jc...@apache.org>.
On Sun, May 24, 2009 at 10:44 PM, Scott Shumaker <ss...@gmail.com> wrote:
> Chris - you said:
> "Distributed bulk transactions would make for chaotic behavior, as
> someone's mostly unrelated change on a remote node could eventually
> replicate to me (months later) and knock an entire line of work that
> I've done into a conflict state."
>
> Months later?  If you're really out of sync with a remote node for
> several months, you should expect to get lots of conflicts if people
> are editing your shared documents, with or without distributed bulk
> transactions.

Right, but at least in those cases I can diff the document to figure
out what the conflict is. The bulk transactions you describe could
become conflicting and before I could save the doc I'm working on, I'd
have to figure out which other doc was causing the conflict. I'm not sure
why not to just call the larger unit of data a single document, if
that's how you want to use it.

>
> And if you really have a system designed to be offline for several
> months, you're solving very different problems (albeit interesting
> ones) than traditional data 'replication' as most people think about
> it (when one entity controls the servers, even if they are
> geographically distributed) - and getting into DVCS-like territory.

The idea is to support offline access to distributed data across all
devices. Some devices are very small (like a phone) and others are
very large (like a datacenter). It is our aim to provide a unified API
across all of them.


>
> If that's the case, it does explain why there are some decisions in
> CouchDB I find strange, like the fact that authentication runs during
> replication.

Auth during replication has to happen because as far as CouchDB's
concerned replication is just another client. Having the same API for
usage and replication is part of what keeps CouchDB simple.

> You're using replication for a completely different
> purpose than I need it for - I need it for redundancy and read
> scaling, you're using it to synchronize disparate data sources.  Very
> different problems that call for very different solutions.
>
>
> On Sun, May 24, 2009 at 10:31 PM, Scott Shumaker <ss...@gmail.com> wrote:
>> Inter-document dependencies come up pretty quickly when you start
>> trying to represent complex data structures in CouchDB.  There are
>> still a few cases we've encountered where there isn't a great way to
>> avoid needing transactions.  A few examples:
>>
>> 1)
>> 'A' maintains an ordered list of 'B' elements, where the order is
>> user-defined - imagine allowing a user to re-order photos in a
>> slideshow.  You want to store the photos separately from the
>> slideshow, because they can be used in multiple slideshows.  Whenever
>> you add or remove a photo, you need to update the order list as well.
>>
>> I've seen some people suggest some sort of gross solution where you
>> try to store floating point order id's inside the B elements and
>> change that to wedge an item in between another items (averaging the
>> two new siblings' orders), but this is incredibly brittle and breaks
>> down with enough re-ordering.
>>
>> Another unpleasant approach is to create separate 'order objects' in
>> couchdb (representing the order of an item within a folder), storing
>> an internal version number (timestamp) inside the 'order object' - so
>> you never change the order node, you just create a new node.  Then,
>> you only use use the 'latest' version of this order node (either on
>> the client side or with a reduce).  To make this work, you need to
>> ensure that your 'internal version numbers' are monotonically
>> increasing.  This isn't a problem for some applications, and can be
>> solved in general with a specialized 'number server'.
>>
>> 2)
>> Representing graph/linked-list datastructures.
>>
>> If you delete a node from a linked list, you need to update two nodes
>> - the previous node and the node itself.  You can try the second
>> suggestion in the previous item to make this work (effectively, store
>> the link relationship as separate objects and generate new link
>> objects with incrementing version numbers)
>>
>> I'm sure there are other cases - these two just have been a thorn in
>> our side.  But for a lot of non-trivial data applications,
>> transactions still end up being very useful.
>>
>> On Fri, May 22, 2009 at 2:45 PM, Nathan Stott <nr...@gmail.com> wrote:
>>> As a user, when I chose couchdb for my most recent project, I chose it
>>> because I didn't care about transactions.  I would've used RDBMS if that
>>> were important.
>>> I chose it because couch solved the problems I needed solved very well.
>>>
>>> I don't think transactions should be a big dev focus.
>>>
>>> On Fri, May 22, 2009 at 4:30 PM, Chris Anderson <jc...@apache.org> wrote:
>>>
>>>> On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman <no...@woobling.org>
>>>> wrote:
>>>> > 2009/5/21 Adam Kocoloski <ko...@apache.org>:
>>>> >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
>>>> >> rehash all the discussion from back in February (see the thread
>>>> beginning at
>>>> >>
>>>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e
>>>>  for
>>>> >> a particularly detailed discussion), but I do want to comment on one
>>>> aspect.
>>>> >>
>>>> >> Updating the replicator to be smart about atomic bulk transactions is
>>>> doable
>>>> >> (although a major undertaking), but when you throw DB compaction and
>>>> >> revision stemming into the mix things get really hairy.  Recall that
>>>> CouchDB
>>>> >> revisions are used for concurrency control, not for maintaining history.
>>>> >>  Consider the following sequence of events:
>>>> >>
>>>> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>>>> >> 2) Update foo -> foo/2
>>>> >> Compact the DB (foo/1 is deleted)
>>>> >> Start replicating to a mirror
>>>> >> Replication crashes before it reaches foo/2
>>>> >
>>>> > By crash you mean an error due to a conflict between foo/2 and foo/1'
>>>> > (the mirror's version of foo), right?
>>>> >
>>>> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>>>>  I
>>>> >> think this means we'd need to modify the compaction algorithm to keep
>>>> >> revisions of documents if a) the revision was part of an atomic
>>>> _bulk_docs,
>>>> >> and b) any of the documents in that transaction are still at the
>>>> revision
>>>> >> generated by the transaction.  Same thing goes for revision stemming --
>>>> we
>>>> >> can never drop revisions if they were part of an atomic upload and at
>>>> least
>>>> >> one of the document revs in the upload is still current.
>>>> >
>>>> > Yep. Personally I see this is a tradeoff, not a limitation per se. If
>>>> > you specify 'atomic' then you must pay more in terms of data size,
>>>> > performance, etc.
>>>>
>>>> The problem as I see it is that someone else's bulk transaction will
>>>> have to sit around in my database, until I edit all the docs in it.
>>>> Hopefully I won't get any distributed conflicts on other old versions
>>>> of docs in the group, because this would make edits that I've done
>>>> locally to other documents in the bulk group somehow less valid.
>>>>
>>>> Distributed bulk transactions would make for chaotic behavior, as
>>>> someone's mostly unrelated change on a remote node could eventually
>>>> replicate to me (months later) and knock an entire line of work that
>>>> I've done into a conflict state.
>>>>
>>>> If you want atomicity, put it in a single document.
>>>>
>>>> Chris
>>>>
>>>> >
>>>> > In 0.8 you would have theoretically had to pay by default, but didn't
>>>> > because replication broke transactions.
>>>> >
>>>> > The basic algorithm is still the same, but the garbage collected unit
>>>> > is changed (instead of garbage collecting document revisions it
>>>> > garbage collects revision sets, with the current case being a set with
>>>> > one member. The rules still apply (if this object is wholly shadowed
>>>> > by non conflicting changes then it can be disposed of)). IIRC the
>>>> > algorithm is a copying garbage collector, so this is pretty easy to do
>>>> > (you walk a DAG instead of a linked list).
>>>> >
>>>> > Under the proposed model you'd choose which operations are
>>>> > transactional and will have to pay for those.
>>>> >
>>>> >
>>>> > Anyway, thanks for your link as well, I was reading through a rather
>>>> > boring thread and didn't see this one, so I guess I did miss out. It
>>>> > seemed to imply the discussion was done only on IRC.
>>>> >
>>>> > Anyway, here goes...
>>>> >
>>>> > The fundamental problem is that any consistent data model needs at the
>>>> > very least to have atomic primitives and ordered message passing (with
>>>> > transactional message handlers) at the per-partition level, or
>>>> > atomicity and consistency is restricted to a single document.
>>>> >
>>>> > What concerns me is Damien's post
>>>> > (
>>>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e
>>>> ):
>>>> >
>>>> >> No, CouchDB replication doesn't support replicating the transactions.
>>>> >> Never has, never will. That's more like transaction log replication
>>>> >> that's in traditional dbs, a different beast.
>>>> >>
>>>> >> For the new bulk transaction model, I'm only proposing supporting
>>>> >> eventual consistency. All changes are safe to disk, but the db may not
>>>> >> be in a consistent state right away.
>>>> >
>>>> > From what I know this assumption is wrong. Eventual consistency still
>>>> > needs atomic primitives, it's not about whether or not you have
>>>> > transactions, it's about what data they affect (eventual consistency
>>>> > involves breaking them down).
>>>> >
>>>> > Anyway, "never will" sounds pretty binding, but for the sake of argument:
>>>> >
>>>> > By using only insertions and idempotent updates for the bulk of the
>>>> > data changes and a message queue whose handlers use atomic updates to
>>>> > integrate this data one can implement a truly atomic distributed
>>>> > model, or an eventual consistency, but without this updates need to be
>>>> > restricted to exactly one document.
>>>> >
>>>> > Eventual consistency is still possible using either locks or by
>>>> > breaking down what would have been large distributed transactions into
>>>> > smaller ones, but the key is that the code that will make things
>>>> > actually consistent must still have ACID guarantees (and be dispatched
>>>> > in order).
>>>> >
>>>> > The 0.9 model CouchDB is effectively MyISAM without data loss, but
>>>> > just because the data is around doesn't mean it's possible to know
>>>> > what to do with it (loss of context), or even fix it safely (the
>>>> > conflict resolution code is susceptible to conflicts too).
>>>> >
>>>> > Unfortunately for eventual consistency to actually work the breaking
>>>> > down of operations must be done on application level, the database
>>>> > can't decide which data can be deferred and which data cannot.
>>>> >
>>>> > All immutable data and all new data can obviously be added to the
>>>> > database outside of a transaction, but eventually a transaction
>>>> > linking this data must be part of an atomic mutation.
>>>> >
>>>> > The only way to support this without atomic operations on a unit
>>>> > larger than a document, is to have a "master" document for every
>>>> > transitive closure the graph structure requiring consistency, which in
>>>> > effect only actually relates to immutable snapshot documents (e.g.
>>>> > where the ID is a hash of the data). If these closures overlap then a
>>>> > single "master" for the whole graph will be needed.
>>>> >
>>>> >
>>>> > To illustrate, let's make up a social networking example. Let's say
>>>> > you are adding a friend on this social network, and that this
>>>> > operation involves 3 updates, one to add a link from your profile to
>>>> > your friend's ID, another for the inverse, and a third update to
>>>> > send a "hello" message to the friend, updating their inbox.
>>>> > The first update lives in one partition, and the second and third
>>>> > updates are on a second one.
>>>> >
>>>> > The back pointers in your new friends must be updated. In a fully
>>>> > transactional model this would lock the friend's document and yours at
>>>> > the same time, in an eventual consistency model this would queue a
>>>> > message for the friend's partition, and a message handler on the
>>>> > friend's partition would update this atomically "eventually". It's
>>>> > fine for the link to be out of date for a while, but eventually it
>>>> > needs to be fixed (e.g. if you want to remove the friend, message
>>>> > them, etc).
>>>> >
>>>> > In couchdb 0.9 one of the writes will get a "conflict" error back, and
>>>> > they could refetch the updated version and try the edit again. The
>>>> > problem is that if they wrote the third update to another
>>>> > document on the same node making assumptions about the same data, that
>>>> > write may have succeeded, leaving the data inconsistent. Under an
>>>> > eventual consistency model you still use transactions to do these
>>>> > updates, you just must design your model to break them down into
>>>> > smaller units.
>>>> >
>>>> > The reason a graph structure is more susceptible to inconsistency is
>>>> > that while in a relational model many data linkage operations can be
>>>> > done with a single insert/update (e.g. `insert into edges (node1_id,
>>>> > node2_id)`), in a document-based database this type of operation
>>>> > involves modifying all the affected documents. The chance of
>>>> > inconsistency is increased because contention is higher and there is
>>>> > more data that must be synchronized.
>>>> >
>>>> > However, in another post Damien said:
>>>> >
>>>> >> Which is why in general you want to avoid inter-document dependencies,
>>>> >> or be relaxed in how you deal with them.
>>>> >
>>>> > So I think I best shut up after this without some decision maker
>>>> > telling me not to, if my use case is not covered by the intended
>>>> > design then that's that, but I do think this thread sort of covers
>>>> > this:
>>>> >
>>>> >> As far as distributed transactions go, I'd be thrilled if we could
>>>> >> implement it and also support the rest of couchdb, like views and bi-
>>>> >> directional replication. Please start up a discussion here in dev@
>>>> >> about it and see if you can work out a design.
>>>> >
>>>> > Without going too pie-in-the-sky.
>>>> >
>>>> > Cheers,
>>>> > Yuval
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Chris Anderson
>>>> http://jchrisa.net
>>>> http://couch.io
>>>>
>>>
>>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: reiterating transactions vs. replication

Posted by Randall Leeds <ra...@gmail.com>.
On Mon, May 25, 2009 at 01:44, Scott Shumaker <ss...@gmail.com> wrote:

> If that's the case, it does explain why there are some decisions in
> CouchDB I find strange, like the fact that authentication runs during
> replication.  You're using replication for a completely different
> purpose than I need it for - I need it for redundancy and read
> scaling, you're using it to synchronize disparate data sources.  Very
> different problems that call for very different solutions.


I called out this distinction between usages in my message above.
If the replication functionality were "directed" or overseen by the
sharding/partitioning code in the data center deployment, I think we could
totally prevent replication conflicts *caused by* the partitioning code.
However, replicating to/from the data center could still create conflicts in
the normal way.

Re: reiterating transactions vs. replication

Posted by Scott Shumaker <ss...@gmail.com>.
Chris - you said:
"Distributed bulk transactions would make for chaotic behavior, as
someone's mostly unrelated change on a remote node could eventually
replicate to me (months later) and knock an entire line of work that
I've done into a conflict state."

Months later?  If you're really out of sync with a remote node for
several months, you should expect to get lots of conflicts if people
are editing your shared documents, with or without distributed bulk
transactions.

And if you really have a system designed to be offline for several
months, you're solving very different problems (albeit interesting
ones) than traditional data 'replication' as most people think about
it (when one entity controls the servers, even if they are
geographically distributed) - and getting into DVCS-like territory.

If that's the case, it does explain why there are some decisions in
CouchDB I find strange, like the fact that authentication runs during
replication.  You're using replication for a completely different
purpose than I need it for - I need it for redundancy and read
scaling, you're using it to synchronize disparate data sources.  Very
different problems that call for very different solutions.


On Sun, May 24, 2009 at 10:31 PM, Scott Shumaker <ss...@gmail.com> wrote:
> Inter-document dependencies come up pretty quickly when you start
> trying to represent complex data structures in CouchDB.  There are
> still a few cases we've encountered where there isn't a great way to
> avoid needing transactions.  A few examples:
>
> 1)
> 'A' maintains an ordered list of 'B' elements, where the order is
> user-defined - imagine allowing a user to re-order photos in a
> slideshow.  You want to store the photos separately from the
> slideshow, because they can be used in multiple slideshows.  Whenever
> you add or remove a photo, you need to update the order list as well.
>
> I've seen some people suggest some sort of gross solution where you
> try to store floating point order id's inside the B elements and
> change that to wedge an item in between another items (averaging the
> two new siblings' orders), but this is incredibly brittle and breaks
> down with enough re-ordering.
>
> Another unpleasant approach is to create separate 'order objects' in
> couchdb (representing the order of an item within a folder), storing
> an internal version number (timestamp) inside the 'order object' - so
> you never change the order node, you just create a new node.  Then,
> you only use use the 'latest' version of this order node (either on
> the client side or with a reduce).  To make this work, you need to
> ensure that your 'internal version numbers' are monotonically
> increasing.  This isn't a problem for some applications, and can be
> solved in general with a specialized 'number server'.
>
> 2)
> Representing graph/linked-list datastructures.
>
> If you delete a node from a linked list, you need to update two nodes
> - the previous node and the node itself.  You can try the second
> suggestion in the previous item to make this work (effectively, store
> the link relationship as separate objects and generate new link
> objects with incrementing version numbers)
>
> I'm sure there are other cases - these two just have been a thorn in
> our side.  But for a lot of non-trivial data applications,
> transactions still end up being very useful.
>
> On Fri, May 22, 2009 at 2:45 PM, Nathan Stott <nr...@gmail.com> wrote:
>> As a user, when I chose couchdb for my most recent project, I chose it
>> because I didn't care about transactions.  I would've used RDBMS if that
>> were important.
>> I chose it because couch solved the problems I needed solved very well.
>>
>> I don't think transactions should be a big dev focus.
>>
>> On Fri, May 22, 2009 at 4:30 PM, Chris Anderson <jc...@apache.org> wrote:
>>
>>> On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman <no...@woobling.org>
>>> wrote:
>>> > 2009/5/21 Adam Kocoloski <ko...@apache.org>:
>>> >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
>>> >> rehash all the discussion from back in February (see the thread
>>> beginning at
>>> >>
>>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e
>>>  for
>>> >> a particularly detailed discussion), but I do want to comment on one
>>> aspect.
>>> >>
>>> >> Updating the replicator to be smart about atomic bulk transactions is
>>> doable
>>> >> (although a major undertaking), but when you throw DB compaction and
>>> >> revision stemming into the mix things get really hairy.  Recall that
>>> CouchDB
>>> >> revisions are used for concurrency control, not for maintaining history.
>>> >>  Consider the following sequence of events:
>>> >>
>>> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>>> >> 2) Update foo -> foo/2
>>> >> Compact the DB (foo/1 is deleted)
>>> >> Start replicating to a mirror
>>> >> Replication crashes before it reaches foo/2
>>> >
>>> > By crash you mean an error due to a conflict between foo/2 and foo/1'
>>> > (the mirror's version of foo), right?
>>> >
>>> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>>>  I
>>> >> think this means we'd need to modify the compaction algorithm to keep
>>> >> revisions of documents if a) the revision was part of an atomic
>>> _bulk_docs,
>>> >> and b) any of the documents in that transaction are still at the
>>> revision
>>> >> generated by the transaction.  Same thing goes for revision stemming --
>>> we
>>> >> can never drop revisions if they were part of an atomic upload and at
>>> least
>>> >> one of the document revs in the upload is still current.
>>> >
>>> > Yep. Personally I see this is a tradeoff, not a limitation per se. If
>>> > you specify 'atomic' then you must pay more in terms of data size,
>>> > performance, etc.
>>>
>>> The problem as I see it is that someone else's bulk transaction will
>>> have to sit around in my database, until I edit all the docs in it.
>>> Hopefully I won't get any distributed conflicts on other old versions
>>> of docs in the group, because this would make edits that I've done
>>> locally to other documents in the bulk group somehow less valid.
>>>
>>> Distributed bulk transactions would make for chaotic behavior, as
>>> someone's mostly unrelated change on a remote node could eventually
>>> replicate to me (months later) and knock an entire line of work that
>>> I've done into a conflict state.
>>>
>>> If you want atomicity, put it in a single document.
>>>
>>> Chris
>>>
>>> >
>>> > In 0.8 you would have theoretically had to pay by default, but didn't
>>> > because replication broke transactions.
>>> >
>>> > The basic algorithm is still the same, but the garbage collected unit
>>> > is changed (instead of garbage collecting document revisions it
>>> > garbage collects revision sets, with the current case being a set with
>>> > one member. The rules still apply (if this object is wholly shadowed
>>> > by non conflicting changes then it can be disposed of)). IIRC the
>>> > algorithm is a copying garbage collector, so this is pretty easy to do
>>> > (you walk a DAG instead of a linked list).
>>> >
>>> > Under the proposed model you'd choose which operations are
>>> > transactional and will have to pay for those.
>>> >
>>> >
>>> > Anyway, thanks for your link as well, I was reading through a rather
>>> > boring thread and didn't see this one, so I guess I did miss out. It
>>> > seemed to imply the discussion was done only on IRC.
>>> >
>>> > Anyway, here goes...
>>> >
>>> > The fundamental problem is that any consistent data model needs at the
>>> > very least to have atomic primitives and ordered message passing (with
>>> > transactional message handlers) at the per-partition level, or
>>> > atomicity and consistency is restricted to a single document.
>>> >
>>> > What concerns me is Damien's post
>>> > (
>>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e
>>> ):
>>> >
>>> >> No, CouchDB replication doesn't support replicating the transactions.
>>> >> Never has, never will. That's more like transaction log replication
>>> >> that's in traditional dbs, a different beast.
>>> >>
>>> >> For the new bulk transaction model, I'm only proposing supporting
>>> >> eventual consistency. All changes are safe to disk, but the db may not
>>> >> be in a consistent state right away.
>>> >
>>> > From what I know this assumption is wrong. Eventual consistency still
>>> > needs atomic primitives, it's not about whether or not you have
>>> > transactions, it's about what data they affect (eventual consistency
>>> > involves breaking them down).
>>> >
>>> > Anyway, "never will" sounds pretty binding, but for the sake of argument:
>>> >
>>> > By using only insertions and idempotent updates for the bulk of the
>>> > data changes and a message queue whose handlers use atomic updates to
>>> > integrate this data one can implement a truly atomic distributed
>>> > model, or an eventual consistency, but without this updates need to be
>>> > restricted to exactly one document.
>>> >
>>> > Eventual consistency is still possible using either locks or by
>>> > breaking down what would have been large distributed transactions into
>>> > smaller ones, but the key is that the code that will make things
>>> > actually consistent must still have ACID guarantees (and be dispatched
>>> > in order).
>>> >
>>> > The 0.9 model CouchDB is effectively MyISAM without data loss, but
>>> > just because the data is around doesn't mean it's possible to know
>>> > what to do with it (loss of context), or even fix it safely (the
>>> > conflict resolution code is susceptible to conflicts too).
>>> >
>>> > Unfortunately for eventual consistency to actually work the breaking
>>> > down of operations must be done on application level, the database
>>> > can't decide which data can be deferred and which data cannot.
>>> >
>>> > All immutable data and all new data can obviously be added to the
>>> > database outside of a transaction, but eventually a transaction
>>> > linking this data must be part of an atomic mutation.
>>> >
>>> > The only way to support this without atomic operations on a unit
>>> > larger than a document, is to have a "master" document for every
>>> > transitive closure the graph structure requiring consistency, which in
>>> > effect only actually relates to immutable snapshot documents (e.g.
>>> > where the ID is a hash of the data). If these closures overlap then a
>>> > single "master" for the whole graph will be needed.
>>> >
>>> >
>>> > To illustrate, let's make up a social networking example. Let's say
>>> > you are adding a friend on this social network, and that this
>>> > operation involves 3 updates, one to add a link from your profile to
>>> > your friend's ID, another for the inverse, and a third update to
>>> > send a "hello" message to the friend, updating their inbox.
>>> > The first update lives in one partition, and the second and third
>>> > updates are on a second one.
>>> >
>>> > The back pointers in your new friends must be updated. In a fully
>>> > transactional model this would lock the friend's document and yours at
>>> > the same time, in an eventual consistency model this would queue a
>>> > message for the friend's partition, and a message handler on the
>>> > friend's partition would update this atomically "eventually". It's
>>> > fine for the link to be out of date for a while, but eventually it
>>> > needs to be fixed (e.g. if you want to remove the friend, message
>>> > them, etc).
>>> >
>>> > In couchdb 0.9 one of the writes will get a "conflict" error back, and
>>> > they could refetch the updated version and try the edit again. The
>>> > problem is that if they wrote the third update to another
>>> > document on the same node making assumptions about the same data, that
>>> > write may have succeeded, leaving the data inconsistent. Under an
>>> > eventual consistency model you still use transactions to do these
>>> > updates, you just must design your model to break them down into
>>> > smaller units.
>>> >
>>> > The reason a graph structure is more susceptible to inconsistency is
>>> > that while in a relational model many data linkage operations can be
>>> > done with a single insert/update (e.g. `insert into edges (node1_id,
>>> > node2_id)`), in a document-based database this type of operation
>>> > involves modifying all the affected documents. The chance of
>>> > inconsistency is increased because contention is higher and there is
>>> > more data that must be synchronized.
>>> >
>>> > However, in another post Damien said:
>>> >
>>> >> Which is why in general you want to avoid inter-document dependencies,
>>> >> or be relaxed in how you deal with them.
>>> >
>>> > So I think I best shut up after this without some decision maker
>>> > telling me not to, if my use case is not covered by the intended
>>> > design then that's that, but I do think this thread sort of covers
>>> > this:
>>> >
>>> >> As far as distributed transactions go, I'd be thrilled if we could
>>> >> implement it and also support the rest of couchdb, like views and bi-
>>> >> directional replication. Please start up a discussion here in dev@
>>> >> about it and see if you can work out a design.
>>> >
>>> > Without going too pie-in-the-sky.
>>> >
>>> > Cheers,
>>> > Yuval
>>> >
>>>
>>>
>>>
>>> --
>>> Chris Anderson
>>> http://jchrisa.net
>>> http://couch.io
>>>
>>
>

Re: reiterating transactions vs. replication

Posted by Scott Shumaker <ss...@gmail.com>.
Inter-document dependencies come up pretty quickly when you start
trying to represent complex data structures in CouchDB.  There are
still a few cases we've encountered where there isn't a great way to
avoid needing transactions.  A few examples:

1)
'A' maintains an ordered list of 'B' elements, where the order is
user-defined - imagine allowing a user to re-order photos in a
slideshow.  You want to store the photos separately from the
slideshow, because they can be used in multiple slideshows.  Whenever
you add or remove a photo, you need to update the order list as well.

I've seen some people suggest a gross solution where you store
floating point order ids inside the B elements and change them to
wedge an item in between two other items (averaging the two new
siblings' orders), but this is incredibly brittle and breaks down
with enough re-ordering.
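
(For illustration, a minimal Python sketch of the averaging scheme just
described, showing why it is brittle: a 64-bit float runs out of precision
after a few dozen insertions into the same gap. All names are invented.)

    # Fractional ordering: each photo carries a float "order" key, and wedging
    # an item between two neighbours averages their keys. Repeated insertions
    # into the same gap exhaust float precision, which is the brittleness.
    def wedge_between(low, high):
        return (low + high) / 2.0

    low, high = 1.0, 2.0
    for i in range(60):
        mid = wedge_between(low, high)
        if mid == low or mid == high:
            print("precision exhausted after", i, "insertions")
            break
        low = mid  # keep wedging new items into the same, shrinking gap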

Another unpleasant approach is to create separate 'order objects' in
couchdb (representing the order of an item within a folder), storing
an internal version number (timestamp) inside the 'order object' - so
you never change the order node, you just create a new node.  Then,
you only use the 'latest' version of this order node (either on
the client side or with a reduce).  To make this work, you need to
ensure that your 'internal version numbers' are monotonically
increasing.  This isn't a problem for some applications, and can be
solved in general with a specialized 'number server'.
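
(For illustration, a rough Python sketch of the 'order object' idea under the
stated assumption of a monotonically increasing version source; the field
names are invented, and the selection below stands in for what a reduce or a
client-side pass would do.)

    # Immutable "order object" documents: a reorder writes a new doc rather
    # than editing an old one; readers keep only the newest per photo.
    order_docs = [
        {"_id": "order:a1", "slideshow": "s1", "photo": "p7", "position": 3, "version": 101},
        {"_id": "order:a2", "slideshow": "s1", "photo": "p7", "position": 1, "version": 205},
        {"_id": "order:a3", "slideshow": "s1", "photo": "p9", "position": 2, "version": 150},
    ]

    def current_order(docs):
        latest = {}
        for d in docs:
            key = (d["slideshow"], d["photo"])
            if key not in latest or d["version"] > latest[key]["version"]:
                latest[key] = d
        return sorted(latest.values(), key=lambda d: d["position"])

    for d in current_order(order_docs):
        print(d["photo"], "at position", d["position"])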

2)
Representing graph/linked-list datastructures.

If you delete a node from a linked list, you need to update two nodes
- the previous node and the node itself.  You can try the second
suggestion in the previous item to make this work (effectively, store
the link relationship as separate objects and generate new link
objects with incrementing version numbers)

I'm sure there are other cases - these two just have been a thorn in
our side.  But for a lot of non-trivial data applications,
transactions still end up being very useful.

On Fri, May 22, 2009 at 2:45 PM, Nathan Stott <nr...@gmail.com> wrote:
> As a user, when I chose couchdb for my most recent project, I chose it
> because I didn't care about transactions.  I would've used RDBMS if that
> were important.
> I chose it because couch solved the problems I needed solved very well.
>
> I don't think transactions should be a big dev focus.
>
> On Fri, May 22, 2009 at 4:30 PM, Chris Anderson <jc...@apache.org> wrote:
>
>> On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman <no...@woobling.org>
>> wrote:
>> > 2009/5/21 Adam Kocoloski <ko...@apache.org>:
>> >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
>> >> rehash all the discussion from back in February (see the thread
>> beginning at
>> >>
>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e
>>  for
>> >> a particularly detailed discussion), but I do want to comment on one
>> aspect.
>> >>
>> >> Updating the replicator to be smart about atomic bulk transactions is
>> doable
>> >> (although a major undertaking), but when you throw DB compaction and
>> >> revision stemming into the mix things get really hairy.  Recall that
>> CouchDB
>> >> revisions are used for concurrency control, not for maintaining history.
>> >>  Consider the following sequence of events:
>> >>
>> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>> >> 2) Update foo -> foo/2
>> >> Compact the DB (foo/1 is deleted)
>> >> Start replicating to a mirror
>> >> Replication crashes before it reaches foo/2
>> >
>> > By crash you mean an error due to a conflict between foo/2 and foo/1'
>> > (the mirror's version of foo), right?
>> >
>> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>>  I
>> >> think this means we'd need to modify the compaction algorithm to keep
>> >> revisions of documents if a) the revision was part of an atomic
>> _bulk_docs,
>> >> and b) any of the documents in that transaction are still at the
>> revision
>> >> generated by the transaction.  Same thing goes for revision stemming --
>> we
>> >> can never drop revisions if they were part of an atomic upload and at
>> least
>> >> one of the document revs in the upload is still current.
>> >
>> > Yep. Personally I see this as a tradeoff, not a limitation per se. If
>> > you specify 'atomic' then you must pay more in terms of data size,
>> > performance, etc.
>>
>> The problem as I see it is that someone else's bulk transaction will
>> have to sit around in my database, until I edit all the docs in it.
>> Hopefully I won't get any distributed conflicts on other old versions
>> of docs in the group because this would put edits that I've done
>> locally to other documents in the bulk group, somehow less valid.
>>
>> Distributed bulk transactions would make for chaotic behavior, as
>> someone's mostly unrelated change on a remote node could eventually
>> replicate to me (months later) and knock an entire line of work that
>> I've done into a conflict state.
>>
>> If you want atomicity, put it in a single document.
>>
>> Chris
>>
>> >
>> > In 0.8 you would have theoretically had to pay by default, but didn't
>> > because replication broke transactions.
>> >
>> > The basic algorithm is still the same, but the garbage collected unit
>> > is changed (instead of garbage collecting document revisions it
>> > garbage collects revision sets, with the current case being a set with
>> > one member. The rules still apply (if this object is wholly shadowed
>> > by non conflicting changes then it can be disposed of)). IIRC the
>> > algorithm is a copying garbage collector, so this is pretty easy to do
>> > (you walk a DAG instead of a linked list).
>> >
>> > Under the proposed model you'd choose which operations are
>> > transactional and will have to pay for those.
>> >
>> >
>> > Anyway, thanks for your link as well, I was reading through a rather
>> > boring thread and didn't see this one, so I guess I did miss out. It
>> > seemed to imply the discussion was done only on IRC.
>> >
>> > Anyway, here goes...
>> >
>> > The fundamental problem is that any consistent data model needs at the
>> > very least to have atomic primitives and ordered message passing (with
>> > transactional message handlers) at the per-partition level, or
>> > atomicity and consistency is restricted to a single document.
>> >
>> > What concerns me is Damien's post
>> > (
>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e
>> ):
>> >
>> >> No, CouchDB replication doesn't support replicating the transactions.
>> >> Never has, never will. That's more like transaction log replication
>> >> that's in traditional dbs, a different beast.
>> >>
>> >> For the new bulk transaction model, I'm only proposing supporting
>> >> eventual consistency. All changes are safe to disk, but the db may not
>> >> be in a consistent state right away.
>> >
>> > From what I know this assumption is wrong. Eventual consistency still
>> > needs atomic primitives, it's not about whether or not you have
>> > transactions, it's about what data they affect (eventual consistency
>> > involves breaking them down).
>> >
>> > Anyway, "never will" sounds pretty binding, but for the sake of argument:
>> >
>> > By using only insertions and idempotent updates for the bulk of the
>> > data changes and a message queue whose handlers use atomic updates to
>> > integrate this data one can implement a truly atomic distributed
>> > model, or an eventual consistency, but without this updates need to be
>> > restricted to exactly one document.
>> >
>> > Eventual consistency is still possible using either locks or by
>> > breaking down what would have been large distributed transactions into
>> > smaller ones, but the key is that the code that will make things
>> > actually consistent must still have ACID guarantees (and be dispatched
>> > in order).
>> >
>> > The 0.9 model CouchDB is effectively MyISAM without data loss, but
>> > just because the data is around doesn't mean it's possible to know
>> > what to do with it (loss of context), or even fix it safely (the
>> > conflict resolution code is susceptible to conflicts too).
>> >
>> > Unfortunately for eventual consistency to actually work the breaking
>> > down of operations must be done on application level, the database
>> > can't decide which data can be deferred and which data cannot.
>> >
>> > All immutable data and all new data can obviously be added to the
>> > database outside of a transaction, but eventually a transaction
>> > linking this data must be part of an atomic mutation.
>> >
>> > The only way to support this without atomic operations on a unit
>> > larger than a document, is to have a "master" document for every
>> > transitive closure of the graph structure requiring consistency, which in
>> > effect only actually relates to immutable snapshot documents (e.g.
>> > where the ID is a hash of the data). If these closures overlap then a
>> > single "master" for the whole graph will be needed.
>> >
>> >
>> > To illustrate, let's make up a social networking example. Let's say
>> > you are adding a friend on this social network, and that this
>> > operation involves 3 updates, one to add a link from your profile to
>> > your friend's ID, another for the inverse, and a third update to send
>> > a "hello" message to the friend, updating their inbox.
>> > The first update lives in one partition, and the second and third
>> > updates are on a second one.
>> >
>> > The back pointers in your new friend's profile must be updated. In a fully
>> > transactional model this would lock the friend's document and yours at
>> > the same time, in an eventual consistency model this would queue a
>> > message for the friend's partition, and a message handler on the
>> > friend's partition would update this atomically "eventually". It's
>> > fine for the link to be out of date for a while, but eventually it
>> > needs to be fixed (e.g. if you want to remove the friend, message
>> > them, etc).
>> >
>> > In couchdb 0.9 one of the writes will get a "conflict" error back, and
>> > they could refetch the updated version and try the edit again. The
>> > problem is that if they wrote the third update to another
>> > document on the same node making assumptions about the same data, that
>> > write may have succeeded, leaving the data inconsistent. Under an
>> > eventual consistency model you still use transactions to do these
>> > updates, you just must design your model to break them down into
>> > smaller units.
>> >
>> > The reason a graph structure is more susceptible to inconsistency is
>> > that while in a relational model many data linkage operations can be
>> > done with a single insert/update (e.g. `insert into edges (node1_id,
>> > node2_id)`), in a document-based database this type of operation
>> > involves modifying all the affected documents. The chance of
>> > inconsistency is increased because contention is higher and there is
>> > more data that must be synchronized.
>> >
>> > However, in another post Damien said:
>> >
>> >> Which is why in general you want to avoid inter-document dependencies,
>> >> or be relaxed in how you deal with them.
>> >
>> > So I think I best shut up after this without some decision maker
>> > telling me not to, if my use case is not covered by the intended
>> > design then that's that, but I do think this thread sort of covers
>> > this:
>> >
>> >> As far as distributed transactions go, I'd be thrilled if we could
>> >> implement it and also support the rest of couchdb, like views and bi-
>> >> directional replication. Please start up a discussion here in dev@
>> >> about it and see if you can work out a design.
>> >
>> > Without going too pie-in-the-sky.
>> >
>> > Cheers,
>> > Yuval
>> >
>>
>>
>>
>> --
>> Chris Anderson
>> http://jchrisa.net
>> http://couch.io
>>
>

Re: reiterating transactions vs. replication

Posted by Nathan Stott <nr...@gmail.com>.
As a user, when I chose couchdb for my most recent project, I chose it
because I didn't care about transactions.  I would've used an RDBMS if that
were important.
I chose it because couch solved the problems I needed solved very well.

I don't think transactions should be a big dev focus.

On Fri, May 22, 2009 at 4:30 PM, Chris Anderson <jc...@apache.org> wrote:

> On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman <no...@woobling.org>
> wrote:
> > 2009/5/21 Adam Kocoloski <ko...@apache.org>:
> >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
> >> rehash all the discussion from back in February (see the thread
> beginning at
> >>
> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e
>  for
> >> a particularly detailed discussion), but I do want to comment on one
> aspect.
> >>
> >> Updating the replicator to be smart about atomic bulk transactions is
> doable
> >> (although a major undertaking), but when you throw DB compaction and
> >> revision stemming into the mix things get really hairy.  Recall that
> CouchDB
> >> revisions are used for concurrency control, not for maintaining history.
> >>  Consider the following sequence of events:
> >>
> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> >> 2) Update foo -> foo/2
> >> Compact the DB (foo/1 is deleted)
> >> Start replicating to a mirror
> >> Replication crashes before it reaches foo/2
> >
> > By crash you mean an error due to a conflict between foo/2 and foo/1'
> > (the mirror's version of foo), right?
> >
> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>  I
> >> think this means we'd need to modify the compaction algorithm to keep
> >> revisions of documents if a) the revision was part of an atomic
> _bulk_docs,
> >> and b) any of the documents in that transaction are still at the
> revision
> >> generated by the transaction.  Same thing goes for revision stemming --
> we
> >> can never drop revisions if they were part of an atomic upload and at
> least
> >> one of the document revs in the upload is still current.
> >
> > Yep. Personally I see this as a tradeoff, not a limitation per se. If
> > you specify 'atomic' then you must pay more in terms of data size,
> > performance, etc.
>
> The problem as I see it is that someone else's bulk transaction will
> have to sit around in my database, until I edit all the docs in it.
> Hopefully I won't get any distributed conflicts on other old versions
> of docs in the group because this would put edits that I've done
> locally to other documents in the bulk group, somehow less valid.
>
> Distributed bulk transactions would make for chaotic behavior, as
> someone's mostly unrelated change on a remote node could eventually
> replicate to me (months later) and knock an entire line of work that
> I've done into a conflict state.
>
> If you want atomicity, put it in a single document.
>
> Chris
>
> >
> > In 0.8 you would have theoretically had to pay by default, but didn't
> > because replication broke transactions.
> >
> > The basic algorithm is still the same, but the garbage collected unit
> > is changed (instead of garbage collecting document revisions it
> > garbage collects revision sets, with the current case being a set with
> > one member. The rules still apply (if this object is wholly shadowed
> > by non conflicting changes then it can be disposed of)). IIRC the
> > algorithm is a copying garbage collector, so this is pretty easy to do
> > (you walk a DAG instead of a linked list).
> >
> > Under the proposed model you'd choose which operations are
> > transactional and will have to pay for those.
> >
> >
> > Anyway, thanks for your link as well, I was reading through a rather
> > boring thread and didn't see this one, so I guess I did miss out. It
> > seemed to imply the discussion was done only on IRC.
> >
> > Anyway, here goes...
> >
> > The fundamental problem is that any consistent data model needs at the
> > very least to have atomic primitives and ordered message passing (with
> > transactional message handlers) at the per-partition level, or
> > atomicity and consistency is restricted to a single document.
> >
> > What concerns me is Damien's post
> > (
> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e
> ):
> >
> >> No, CouchDB replication doesn't support replicating the transactions.
> >> Never has, never will. That's more like transaction log replication
> >> that's in traditional dbs, a different beast.
> >>
> >> For the new bulk transaction model, I'm only proposing supporting
> >> eventual consistency. All changes are safe to disk, but the db may not
> >> be in a consistent state right away.
> >
> > From what I know this assumption is wrong. Eventual consistency still
> > needs atomic primitives, it's not about whether or not you have
> > transactions, it's about what data they affect (eventual consistency
> > involves breaking them down).
> >
> > Anyway, "never will" sounds pretty binding, but for the sake of argument:
> >
> > By using only insertions and idempotent updates for the bulk of the
> > data changes and a message queue whose handlers use atomic updates to
> > integrate this data one can implement a truly atomic distributed
> > model, or an eventual consistency, but without this updates need to be
> > restricted to exactly one document.
> >
> > Eventual consistency is still possible using either locks or by
> > breaking down what would have been large distributed transactions into
> > smaller ones, but the key is that the code that will make things
> > actually consistent must still have ACID guarantees (and be dispatched
> > in order).
> >
> > The 0.9 model CouchDB is effectively MyISAM without data loss, but
> > just because the data is around doesn't mean it's possible to know
> > what to do with it (loss of context), or even fix it safely (the
> > conflict resolution code is susceptible to conflicts too).
> >
> > Unfortunately for eventual consistency to actually work the breaking
> > down of operations must be done on application level, the database
> > can't decide which data can be deferred and which data cannot.
> >
> > All immutable data and all new data can obviously be added to the
> > database outside of a transaction, but eventually a transaction
> > linking this data must be part of an atomic mutation.
> >
> > The only way to support this without atomic operations on a unit
> > larger than a document, is to have a "master" document for every
> > transitive closure of the graph structure requiring consistency, which in
> > effect only actually relates to immutable snapshot documents (e.g.
> > where the ID is a hash of the data). If these closures overlap then a
> > single "master" for the whole graph will be needed.
> >
> >
> > To illustrate, let's make up a social networking example. Let's say
> > you are adding a friend on this social network, and that this
> > operation involves 3 updates, one to add a link from your profile to
> > your friend's ID, another for the inverse, and a third update to send
> > a "hello" message to the friend, updating their inbox.
> > The first update lives in one partition, and the second and third
> > updates are on a second one.
> >
> > The back pointers in your new friend's profile must be updated. In a fully
> > transactional model this would lock the friend's document and yours at
> > the same time, in an eventual consistency model this would queue a
> > message for the friend's partition, and a message handler on the
> > friend's partition would update this atomically "eventually". It's
> > fine for the link to be out of date for a while, but eventually it
> > needs to be fixed (e.g. if you want to remove the friend, message
> > them, etc).
> >
> > In couchdb 0.9 one of the writes will get a "conflict" error back, and
> > they could refetch the updated version and try the edit again. The
> > problem is that if they wrote the third update to another
> > document on the same node making assumptions about the same data, that
> > write may have succeeded, leaving the data inconsistent. Under an
> > eventual consistency model you still use transactions to do these
> > updates, you just must design your model to break them down into
> > smaller units.
> >
> > The reason a graph structure is more susceptible to inconsistency is
> > that while in a relational model many data linkage operations can be
> > done with a single insert/update (e.g. `insert into edges (node1_id,
> > node2_id)`), in a document-based database this type of operation
> > involves modifying all the affected documents. The chance of
> > inconsistency is increased because contention is higher and there is
> > more data that must be synchronized.
> >
> > However, in another post Damien said:
> >
> >> Which is why in general you want to avoid inter-document dependencies,
> >> or be relaxed in how you deal with them.
> >
> > So I think I best shut up after this without some decision maker
> > telling me not to, if my use case is not covered by the intended
> > design then that's that, but I do think this thread sort of covers
> > this:
> >
> >> As far as distributed transactions go, I'd be thrilled if we could
> >> implement it and also support the rest of couchdb, like views and bi-
> >> directional replication. Please start up a discussion here in dev@
> >> about it and see if you can work out a design.
> >
> > Without going too pie-in-the-sky.
> >
> > Cheers,
> > Yuval
> >
>
>
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io
>

Re: reiterating transactions vs. replication

Posted by Chris Anderson <jc...@apache.org>.
On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman <no...@woobling.org> wrote:
> 2009/5/21 Adam Kocoloski <ko...@apache.org>:
>> Hi Yuval, thanks for this well-written proposal.  I don't really want to
>> rehash all the discussion from back in February (see the thread beginning at
>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e for
>> a particularly detailed discussion), but I do want to comment on one aspect.
>>
>> Updating the replicator to be smart about atomic bulk transactions is doable
>> (although a major undertaking), but when you throw DB compaction and
>> revision stemming into the mix things get really hairy.  Recall that CouchDB
>> revisions are used for concurrency control, not for maintaining history.
>>  Consider the following sequence of events:
>>
>> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>> 2) Update foo -> foo/2
>> Compact the DB (foo/1 is deleted)
>> Start replicating to a mirror
>> Replication crashes before it reaches foo/2
>
> By crash you mean an error due to a conflict between foo/2 and foo/1'
> (the mirror's version of foo), right?
>
>> In your proposal, we should expect foo/1 to exist on the mirror, right?  I
>> think this means we'd need to modify the compaction algorithm to keep
>> revisions of documents if a) the revision was part of an atomic _bulk_docs,
>> and b) any of the documents in that transaction are still at the revision
>> generated by the transaction.  Same thing goes for revision stemming -- we
>> can never drop revisions if they were part of an atomic upload and at least
>> one of the document revs in the upload is still current.
>
> Yep. Personally I see this as a tradeoff, not a limitation per se. If
> you specify 'atomic' then you must pay more in terms of data size,
> performance, etc.

The problem as I see it is that someone else's bulk transaction will
have to sit around in my database until I edit all the docs in it.
Hopefully I won't get any distributed conflicts on other old versions
of docs in the group, because that would make edits that I've done
locally to other documents in the bulk group somehow less valid.

Distributed bulk transactions would make for chaotic behavior, as
someone's mostly unrelated change on a remote node could eventually
replicate to me (months later) and knock an entire line of work that
I've done into a conflict state.

If you want atomicity, put it in a single document.
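
(To make that concrete for the slideshow case above: if the photo order lives
inside the slideshow document itself, a single PUT is atomic, and a concurrent
reorder simply surfaces as a 409 conflict to retry. A minimal Python sketch
against an assumed local CouchDB; the database and field names are invented.)

    import requests

    DB = "http://localhost:5984/photos"  # assumed local instance, illustrative names

    def move_photo(slideshow_id, photo_id, new_index):
        # Reorder by rewriting the order list inside the slideshow doc itself,
        # so the whole change is atomic at the single-document level.
        doc = requests.get(f"{DB}/{slideshow_id}").json()
        order = doc["photo_order"]
        order.remove(photo_id)
        order.insert(new_index, photo_id)
        resp = requests.put(f"{DB}/{slideshow_id}", json=doc)  # doc still carries its _rev
        if resp.status_code == 409:
            return move_photo(slideshow_id, photo_id, new_index)  # lost a race; refetch and retry
        resp.raise_for_status()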

Chris

>
> In 0.8 you would have theoretically had to pay by default, but didn't
> because replication broke transactions.
>
> The basic algorithm is still the same, but the garbage collected unit
> is changed (instead of garbage collecting document revisions it
> garbage collects revision sets, with the current case being a set with
> one member. The rules still apply (if this object is wholly shadowed
> by non conflicting changes then it can be disposed of)). IIRC the
> algorithm is a copying garbage collector, so this is pretty easy to do
> (you walk a DAG instead of a linked list).
>
> Under the proposed model you'd choose which operations are
> transactional and will have to pay for those.
>
>
> Anyway, thanks for your link as well, I was reading through a rather
> boring thread and didn't see this one, so I guess I did miss out. It
> seemed to imply the discussion was done only on IRC.
>
> Anyway, here goes...
>
> The fundamental problem is that any consistent data model needs at the
> very least to have atomic primitives and ordered message passing (with
> transactional message handlers) at the per-partition level, or
> atomicity and consistency is restricted to a single document.
>
> What concerns me is Damien's post
> (http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e):
>
>> No, CouchDB replication doesn't support replicating the transactions.
>> Never has, never will. That's more like transaction log replication
>> that's in traditional dbs, a different beast.
>>
>> For the new bulk transaction model, I'm only proposing supporting
>> eventual consistency. All changes are safe to disk, but the db may not
>> be in a consistent state right away.
>
> From what I know this assumption is wrong. Eventual consistency still
> needs atomic primitives, it's not about whether or not you have
> transactions, it's about what data they affect (eventual consistency
> involves breaking them down).
>
> Anyway, "never will" sounds pretty binding, but for the sake of argument:
>
> By using only insertions and idempotent updates for the bulk of the
> data changes and a message queue whose handlers use atomic updates to
> integrate this data one can implement a truly atomic distributed
> model, or an eventual consistency, but without this updates need to be
> restricted to exactly one document.
>
> Eventual consistency is still possible using either locks or by
> breaking down what would have been large distributed transactions into
> smaller ones, but the key is that the code that will make things
> actually consistent must still have ACID guarantees (and be dispatched
> in order).
>
> The 0.9 model CouchDB is effectively MyISAM without data loss, but
> just because the data is around doesn't mean it's possible to know
> what to do with it (loss of context), or even fix it safely (the
> conflict resolution code is susceptible to conflicts too).
>
> Unfortunately for eventual consistency to actually work the breaking
> down of operations must be done on application level, the database
> can't decide which data can be deferred and which data cannot.
>
> All immutable data and all new data can obviously be added to the
> database outside of a transaction, but eventually a transaction
> linking this data must be part of an atomic mutation.
>
> The only way to support this without atomic operations on a unit
> larger than a document, is to have a "master" document for every
> transitive closure of the graph structure requiring consistency, which in
> effect only actually relates to immutable snapshot documents (e.g.
> where the ID is a hash of the data). If these closures overlap then a
> single "master" for the whole graph will be needed.
>
>
> To illustrate, let's make up a social networking example. Let's say
> you are adding a friend on this social network, and that this
> operation involves 3 updates, one to add a link from your profile to
> your friend's ID, another for the inverse, and a third update to send
> a "hello" message to the friend, updating their inbox.
> The first update lives in one partition, and the second and third
> updates are on a second one.
>
> The back pointers in your new friend's profile must be updated. In a fully
> transactional model this would lock the friend's document and yours at
> the same time, in an eventual consistency model this would queue a
> message for the friend's partition, and a message handler on the
> friend's partition would update this atomically "eventually". It's
> fine for the link to be out of date for a while, but eventually it
> needs to be fixed (e.g. if you want to remove the friend, message
> them, etc).
>
> In couchdb 0.9 one of the writes will get a "conflict" error back, and
> they could refetch the updated version and try the edit again. The
> problem is that if they wrote the third update to another
> document on the same node making assumptions about the same data, that
> write may have succeeded, leaving the data inconsistent. Under an
> eventual consistency model you still use transactions to do these
> updates, you just must design your model to break them down into
> smaller units.
>
> The reason a graph structure is more susceptible to inconsistency is
> that while in a relational model many data linkage operations can be
> done with a single insert/update (e.g. `insert into edges (node1_id,
> node2_id)`), in a document-based database this type of operation
> involves modifying all the affected documents. The chance of
> inconsistency is increased because contention is higher and there is
> more data that must be synchronized.
>
> However, in another post Damien said:
>
>> Which is why in general you want to avoid inter-document dependencies,
>> or be relaxed in how you deal with them.
>
> So I think I best shut up after this without some decision maker
> telling me not to, if my use case is not covered by the intended
> design then that's that, but I do think this thread sort of covers
> this:
>
>> As far as distributed transactions go, I'd be thrilled if we could
>> implement it and also support the rest of couchdb, like views and bi-
>> directional replication. Please start up a discussion here in dev@
>> about it and see if you can work out a design.
>
> Without going too pie-in-the-sky.
>
> Cheers,
> Yuval
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: reiterating transactions vs. replication

Posted by Yuval Kogman <no...@woobling.org>.
2009/5/22 Yuval Kogman <no...@woobling.org>:

> From what I know this assumption is wrong. Eventual consistency still
> needs atomic primitives, it's not about whether or not you have
> transactions, it's about what data they affect (eventual consistency
> involves breaking them down).

Found this link in the archive as well:

http://queue.acm.org/detail.cfm?id=1394128

I think it explains why better than I do.

Re: reiterating transactions vs. replication

Posted by Randall Leeds <ra...@gmail.com>.
On Fri, May 22, 2009 at 12:27, Randall Leeds <ra...@gmail.com> wrote:
>
>
> Since I do like the component model, I'm planning to set up a github
> project to play with some consensus protocols and overlay networks in
> Erlang. Hopefully once I start doing that I'll start to see the places that
> CouchDB can hook into it and get a nice, clean, flexible API. I see the
> problem broken into several tiers.
>
> Transactional Bulk Docs (this is the wishlist and challenge, but has to
> rest on the below)
> Sharding/Replication (_seq consensus / possibly consistent hashing or other
> distributed, deterministic data structure mapping BTree nodes to servers
> [2])
> Communication (either Erlang or a tcp with pluggable overlay-network for
> routing)
>

A revised break-down should be something like:

Transactional Bulk-Docs
Single-Doc Multi-Replica Transactions
Replication / Sharding
Network

Example:

Transactional Bulk-Docs (Server pre-prepares itself as leader for a special
bulk round)
Single-Doc Multi-Replica Transactions (Simple consensus. Special leader for
bulk case. Pre-determined leader normally.)
Replication / Sharding (Any sort of load-balancing, slicing, or static
configuration)
Network (Chord and derivatives (Scalaris uses Chord #), Tapestry, Pastry,
etc)

I think that with the right configuration and components, transactional
bulk-docs are just a special case of single-doc transactions. For example,
if the single-doc layer optimizes for fewer communication rounds by
pre-selecting leaders on a rotating basis, a bulk transaction just
involves revoking all nodes for a sequence-number consensus round and
using an extra round trip to
"take over" the leader position. Then all nodes holding replicas of all
documents involved would have to participate in this new round (or at least
a majority of replicas). Having 'atomic=false' could skip this expense and
make a best-effort serial execution of the updates and fail on conflict.
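
(A toy, self-contained Python sketch of just the semantic difference being
described: with the hypothetical atomic=true, the batch commits only if every
replica holding any affected document accepts it; with atomic=false each
update is applied best-effort and fails individually on conflict. All names
and the integer _rev scheme are invented; none of the consensus machinery is
shown.)

    def try_apply(replica, doc):
        # A replica accepts an update unless it already holds a newer revision.
        current = replica.get(doc["_id"], {"_rev": 0})
        return doc["_rev"] > current["_rev"]

    def bulk_update(replicas, docs, atomic):
        if atomic:
            if all(try_apply(r, d) for r in replicas for d in docs):
                for r in replicas:
                    for d in docs:
                        r[d["_id"]] = d
                return "committed"
            return "rejected as a unit"
        results = {}
        for d in docs:
            ok = all(try_apply(r, d) for r in replicas)
            if ok:
                for r in replicas:
                    r[d["_id"]] = d
            results[d["_id"]] = "ok" if ok else "conflict"
        return results

    replicas = [{"a": {"_id": "a", "_rev": 2}}, {}]  # one replica already has a newer "a"
    docs = [{"_id": "a", "_rev": 1}, {"_id": "b", "_rev": 1}]
    print(bulk_update(replicas, docs, atomic=True))   # rejected as a unit
    print(bulk_update(replicas, docs, atomic=False))  # {'a': 'conflict', 'b': 'ok'}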

Just trying to keep the conversation rolling. But I understand we have to
hit the code soon if this really stands to go somewhere.

Re: reiterating transactions vs. replication

Posted by Randall Leeds <ra...@gmail.com>.
On Fri, May 22, 2009 at 00:34, Paul Davis <pa...@gmail.com> wrote:

> >>
> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> >> 2) Update foo -> foo/2
> >> Compact the DB (foo/1 is deleted)
> >> Start replicating to a mirror
> >> Replication crashes before it reaches foo/2
> >
> > By crash you mean an error due to a conflict between foo/2 and foo/1'
> > (the mirror's version of foo), right?
> >
>
> Pretty sure he means the network link fails (or code fails, etc).


What about "a node has been compromised" or "someone is spoofing messages
from one of the nodes". These questions lead me to thinking about a plug-in
component model for this work. I can imagine very different requirements
with very different overhead even to just maintain the "normal" couchdb
guarantees (putting aside any transactional _bulk_docs).


> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>  I
> >> think this means we'd need to modify the compaction algorithm to keep
> >> revisions of documents if a) the revision was part of an atomic
> _bulk_docs,
> >> and b) any of the documents in that transaction are still at the
> revision
> >> generated by the transaction.  Same thing goes for revision stemming --
> we
> >> can never drop revisions if they were part of an atomic upload and at
> least
> >> one of the document revs in the upload is still current.
> >
>
> In general, there were two main ideas that I saw when reading
> literature on this subject:
>
> 1. Keep some sort of edit history
> 2. If a replication event occurs, and the target node is too out of
> date, then trigger a full database copy.
>
> At one point I suggested something like:
>
> 1. Keep some sort of edit history.
>    We already do this. And with revision stemming we already have a
> configurable "How much history is kept" option. The fancy twist is
> that we don't remove old revisions from the update sequence btree
> until the revision stemming removes the revision info data. These
> revisions are then replicated as part of normal replication.
>
> 2. In the case of replicating from a node that's too far out of whack,
> instead of a full database copy, we just fall back to our current
> replication scheme in that we lose all transaction guarantees (or the
> guarantees that we can no longer guarantee, this is all quite hand
> wavy).
>
> For point 1, I can see one of a few methods to deal with transactions.
> Either we don't commit any of the docs in the transaction until they
> all make it across the wire, or we just mark them as a conflict (with
> maybe a 'in_transaction' modifier or some such). Keeping track of
> revisions is pretty cake because all the documents would be sequential
> in the update sequence btree. And it should also be easy to tell when
> a transaction is so old that we no longer have all the data necessary
> to make it work.


To be clear, not all documents are sequential in the update tree unless we
run some consensus protocol to decide the ordering or come up with some
other vector clock type solution. I don't know much about the latter, but
they've come up in discussions before on this list. I've thought about this
a lot lately and I really like the techniques of Mencius [1] which runs a
simplified Paxos that commits after only two communication rounds.

There's a huge win if we want to allow re-ordering of commits. This is
probably the case unless the application assumes some dependency between
documents (frequently noted as a Bad Idea). Many servers can commit writes
in one communication round. For example, a server can accept some other
server's proposal for sequence number i and commit i+1 (assuming it is the
leader for round i+1) even before it learns the result of the consensus for i
as long as i+1 and i touch different documents.
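
(A small Python sketch of the reordering condition just stated: round i+1 may
commit before the outcome of round i is known only when the two rounds touch
disjoint sets of documents. Names are illustrative.)

    def can_commit_early(pending_round_docs, next_round_docs):
        # True when the next round is independent of the still-undecided one.
        return set(pending_round_docs).isdisjoint(next_round_docs)

    print(can_commit_early({"doc_a", "doc_b"}, {"doc_c"}))           # True: commit i+1 now
    print(can_commit_early({"doc_a", "doc_b"}, {"doc_b", "doc_d"}))  # False: wait for i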

For point 2, I think we should make a distinction between inter-node
replication and replication into and out of the clustered deployment in
order to discuss it well. Inter-node migration of shards might rely on
replication, but if this is the case it should probably be triggered and
managed "from above". In other words, it might involve passing around a
"view id" which increments on any shard migration as well and having nodes
propose "view changes" to the update sequence consensus when shards migrate.
When a view change proposal is accepted, replication starts. Only when
replication ends does the rest of the group "learn" the new mapping. If the
nodes cooperate on changing to a new view with a new shard-to-node mapping I
don't think there should ever be conflicts caused by replication.

Some thought needs to go into the other scenario (replicating in/out with
the cluster viewed as a single CouchDB instance), but something tells me if
we get globally ordered seq numbers it's trivial.


> >
> > However, in another post Damien said:
> >
> >> Which is why in general you want to avoid inter-document dependencies,
> >> or be relaxed in how you deal with them.
>

See my point above.


> Though, the more times I end up writing out long responses to how we
> might do replication and the requirements and this and that the more
> likely I'll be to just tag any and all replication emails with "will
> only discuss working code". Judging from date stamps in that thread,
> it's been four months and not one person has offered even a
> broken-almost-but-not-quite-working patch. In the words of Damien's
> blog tagline, "Everybody keeps on talking about it. Nobody's getting
> it done".
>

Since I do like the component model, I'm planning to set up a github project
to play with some consensus protocols and overlay networks in Erlang.
Hopefully once I start doing that I'll start to see the places that CouchDB
can hook into it and get a nice, clean, flexible API. I see the problem
broken into several tiers.

Transactional Bulk Docs (this is the wishlist and challenge, but has to rest
on the below)
Sharding/Replication (_seq consensus / possibly consistent hashing or other
distributed, deterministic data structure mapping BTree nodes to servers
[2])
Communication (either Erlang or a tcp with pluggable overlay-network for
routing)

In the long term I'd love to see CouchDB be able to handle all kinds of
deployments no one is doing right now. For example, with one component stack
you might be able to run CouchDB in a peer-to-peer overlay with replication
and automatic migration, byzantine fault tolerant consensus, and an
application which signs documents in such a way that they can be verified on
reads. Why? Who knows. You dream it. However, this argument leads me to
think that an "atomic" flag might be ok. If your deployment doesn't support
it, just send back an error code.

> I think it'd be appropriate to amend that to: If anyone wants this
> feature, then start sending code. We're all happy to help introduce
> people to the code base if guidance is required, but enough time has
> gone by that it's hard to seriously consider proposals with no concrete
> realization.


I have graduation ceremonies this weekend. Afterward, I hope to follow
through on this advice and invite anyone to join me. So, keep the discussion
flowing and I think we'll get some code flowing sooner rather than later.

[1] Mencius: Building Efficient Replicated State Machines for WANs
http://www.sysnet.ucsd.edu/~yamao/pub/mencius-osdi.pdf

Re: reiterating transactions vs. replication

Posted by Paul Davis <pa...@gmail.com>.
>>
>> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>> 2) Update foo -> foo/2
>> Compact the DB (foo/1 is deleted)
>> Start replicating to a mirror
>> Replication crashes before it reaches foo/2
>
> By crash you mean an error due to a conflict between foo/2 and foo/1'
> (the mirror's version of foo), right?
>

Pretty sure he means the network link fails (or code fails, etc).

>> In your proposal, we should expect foo/1 to exist on the mirror, right?  I
>> think this means we'd need to modify the compaction algorithm to keep
>> revisions of documents if a) the revision was part of an atomic _bulk_docs,
>> and b) any of the documents in that transaction are still at the revision
>> generated by the transaction.  Same thing goes for revision stemming -- we
>> can never drop revisions if they were part of an atomic upload and at least
>> one of the document revs in the upload is still current.
>

In general, there were two main ideas that I saw when reading
literature on this subject:

1. Keep some sort of edit history
2. If a replication event occurs, and the target node is too out of
date, then trigger a full database copy.

At one point I suggested something like:

1. Keep some sort of edit history.
    We already do this. And with revision stemming we already have a
configurable "How much history is kept" option. The fancy twist is
that we don't remove old revisions from the update sequence btree
until the revision stemming removes the revision info data. These
revisions are then replicated as part of normal replication.

2. In the case of replicating from a node that's too far out of whack,
instead of a full database copy, we just fall back to our current
replication scheme in that we lose all transaction guarantees (or the
guarantees that we can no longer guarantee, this is all quite hand
wavy).

For point 1, I can see one of a few methods to deal with transactions.
Either we don't commit any of the docs in the transaction until they
all make it across the wire, or we just mark them as a conflict (with
maybe a 'in_transaction' modifier or some such). Keeping track of
revisions is pretty cake because all the documents would be sequential
in the update sequence btree. And it should also be easy to tell when
a transaction is so old that we no longer have all the data necessary
to make it work.
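
(A hedged Python sketch blending the two options above: members of an atomic
group arriving at the replication target are held as provisional conflicts,
tagged with the suggested 'in_transaction' marker, until every document in the
group has arrived, and are then committed together. The marker and the
bookkeeping are hypothetical, not existing CouchDB behaviour.)

    pending = {}  # txn_id -> {doc_id: doc}, kept by the replication target

    def receive(doc, txn_id, txn_size, commit, store_conflict):
        group = pending.setdefault(txn_id, {})
        group[doc["_id"]] = dict(doc, in_transaction=txn_id)
        if len(group) < txn_size:
            store_conflict(group[doc["_id"]])  # incomplete: visible only as a conflict
            return "waiting"
        for member in pending.pop(txn_id).values():
            commit(member)  # whole group arrived: promote it together
        return "committed"

    provisional, committed = [], []
    receive({"_id": "foo", "_rev": "1-a"}, "t1", 2, committed.append, provisional.append)
    receive({"_id": "bar", "_rev": "1-b"}, "t1", 2, committed.append, provisional.append)
    print(len(provisional), "held provisionally;", len(committed), "committed")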

As Yuval describes, the underlying idea would be that you only pay the
cost if you so choose.

On the flip side, this adds a decent amount of complexity to the
replicator and bookkeeping to other parts of the database.

> Yep. Personally I see this is a tradeoff, not a limitation per se. If
> you specify 'atomic' then you must pay more in terms of data size,
> performance, etc.
>

[snip]

> What concerns me is Damien's post
> (http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e):
>
>> No, CouchDB replication doesn't support replicating the transactions.
>> Never has, never will. That's more like transaction log replication
>> that's in traditonal dbs, a different beast.
>>
>> For the new bulk transaction model, I'm only proposing supporting
>> eventual consistency. All changes are safe to disk, but the db may not
>> be in a consistent state right away.
>
> From what I know this assumption is wrong. Eventual consistency still
> needs atomic primitives, it's not about whether or not you have
> transactions, it's about what data they affect (eventual consistency
> involves breaking them down).
>

I'm not sure I follow this part. What aspect of eventual consistency
requires atomicity guarantees? CouchDB eventual consistency is like
making dinner plans with a large group of friends. Sometimes different
parts of the network might have a different idea of which restaurant
everyone's meeting at, but assuming everyone remembered to charge
their phones eventually everyone will get to the right place.

> Anyway, "never will" sounds pretty binding, but for the sake of argument:
>

I think he was referring to the heavy log replication stuff that
RDBMS' tend towards. From what I've read these types of approaches
require runtime characteristics that don't fit with the rest of
CouchDB.

If we found a transaction model that worked without hampering the core
design goals of CouchDB then I'm pretty sure everyone would be
extremely enthused about it.

[snip]

>
> However, in another post Damien said:
>
>> Which is why in general you want to avoid inter-document dependencies,
>> or be relaxed in how you deal with them.
>
> So I think I best shut up after this without some decision maker
> telling me not to, if my use case is not covered by the intended
> design then that's that, but I do think this thread sort of covers
> this:
>

Damien's advice is the best idea for most scenarios. It may end up
causing a bit more planning up front for what happens if you have
conflicts and how to take care of such things, but as it turns out, once
you have it working, then you have a huge amount of awesome you can
tap into that just isn't available otherwise (without orders of
magnitude more pain, etc, etc).

I won't tell anyone to shut up, especially when they've clearly done
some thinking and have good insight into the problem. I will say that
this particular problem has come up and I have a feeling that there
are more people than just me that are a bit weary of it. I only took
the time to respond this time because you'd made such a reasoned
argument.

Though, the more times I end up writing out long responses to how we
might do replication and the requirements and this and that the more
likely I'll be to just tag any and all replication emails with "will
only discuss working code". Judging from date stamps in that thread,
it's been four months and not one person has offered even a
broken-almost-but-not-quite-working patch. In the words of Damien's
blog tagline, "Everybody keeps on talking about it. Nobody's getting
it done".

>> As far as distributed transactions go, I'd be thrilled if we could
>> implement it and also support the rest of couchdb, like views and bi-
>> directional replication. Please start up a discussion here in dev@
>> about it and see if you can work out a design.
>
> Without going too pie-in-the-sky.
>

I think it'd be appropriate to amend that to: If anyone wants this
feature, then start sending code. We're all happy to help introduce
people to the code base if guidance is required, but enough time has
gone by that it's hard to seriously consider proposals with no concrete
realization.

> Cheers,
> Yuval
>

HTH,
Paul Davis

Re: reiterating transactions vs. replication

Posted by Yuval Kogman <no...@woobling.org>.
2009/5/21 Adam Kocoloski <ko...@apache.org>:
> Hi Yuval, thanks for this well-written proposal.  I don't really want to
> rehash all the discussion from back in February (see the thread beginning at
> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e for
> a particularly detailed discussion), but I do want to comment on one aspect.
>
> Updating the replicator to be smart about atomic bulk transactions is doable
> (although a major undertaking), but when you throw DB compaction and
> revision stemming into the mix things get really hairy.  Recall that CouchDB
> revisions are used for concurrency control, not for maintaining history.
>  Consider the following sequence of events:
>
> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> 2) Update foo -> foo/2
> Compact the DB (foo/1 is deleted)
> Start replicating to a mirror
> Replication crashes before it reaches foo/2

By crash you mean an error due to a conflict between foo/2 and foo/1'
(the mirror's version of foo), right?

> In your proposal, we should expect foo/1 to exist on the mirror, right?  I
> think this means we'd need to modify the compaction algorithm to keep
> revisions of documents if a) the revision was part of an atomic _bulk_docs,
> and b) any of the documents in that transaction are still at the revision
> generated by the transaction.  Same thing goes for revision stemming -- we
> can never drop revisions if they were part of an atomic upload and at least
> one of the document revs in the upload is still current.

Yep. Personally I see this as a tradeoff, not a limitation per se. If
you specify 'atomic' then you must pay more in terms of data size,
performance, etc.
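
(For concreteness, the opt-in might look like one extra field on an ordinary
_bulk_docs request. 'atomic' here is the flag proposed in this thread, not a
parameter CouchDB actually accepts; the URL and document contents are
invented. A Python sketch:)

    import requests

    payload = {
        "atomic": True,  # hypothetical: all updates succeed together or all conflict
        "docs": [
            {"_id": "profile:alice", "_rev": "3-abc", "friends": ["bob"]},
            {"_id": "profile:bob",   "_rev": "5-def", "friends": ["alice"]},
        ],
    }
    resp = requests.post("http://localhost:5984/social/_bulk_docs", json=payload)
    print(resp.status_code, resp.json())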

In 0.8 you would have theoretically had to pay by default, but didn't
because replication broke transactions.

The basic algorithm is still the same, but the garbage-collected unit
changes: instead of garbage collecting document revisions it garbage
collects revision sets, with the current case being a set with one
member. The rules still apply (if this object is wholly shadowed by
non-conflicting changes then it can be disposed of). IIRC the
algorithm is a copying garbage collector, so this is pretty easy to do
(you walk a DAG instead of a linked list).
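
(A rough Python sketch of the changed unit of garbage collection: compaction
keeps a whole revision set as long as any member is still a current revision,
which also matches Adam's constraint above. The data layout is invented.)

    current = {"foo": "2-x", "bar": "1-b"}   # doc -> its current revision
    revision_sets = [
        {("foo", "1-a"), ("bar", "1-b")},    # atomic group; bar/1-b still current
        {("foo", "2-x")},                    # later single-doc update
        {("baz", "1-z")},                    # wholly superseded elsewhere
    ]

    def keep(rev_set):
        return any(current.get(doc) == rev for doc, rev in rev_set)

    survivors = [s for s in revision_sets if keep(s)]
    print(len(survivors), "of", len(revision_sets), "sets survive compaction")  # 2 of 3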

Under the proposed model you'd choose which operations are
transactional and will have to pay for those.


Anyway, thanks for your link as well, I was reading through a rather
boring thread and didn't see this one, so I guess I did miss out. It
seemed to imply the discussion was done only on IRC.

Anyway, here goes...

The fundamental problem is that any consistent data model needs at the
very least to have atomic primitives and ordered message passing (with
transactional message handlers) at the per-partition level, or
atomicity and consistency are restricted to a single document.

What concerns me is Damien's post
(http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e):

> No, CouchDB replication doesn't support replicating the transactions.
> Never has, never will. That's more like transaction log replication
> that's in traditional dbs, a different beast.
>
> For the new bulk transaction model, I'm only proposing supporting
> eventual consistency. All changes are safe to disk, but the db may not
> be in a consistent state right away.

From what I know this assumption is wrong. Eventual consistency still
needs atomic primitives, it's not about whether or not you have
transactions, it's about what data they affect (eventual consistency
involves breaking them down).

Anyway, "never will" sounds pretty binding, but for the sake of argument:

By using only insertions and idempotent updates for the bulk of the
data changes, plus a message queue whose handlers use atomic updates
to integrate this data, one can implement a truly atomic distributed
model or an eventually consistent one; but without this, updates need
to be restricted to exactly one document.

Eventual consistency is still possible using either locks or by
breaking down what would have been large distributed transactions into
smaller ones, but the key is that the code that will make things
actually consistent must still have ACID guarantees (and be dispatched
in order).

The 0.9 CouchDB model is effectively MyISAM without data loss, but
just because the data is around doesn't mean it's possible to know
what to do with it (loss of context), or even fix it safely (the
conflict resolution code is susceptible to conflicts too).

Unfortunately, for eventual consistency to actually work the breaking
down of operations must be done at the application level; the database
can't decide which data can be deferred and which data cannot.

All immutable data and all new data can obviously be added to the
database outside of a transaction, but eventually a transaction
linking this data must be part of an atomic mutation.

The only way to support this without atomic operations on a unit
larger than a document is to have a "master" document for every
transitive closure of the graph structure requiring consistency, which in
effect only actually relates to immutable snapshot documents (e.g.
where the ID is a hash of the data). If these closures overlap then a
single "master" for the whole graph will be needed.


To illustrate, let's make up a social networking example. Let's say
you are adding a friend on this social network, and that this
operation involves 3 updates, one to add a link from your profile to
your friend's ID, another for the inverse, and a third update to send
a "hello" message to the friend, updating their inbox.
The first update lives in one partition, and the second and third
updates are on a second one.

The back pointers in your new friend's profile must be updated. In a fully
transactional model this would lock the friend's document and yours at
the same time, in an eventual consistency model this would queue a
message for the friend's partition, and a message handler on the
friend's partition would update this atomically "eventually". It's
fine for the link to be out of date for a while, but eventually it
needs to be fixed (e.g. if you want to remove the friend, message
them, etc).
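
(A Python sketch of the queued flavour of this example: the write on your own
partition also enqueues a message document for the friend's partition, and a
handler there applies the back pointer "eventually", retrying on the 409 that
a stale _rev produces. Database and field names are invented.)

    import requests

    MY_DB = "http://localhost:5984/partition_a"      # holds your profile
    FRIEND_DB = "http://localhost:5984/partition_b"  # holds the friend's profile/inbox

    def add_friend(me, friend):
        doc = requests.get(f"{MY_DB}/{me}").json()
        doc.setdefault("friends", []).append(friend)
        requests.put(f"{MY_DB}/{me}", json=doc).raise_for_status()
        # Enqueue a message for the other partition instead of touching it directly.
        msg = {"type": "friend_request", "from": me, "to": friend}
        requests.post(FRIEND_DB, json=msg).raise_for_status()

    def handle_friend_request(msg):
        # Runs on the friend's partition; a single-document update applied
        # atomically, retried until it wins.
        while True:
            doc = requests.get(f"{FRIEND_DB}/{msg['to']}").json()
            doc.setdefault("friends", []).append(msg["from"])
            doc.setdefault("inbox", []).append({"from": msg["from"], "text": "hello"})
            resp = requests.put(f"{FRIEND_DB}/{msg['to']}", json=doc)
            if resp.status_code != 409:
                resp.raise_for_status()
                return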

In couchdb 0.9 one of the writes will get a "conflict" error back, and
they could refetch the updated version and try the edit again. The
problem is that if they wrote the third update to another
document on the same node making assumptions about the same data, that
write may have succeeded, leaving the data inconsistent. Under an
eventual consistency model you still use transactions to do these
updates, you just must design your model to break them down into
smaller units.

The reason a graph structure is more susceptible to inconsistency is
that while in a relational model many data linkage operations can be
done with a single insert/update (e.g. `insert into edges (node1_id,
node2_id)`), in a document-based database this type of operation
involves modifying all the affected documents. The chance of
inconsistency is increased because contention is higher and there is
more data that must be synchronized.
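
(The contrast in code: the relational link is one new row, while the document
version rewrites both endpoint documents, either of which can independently
conflict. A Python sketch with invented names.)

    import requests

    # Relational: INSERT INTO edges (node1_id, node2_id) VALUES ('a', 'b');
    # Document store: the same link touches both endpoint documents.
    DB = "http://localhost:5984/graph"

    def link(a, b):
        for src, dst in ((a, b), (b, a)):
            doc = requests.get(f"{DB}/{src}").json()
            doc.setdefault("neighbours", []).append(dst)
            # A 409 on the second PUT, after the first already succeeded, is
            # exactly the inconsistency window being described.
            requests.put(f"{DB}/{src}", json=doc).raise_for_status()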

However, in another post Damien said:

> Which is why in general you want to avoid inter-document dependencies,
> or be relaxed in how you deal with them.

So I think I'd best shut up after this unless some decision maker
tells me not to; if my use case is not covered by the intended
design then that's that, but I do think this thread sort of covers
this:

> As far as distributed transactions go, I'd be thrilled if we could
> implement it and also support the rest of couchdb, like views and bi-
> directional replication. Please start up a discussion here in dev@
> about it and see if you can work out a design.

Without going too pie-in-the-sky.

Cheers,
Yuval

Re: reiterating transactions vs. replication

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Yuval, thanks for this well-written proposal.  I don't really want  
to rehash all the discussion from back in February (see the thread  
beginning at http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e 
  for a particularly detailed discussion), but I do want to comment on  
one aspect.

Updating the replicator to be smart about atomic bulk transactions is  
doable (although a major undertaking), but when you throw DB  
compaction and revision stemming into the mix things get really  
hairy.  Recall that CouchDB revisions are used for concurrency  
control, not for maintaining history.  Consider the following sequence  
of events:

1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
2) Update foo -> foo/2
3) Compact the DB (foo/1 is deleted)
4) Start replicating to a mirror
5) Replication crashes before it reaches foo/2

In your proposal, we should expect foo/1 to exist on the mirror,  
right?  I think this means we'd need to modify the compaction  
algorithm to keep revisions of documents if a) the revision was part  
of an atomic _bulk_docs, and b) any of the documents in that  
transaction are still at the revision generated by the transaction.   
Same thing goes for revision stemming -- we can never drop revisions  
if they were part of an atomic upload and at least one of the document  
revs in the upload is still current.
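
To spell the sequence out, it maps onto roughly the HTTP calls below
(a sketch only: the database names and mirror URL are invented, the
"atomic" flag is the proposed one from this thread, and the crash in
step 5 obviously can't be scripted):

    # Sketch of the sequence above (hypothetical names; "atomic" is the
    # proposed flag, not an existing CouchDB option).
    import requests

    COUCH = "http://localhost:5984"
    SRC = COUCH + "/source_db"
    MIRROR = "http://mirror.example.com:5984/source_db"

    # 1) foo/1 and bar/1 created in one atomic _bulk_docs operation
    requests.post(SRC + "/_bulk_docs", json={
        "atomic": True,
        "docs": [{"_id": "foo", "value": 1}, {"_id": "bar", "value": 1}],
    })

    # 2) foo updated independently -> foo/2
    foo = requests.get(SRC + "/foo").json()
    foo["value"] = 2
    requests.put(SRC + "/foo", json=foo)

    # 3) compact the database; foo/1's body is discarded
    requests.post(SRC + "/_compact", json={})  # JSON content type expected

    # 4) start replicating to the mirror
    requests.post(COUCH + "/_replicate",
                  json={"source": "source_db", "target": MIRROR})

    # 5) if replication dies before it ships foo/2, the mirror would need
    #    foo/1 to keep the original transaction intact -- but foo/1 is gone.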

Do you agree?  Best, Adam

On May 21, 2009, at 7:00 AM, Yuval Kogman wrote:

> Hello,
>
> In 0.9 CouchDB removed the transactional bulk docs feature in favour
> of simplifying sharding/replication.
>
> The priorities behind this decision as I understood them are:
>
>    1. ensure that applications developed in a single server don't
> suffer from a degradation of guarantees if deployed using sharding
>
>    2. avoid the issues involving transactional
>
>
> I apologize if this proposal has already dismissed before. I did
> search the mailing list archives, but mostly found a discussion on why
> this stuff should not be done on IRC. I blame Jan for encouraging me
> to post ;-)
>
>
>
> So anyway, I think that we can have both features without needing to
> implement something like Paxos, and without silently breaking apps
> when they move to a sharding setup from a single machine setup.
>
>
> The basic idea is to keep the conflicts-are-data approach, keeping the
> current user visible replication and conflict resolution, but to allow
> the user to ask for stricter conflict checking.
>
> The api I propose is for the bulk docs operation to have an optional
> 'atomic' flag. When this flag is set CouchDB would atomically verify
> that all documents were committed without conflict (with respect to
> the supplied _rev), and if any one document conflicts, mark all of
> them as conflicting.
>
> Transaction recovery, conflict resolution etc is still the
> responsibility of the application, but provides an atomic guarantee
> that an inconsistent transaction will fail as a whole if it tries to
> write inconsistent data to the database, a guarantee that cannot be
> made using a client library (there are race conditions).
>
>
>
> Now the hard parts:
>
>
> 1. Replication
>
> The way I understand it replication currently works on a per document
> approach. If 'atomic' was specified in a bulk operation I propose that
> all the revisions created in that bulk operation were kept linked. If
> these linked revisions are being replicated, the same conflict
> resolution must be applied (the replication of the document itself is
> executed as bulk operation with aotmic=true, replicating all
> associated documents as well).
>
> The caveat is that even if you always use bulk docs with the atomic
> flag, if you a switch replica you could lose the D out of ACID:
> documents which are marked as non conflicting in your replica might be
> conflicting in the replica you switch to, in which case transactions
> that have already been committed appear to be rolled back from the
> application's point of view.
>
> This problem obviously already exists in the current implementation,
> but when 'atomic' is specified it could potentially happen a lot more
> often.
>
>
> 2. Data sharding
>
> This one is tricker. Two solutions both of which I think are
> acceptable, and either or both of which could be used:
>
>
> The easy way is to ignore this problem. Well not really: The client
> must ensure that all the documents affected by a single transaction
> are in the same shard, by using a partitioning scheme that allows
> this.
>
> If a bulk_docs operation with atomic set to true would affect multiple
> shards, that is an error (the data could still be written as a
> conflict, of course).
>
> If you want to enable the 'atomic' flag you'll need to be careful
> about how you use sharding. You can still use it for some of the
> transactions, but not all the time. I think this is a flexible and
> pragmatic solution.
>
> This means that if you choose to opt in to fully atomic bulk doc
> operations your app might not be deployable unmodified to a sharded
> setup, but it's never unsafe (no data inconsistencies).
>
> In my opinion this satisfies the requirement for no degradation of
> guarantees. It might not Just Work, but you can't have your cake and
> eat it too at the end of the day.
>
>
>
>
> The second way is harder but potentially still interesting. I've
> included it mostly for the sake of discussion.
>
> The core idea is to provide low level primitives on top of which a
> client or proxy can implement a multi phase commit protocol.
>
> The number of nodes involved in the transaction depends on the data
> in the transaction (it doesn't need to coordinate all the nodes in the
> cluster).
>
> Basically this would break down bulk doc calls into several steps.
> First all the data is inserted to the backend, but it's set as
> conflicting so that it's not accidentally visible.
>
> This operation returns an identifier for the bulk doc operation
> (essentially a ticket for a prepared transaction).
>
> Once the data is available on all the shards it must be made live
> atomically. A two-phase commit starts by acquiring locks on all the
> transaction tickets and trying to apply them (the 'promise'
> phase), and then finalizing that application atomically (the 'accept'
> phase).
>
> To keep things simple the two phases should be scoped to a single keep
> alive connection. If the connection drops the locks should be
> released.
>
> Obviously Paxos ensues, but here's the catch:
>
> - The synchronization can be implemented first as a 3rd party
> component, it doesn't need to affect CouchDB's core
>
> - The atomic primitives are also useful for writing safe conflict
> resolution tools that handle conflicts that span multiple documents.
>
> So even if no one ends up implementing real Multi Paxos in the end
> CouchDB still benefits from having reusable synchronization
> primitives. (If this is interesting to anyone, see below [1])
>
>
>
>
> I'd like to stress that this is all possible to do on top of the
> current 0.9 semantics. The default behavior in 0.9 does not change at
> all. You have to opt in to this more complex behavior.
>
> The problem with 0.9 is that there is no way to ensure atomicity and
> isolation from a client library, it must be done on the server, so by
> removing the ability to do this at all, CouchDB is essentially no
> longer transactional. It's protected from internal data corruption, and
> it's protected from data loss (unlike say MyISAM which will happily
> overwrite your correct data), but it's still a potentially lossy model
> since conflicting revisions are not correlated. This means that you
> cannot have a transactional graph model, it's either or.
>
>
>
> Finally, you should know that I have no personal stake in this. I
> don't rely on CouchDB (yet), but I think it's somewhat relevant for a
> project I work on, and that the requirements for this project are not
> that far fetched. I'm the author of an object graph storage engine for
> Perl called KiokuDB. It serializes every object to a document unless
> told otherwise but the data is still a highly interconnected graph. As
> a user of this system I care a lot about transactions, but not at all
> about sharding (this might not hold for all the users of KiokuDB).
> Basically I already have what I need from KiokuDB; there are numerous
> backends for this system that satisfy me (BerkeleyDB, a transactional
> plain file backend, DBI (PostgreSQL, SQLite, MySQL)), and a number
> that don't fit my personal needs due to lack of transactions
> (SimpleDB, MongoDB, and CouchDB since 0.9).
>
> If things stay this way, then CouchDB is simply not intended for
> users like me (though I'll certainly still maintain
> KiokuDB::Backend::CouchDB).
>
> However, I do *want* to use CouchDB. I think that under many scenarios
> it has clear advantages compared to the other backends (mostly the
> fact that it's so easy, but also views support is nice). I think it'd
> be a shame if what was preventing me was a fix that ended up being a
> low hanging fruit to which no one objected.
>
>
> Regards,
> Yuval
>
>
>
> [1] in Paxos terms the CouchDB shards would do the Acceptor role and
> the client (be it the actual client or a sharding proxy, whatever
> delegates and consolidates the views) performs the Learner role.
> Only the Learner is considered authoritative with respect to the final
> status of a transaction.
>
> Hard crashes of a shard during the 'accept' phase may produce
> inconsistent results if more than one Learner is used to proxy write
> operations. Focusing on this scenario is a *HUGE* pessimization. High
> availability of the Learner role can still be achieved using
> BerkeleyDB style master failover (voting).
>
> This transactional sharding proxy could of course also guarantee
> redundancy of shards.
>
> My point is that Paxos can be implemented as a 3rd party component if
> anyone actually wants/needs it, by providing comparatively simple
> primitives.