Posted to user@cassandra.apache.org by Milind Parikh <mi...@gmail.com> on 2011/04/22 22:31:05 UTC

Manual Conflict Resolution in Cassandra

Is there a chance of getting manual conflict resolution in Cassandra?
Please see attachment for why this is important in some cases.

Regards
Milind

Re: Manual Conflict Resolution in Cassandra

Posted by Oleg Anastasyev <ol...@gmail.com>.
David Strauss <david <at> davidstrauss.net> writes:

> 
> You can actually already perform "manual conflict resolution" in
> Cassandra by naming your columns so that they don't squash each other in
> Cassandra's internal replication. Then, ensure the code that accesses
> Cassandra reads all columns with data you might need to construct the
> resolved value. You could even cache the resolved value if pulling a set
> of columns isn't efficient for you.

In my case this approach raised the cost of the system 2x, because the amount
of data that needed to be stored increased (columns were not compacted because
they now had different names -> more storage, more servers), and it led to
another month of development.
I believe that for apps with many updates to the same column this solution
just doesn't fit.

> 
> 
> It seems with Cassandra 0.8's counter support that more pluggable
> conflict resolution may not be far off.

This will only solve incrementing counters. Cassandra really needs vector
clocks with pluggable app-level conflict resolution support at the compaction
and read stages.

> 
> I wouldn't call the updates here "silently dropped" because, in your
> example, Cassandra's conflict resolution is working as documented. The
> update with the later timestamp is, indeed, retained. Cassandra is not
> an ACID-compliant system, nor does it strive to be.

Even if a design flaw is documented, it is still a design flaw. In practice,
single-timestamp-based conflict resolution hides conflicts from the developer
rather than resolving them. And it forces developers to implement home-grown
conflict detection and resolution schemes, which are very difficult to debug,
instead of having the problem solved once in a generic way.

Currently, for systems with a high update rate to a single column, my advice
would be: just don't use Cassandra.




Re: Manual Conflict Resolution in Cassandra

Posted by David Strauss <da...@davidstrauss.net>.
On Mon, 2011-04-25 at 03:50 -0700, Milind Parikh wrote:
> I suppose the term 'silently dropped' is a matter of perspective.
> Cassandra makes an explicit, automated choice of latest-timestamp-wins.
> In certain situations, this is not the appropriate choice.

I would still insist that using Cassandra and expecting different
behavior makes the *usage* inappropriate, not Cassandra's design. You
*can* fairly say that Cassandra's design limits its potential uses, but
that's subtly different.

As an analogy:
If the engine in a Ford Focus can't tow a trailer with 20 tons of cargo,
you wouldn't say, "In certain situations, the engine Ford chose for the
Focus is not appropriate." No, the inappropriate thing would be choosing
a Focus to tow such a load. The trade-off of not being able to tow 20
tons was intentionally made to maximize fuel-efficiency, lower cost,
reduce complexity, or some other reasonable goal.

In Cassandra, the decision to resolve conflicts by keeping the item with
the later timestamp resulted from careful consideration of the options
(especially vector clocks). And, even if the current design is the
product of minimizing complexity, that is a valid design goal and not an
"inappropriate choice."

Cassandra may eventually gain the capability you're requesting, but
please stop pretending it's a bug or sign of bad judgment.

David

Re: Manual Conflict Resolution in Cassandra

Posted by Milind Parikh <mi...@gmail.com>.
I suppose the term 'silently dropped' is a matter of perspective. Cassandra
makes an explicit, automated choice of latest-timestamp-wins. In certain
situations, this is not the appropriate choice.
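To make the choice explicit, here is a Python sketch of latest-timestamp-wins reconciliation. It is illustrative only; the equal-timestamp tie-break on the value is my understanding of Cassandra's behaviour, not something this thread confirms.

```python
def reconcile(col_a, col_b):
    """Latest-timestamp-wins reconciliation between two versions of the
    same column. Each column is a (timestamp, value) tuple. The losing
    write is simply discarded -- nothing is reported to the application.
    """
    ts_a, val_a = col_a
    ts_b, val_b = col_b
    if ts_a != ts_b:
        return col_a if ts_a > ts_b else col_b
    # Equal timestamps: break the tie deterministically on the value so
    # that every replica converges on the same winner.
    return col_a if val_a >= val_b else col_b
```

The point of the example is the discard: the function returns one winner and gives the caller no signal that a conflict ever happened.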

Regards
Milind
/***********************
sent from my android...please pardon occasional typos as I respond @ the
speed of thought
************************/

On Apr 25, 2011 3:54 AM, "David Strauss" <da...@davidstrauss.net> wrote:

On Fri, 2011-04-22 at 13:31 -0700, Milind Parikh wrote:
> Is there a chance of getting manual confli...
You can actually already perform "manual conflict resolution" in
Cassandra by naming your columns so that they don't squash each other in
Cassandra's internal replication. Then, ensure the code that accesses
Cassandra reads all columns with data you might need to construct the
resolved value. You could even cache the resolved value if pulling a set
of columns isn't efficient for you.

Optionally, have your code remove columns with obsolete data that's
older than any tolerable window of having a "split brain." This approach
is not elegant, but it would solve the scenario in your PDF.

It seems with Cassandra 0.8's counter support that more pluggable
conflict resolution may not be far off.

From the PDF:
> Under Quorum, Cassandra guarantees that once a read has seen a write, all
> others will see that same write.

That's not quite true. Under quorum reads and writes, Cassandra
guarantees that a successful read will get data *at least as fresh as*
the last successful write.

Your statement seems to reference something akin to Cassandra's "read
repair" functionality, which is present even at consistency levels lower
than quorum. However, "read repair" gives a high likelihood (not a
guarantee) that once a read has happened on a column, subsequent reads
will see the most current write, even if the first read didn't.

From the PDF:
> It is possible in Cassandra to have updates being silently dropped.

I wouldn't call the updates here "silently dropped" because, in your
example, Cassandra's conflict resolution is working as documented. The
update with the later timestamp is, indeed, retained. Cassandra is not
an ACID-compliant system, nor does it strive to be.

David

Re: Manual Conflict Resolution in Cassandra

Posted by David Strauss <da...@davidstrauss.net>.
On Fri, 2011-04-22 at 13:31 -0700, Milind Parikh wrote:
> Is there a chance of getting manual conflict resolution in Cassandra?
> Please see attachment for why this is important in some cases.

You can actually already perform "manual conflict resolution" in
Cassandra by naming your columns so that they don't squash each other in
Cassandra's internal replication. Then, ensure the code that accesses
Cassandra reads all columns with data you might need to construct the
resolved value. You could even cache the resolved value if pulling a set
of columns isn't efficient for you.

Optionally, have your code remove columns with obsolete data that's
older than any tolerable window of having a "split brain." This approach
is not elegant, but it would solve the scenario in your PDF.
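To make the scheme concrete, here is a Python sketch of the sibling-column approach: each writer stores its update under a column name unique to that writer and moment, and readers merge all the siblings. The storage call is hypothetical (shown as a comment); the naming and merge discipline is the point, not the API.

```python
import time
import uuid

def write_version(row_key, value, client_id):
    """Write under a column name unique to this writer and instant, so
    concurrent updates coexist as siblings instead of squashing each
    other in replication."""
    column = f"v:{time.time():.6f}:{client_id}:{uuid.uuid4().hex[:8]}"
    # In real code something like: client.insert(row_key, column, value)
    return column, value

def resolve(columns):
    """App-level conflict resolution over all sibling columns read back
    for a row. Here: keep every distinct value (a set-union merge); any
    application-specific rule could be substituted."""
    return sorted(set(value for _name, value in columns))
```

Obsolete siblings older than the tolerable split-brain window would then be deleted by the application, as David suggests above.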

It seems with Cassandra 0.8's counter support that more pluggable
conflict resolution may not be far off.

From the PDF:
> Under Quorum, Cassandra guarantees that once a read has seen a write, all
> others will see that same write.

That's not quite true. Under quorum reads and writes, Cassandra
guarantees that a successful read will get data *at least as fresh as*
the last successful write.

Your statement seems to reference something akin to Cassandra's "read
repair" functionality, which is present even at consistency levels lower
than quorum. However, "read repair" gives a high likelihood (not a
guarantee) that once a read has happened on a column, subsequent reads
will see the most current write, even if the first read didn't.
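The overlap behind that "at least as fresh as" guarantee is plain arithmetic: with replication factor N and QUORUM = floor(N/2) + 1, any read quorum and write quorum must share at least one replica. A quick Python check:

```python
def quorum(n):
    """Quorum size for replication factor n: floor(n/2) + 1."""
    return n // 2 + 1

for n in range(1, 8):
    r = w = quorum(n)
    # A quorum read and a quorum write overlap in at least r + w - n
    # replicas, so every successful read touches at least one replica
    # that acknowledged the last successful write.
    assert r + w - n >= 1
```

That shared replica is why the guarantee is "at least as fresh as the last successful write" rather than anything stronger: freshness comes from the overlap, not from any broadcast.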

From the PDF:
> It is possible in Cassandra to have updates being silently dropped.

I wouldn't call the updates here "silently dropped" because, in your
example, Cassandra's conflict resolution is working as documented. The
update with the later timestamp is, indeed, retained. Cassandra is not
an ACID-compliant system, nor does it strive to be.

David

Re: Manual Conflict Resolution in Cassandra

Posted by Narendra Sharma <na...@gmail.com>.
>>>At t8 The request would not start as the CL level of nodes is not
available, the write would not be written to node X. The client would get an
UnavailableException. In response it should connect to a new coordinator and
try again.
[Naren] There may be (and most likely there will be) a window when CL appears
to be satisfied while the write still fails because the node is actually down.
There are a lot of possible scenarios here. I believe Milind is talking about
some extreme but likely cases.



On Sat, Apr 23, 2011 at 7:31 PM, aaron morton <aa...@thelastpickle.com> wrote:

> Have not read the whole thing, just the timeline. A couple of issues...
>
> At t8 the request would not start, as the CL level of nodes is not
> available; the write would not be written to node X. The client would get an
> UnavailableException. In response it should connect to a new coordinator and
> try again.
>
> At t12, if RR is enabled for the request, the read is sent to all UP
> endpoints for the key. Once CL requests have returned (including the data /
> non-digest request) the responses are repaired and a synchronous (to the
> read request) RR round is initiated.
>
> Once all the requests have responded they are compared again and an async RR
> process is kicked off. So it seems that in a worst-case scenario two rounds
> of RR are possible: one to make sure the correct data is returned for the
> request, and another to make sure that all UP replicas agree, as it may not
> be the case that all UP replicas were involved in completing the request.
>
> So as written, at t8 the write would have failed and not been stored on any
> nodes. So the write at t7 would not be lost.
>
> I think the crux of this example is the failure mode at t8. I'm assuming
> Alice is connected to node X:
>
> 1) If X is disconnected before the write starts, it will not start any
> write that requires Quorum CL. The write fails with an Unavailable error.
> 2) If X disconnects from the network *after* sending the write messages,
> and all messages are successfully actioned (including a local write), the
> request will fail with a TimedOutException, as < CL nodes will respond.
> 3) If X disconnects from the cluster after sending the messages, and the
> messages it sends are lost but the local write succeeds, the request will
> fail with a TimedOutException, as < CL nodes will respond.
>
> In all these cases the request is considered to have failed. The client
> should connect to another node and try again. In the case of timeout the
> operation was not completed to the CL level you asked for. In the case of
> unavailable the operation was not started.
>
> It can look like the RR conflict resolution is a little naive here, but
> it's less simple when you consider another scenario. The write at t8 failed
> at Quorum, and in your deployment the client cannot connect to another node
> in the cluster, so your code drops the CL down to ONE and gets the write
> done. You are happy that any nodes in Alice's partition see her write, and
> that those in Ben's partition see his. When things get back to normal you
> want the most recent write to be what clients consistently see, not the
> most popular value. The Consistency section here
> http://wiki.apache.org/cassandra/ArchitectureOverview says the same: it's
> the most recent value.
>
> I tend to think of Consistency as all clients getting the same response to
> the same query.
>
> Not sure if I've made things clearer, feel free to poke holes in my logic
> :)
>
> Hope that helps.
> Aaron
>
>
> On 23 Apr 2011, at 09:02, Edward Capriolo wrote:
>
> On Fri, Apr 22, 2011 at 4:31 PM, Milind Parikh <mi...@gmail.com>
> wrote:
>
> Is there a chance of getting manual conflict resolution in Cassandra?
>
> Please see attachment for why this is important in some cases.
>
>
> Regards
>
> Milind
>
>
>
>
> I think about this often. LDAP servers like SunOne have pluggable
> conflict resolution. I could see the read-repair algorithm being
> pluggable.
>
>
>


-- 
Narendra Sharma
Solution Architect
*http://www.persistentsys.com*
*http://narendrasharma.blogspot.com/*

Re: Manual Conflict Resolution in Cassandra

Posted by aaron morton <aa...@thelastpickle.com>.
Have not read the whole thing, just the timeline. A couple of issues...

At t8 the request would not start, as the CL level of nodes is not available; the write would not be written to node X. The client would get an UnavailableException. In response it should connect to a new coordinator and try again.

At t12, if RR is enabled for the request, the read is sent to all UP endpoints for the key. Once CL requests have returned (including the data / non-digest request) the responses are repaired and a synchronous (to the read request) RR round is initiated.

Once all the requests have responded they are compared again and an async RR process is kicked off. So it seems that in a worst-case scenario two rounds of RR are possible: one to make sure the correct data is returned for the request, and another to make sure that all UP replicas agree, as it may not be the case that all UP replicas were involved in completing the request.

So as written, at t8 the write would have failed and not been stored on any nodes. So the write at t7 would not be lost.

I think the crux of this example is the failure mode at t8. I'm assuming Alice is connected to node X:

1) If X is disconnected before the write starts, it will not start any write that requires Quorum CL. The write fails with an Unavailable error.
2) If X disconnects from the network *after* sending the write messages, and all messages are successfully actioned (including a local write), the request will fail with a TimedOutException, as < CL nodes will respond.
3) If X disconnects from the cluster after sending the messages, and the messages it sends are lost but the local write succeeds, the request will fail with a TimedOutException, as < CL nodes will respond.

In all these cases the request is considered to have failed. The client should connect to another node and try again. In the case of timeout the operation was not completed to the CL level you asked for. In the case of unavailable the operation was not started.
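That retry discipline can be sketched in a few lines of Python. The exception names match Cassandra's Thrift API of the era; the coordinator list and the write callable are hypothetical stand-ins for a real client connection pool.

```python
# Sketch of "the client should connect to another node and try again".
# On Unavailable the write never started; on TimedOut it did not reach
# CL nodes. In both cases the operation did not complete at the
# requested CL, so trying the next coordinator is the right move.

class UnavailableException(Exception):
    """Raised before the write starts: fewer than CL replicas were up."""

class TimedOutException(Exception):
    """Raised after the write starts: fewer than CL replicas acked."""

def write_with_retry(coordinators, do_write):
    """Attempt the write through each coordinator in turn, returning
    the first successful result or re-raising the last failure."""
    last_error = None
    for node in coordinators:
        try:
            return do_write(node)
        except (UnavailableException, TimedOutException) as exc:
            last_error = exc  # pick another coordinator and try again
    raise last_error
```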

It can look like the RR conflict resolution is a little naive here, but it's less simple when you consider another scenario. The write at t8 failed at Quorum, and in your deployment the client cannot connect to another node in the cluster, so your code drops the CL down to ONE and gets the write done. You are happy that any nodes in Alice's partition see her write, and that those in Ben's partition see his. When things get back to normal you want the most recent write to be what clients consistently see, not the most popular value. The Consistency section here http://wiki.apache.org/cassandra/ArchitectureOverview says the same: it's the most recent value.

I tend to think of Consistency as all clients getting the same response to the same query.  
   
Not sure if I've made things clearer, feel free to poke holes in my logic :)

Hope that helps.
Aaron
 

On 23 Apr 2011, at 09:02, Edward Capriolo wrote:

> On Fri, Apr 22, 2011 at 4:31 PM, Milind Parikh <mi...@gmail.com> wrote:
>> Is there a chance of getting manual conflict resolution in Cassandra?
>> Please see attachment for why this is important in some cases.
>> 
>> Regards
>> Milind
>> 
>> 
> 
> I think about this often. LDAP servers like SunOne have pluggable
> conflict resolution. I could see the read-repair algorithm being
> pluggable.


Re: Manual Conflict Resolution in Cassandra

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Apr 22, 2011 at 4:31 PM, Milind Parikh <mi...@gmail.com> wrote:
> Is there a chance of getting manual conflict resolution in Cassandra?
> Please see attachment for why this is important in some cases.
>
> Regards
> Milind
>
>

I think about this often. LDAP servers like SunOne have pluggable
conflict resolution. I could see the read-repair algorithm being
pluggable.