You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by "tijoriwala.ritesh" <ti...@gmail.com> on 2011/02/23 05:22:16 UTC

How does Cassandra handle failure during synchronous writes

Hi,
I wanted to get details on how does cassandra do synchronous writes to W
replicas (out of N)? Does it do a 2PC? If not, how does it deal with
failures of of nodes before it gets to write to W replicas? If the
orchestrating node cannot write to W nodes successfully, I guess it will
fail the write operation but what happens to the completed writes on X (W >
X) nodes?

Thanks,
Ritesh
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: How does Cassandra handle failure during synchronous writes

Posted by Aaron Morton <aa...@thelastpickle.com>.
At CL levels high than ANY hinted handoff will be used if enabled. It does not contribute to the number of replicas considered written by the coordinator though. E.g. If you ask for quorum, and this is 3 nodes, and only 2 are up the write will fail without starting. In this case the HH is included in the message sent to one of the up nodes.

At CL any HH is accepted as a viable replica. Even if all the natural endpoints are down the coordinator node will store the HH.

aaron
On 24/02/2011, at 3:28 AM, Javier Canillas <ja...@gmail.com> wrote:
> There is something call Hinted Handoff. Suppose that you WRITE something with ConsistencyLevel.ONE on a cluster defined by 4 nodes. Then, the write is done on the corresponding node and it is returned an OK to the client, even if the ReplicationFactor over the destination Keyspace is set to a higher value.
> 
> If in that write, one of the replicated nodes is down, then the coordinator node (the one that will hold value if first place) will mark that replication message as not sent and will retry eventually, making the replication happens.
> 
> Please, if I have explained it wrongly correct me. 
> 
> On Wed, Feb 23, 2011 at 5:45 AM, Aaron Morton <aa...@thelastpickle.com> wrote:
> In the case described below if less than CL nodes respond in rpc_timeout (from conf yaml) the client will get a timeout error. I think most higher level clients will automatically retry in this case.
> 
> If there are not enough nodes to start the request you will get an Unavailable exception. Again the client can retry safely.
> 
> Aaron
> 
> 
> On 23/02/2011, at 8:07 PM, Dave Revell <da...@meebo-inc.com> wrote:
> 
>> Ritesh,
>> 
>> There is no commit protocol. Writes may be persisted on some replicas even though the quorum fails. Here's a sequence of events that shows the "problem:"
>> 
>> 1. Some replica R fails, but recently, so its failure has not yet been detected
>> 2. A client writes with consistency > 1
>> 3. The write goes to all replicas, all replicas except R persist the write to disk
>> 4. Replica R never responds
>> 5. Failure is returned to the client, but the new value is still in the cluster, on all replicas except R.
>> 
>> Something very similar could happen for CL QUORUM.
>> 
>> This is a conscious design decision because a commit protocol would constitute tight coupling between nodes, which goes against the Cassandra philosophy. But unfortunately you do have to write your app with this case in mind.
>> 
>> Best,
>> Dave
>> 
>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <ti...@gmail.com> wrote:
>> 
>> Hi,
>> I wanted to get details on how does cassandra do synchronous writes to W
>> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
>> failures of of nodes before it gets to write to W replicas? If the
>> orchestrating node cannot write to W nodes successfully, I guess it will
>> fail the write operation but what happens to the completed writes on X (W >
>> X) nodes?
>> 
>> Thanks,
>> Ritesh
>> --
>> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
>> 
> 

Re: How does Cassandra handle failure during synchronous writes

Posted by Javier Canillas <ja...@gmail.com>.
There is something call Hinted Handoff. Suppose that you WRITE something
with ConsistencyLevel.ONE on a cluster defined by 4 nodes. Then, the write
is done on the corresponding node and it is returned an OK to the client,
even if the ReplicationFactor over the destination Keyspace is set to a
higher value.

If in that write, one of the replicated nodes is down, then the coordinator
node (the one that will hold value if first place) will mark that
replication message as not sent and will retry eventually, making the
replication happens.

Please, if I have explained it wrongly correct me.

On Wed, Feb 23, 2011 at 5:45 AM, Aaron Morton <aa...@thelastpickle.com>wrote:

> In the case described below if less than CL nodes respond in rpc_timeout
> (from conf yaml) the client will get a timeout error. I think most higher
> level clients will automatically retry in this case.
>
> If there are not enough nodes to start the request you will get an
> Unavailable exception. Again the client can retry safely.
>
> Aaron
>
>
> On 23/02/2011, at 8:07 PM, Dave Revell <da...@meebo-inc.com> wrote:
>
> Ritesh,
>
> There is no commit protocol. Writes may be persisted on some replicas even
> though the quorum fails. Here's a sequence of events that shows the
> "problem:"
>
> 1. Some replica R fails, but recently, so its failure has not yet been
> detected
> 2. A client writes with consistency > 1
> 3. The write goes to all replicas, all replicas except R persist the write
> to disk
> 4. Replica R never responds
> 5. Failure is returned to the client, but the new value is still in the
> cluster, on all replicas except R.
>
> Something very similar could happen for CL QUORUM.
>
> This is a conscious design decision because a commit protocol would
> constitute tight coupling between nodes, which goes against the Cassandra
> philosophy. But unfortunately you do have to write your app with this case
> in mind.
>
> Best,
> Dave
>
> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <<t...@gmail.com>
> tijoriwala.ritesh@gmail.com> wrote:
>
>>
>> Hi,
>> I wanted to get details on how does cassandra do synchronous writes to W
>> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
>> failures of of nodes before it gets to write to W replicas? If the
>> orchestrating node cannot write to W nodes successfully, I guess it will
>> fail the write operation but what happens to the completed writes on X (W
>> >
>> X) nodes?
>>
>> Thanks,
>> Ritesh
>> --
>> View this message in context:
>> <http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html>
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>> Sent from the <ca...@incubator.apache.org>
>> cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
>>
>
>

Re: How does Cassandra handle failure during synchronous writes

Posted by Aaron Morton <aa...@thelastpickle.com>.
In the case described below if less than CL nodes respond in rpc_timeout (from conf yaml) the client will get a timeout error. I think most higher level clients will automatically retry in this case.

If there are not enough nodes to start the request you will get an Unavailable exception. Again the client can retry safely.

Aaron

On 23/02/2011, at 8:07 PM, Dave Revell <da...@meebo-inc.com> wrote:

> Ritesh,
> 
> There is no commit protocol. Writes may be persisted on some replicas even though the quorum fails. Here's a sequence of events that shows the "problem:"
> 
> 1. Some replica R fails, but recently, so its failure has not yet been detected
> 2. A client writes with consistency > 1
> 3. The write goes to all replicas, all replicas except R persist the write to disk
> 4. Replica R never responds
> 5. Failure is returned to the client, but the new value is still in the cluster, on all replicas except R.
> 
> Something very similar could happen for CL QUORUM.
> 
> This is a conscious design decision because a commit protocol would constitute tight coupling between nodes, which goes against the Cassandra philosophy. But unfortunately you do have to write your app with this case in mind.
> 
> Best,
> Dave
> 
> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <ti...@gmail.com> wrote:
> 
> Hi,
> I wanted to get details on how does cassandra do synchronous writes to W
> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
> failures of of nodes before it gets to write to W replicas? If the
> orchestrating node cannot write to W nodes successfully, I guess it will
> fail the write operation but what happens to the completed writes on X (W >
> X) nodes?
> 
> Thanks,
> Ritesh
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
> 

Re: How does Cassandra handle failure during synchronous writes

Posted by Dave Revell <da...@meebo-inc.com>.
Ritesh,

There is no commit protocol. Writes may be persisted on some replicas even
though the quorum fails. Here's a sequence of events that shows the
"problem:"

1. Some replica R fails, but recently, so its failure has not yet been
detected
2. A client writes with consistency > 1
3. The write goes to all replicas, all replicas except R persist the write
to disk
4. Replica R never responds
5. Failure is returned to the client, but the new value is still in the
cluster, on all replicas except R.

Something very similar could happen for CL QUORUM.

This is a conscious design decision because a commit protocol would
constitute tight coupling between nodes, which goes against the Cassandra
philosophy. But unfortunately you do have to write your app with this case
in mind.

Best,
Dave

On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
tijoriwala.ritesh@gmail.com> wrote:

>
> Hi,
> I wanted to get details on how does cassandra do synchronous writes to W
> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
> failures of of nodes before it gets to write to W replicas? If the
> orchestrating node cannot write to W nodes successfully, I guess it will
> fail the write operation but what happens to the completed writes on X (W >
> X) nodes?
>
> Thanks,
> Ritesh
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: How does Cassandra handle failure during synchronous writes

Posted by Jonathan Ellis <jb...@gmail.com>.
This is where things starts getting subtle.

If Cassandra's failure detector knows ahead of time that not enough
writes are available, that is the only time we truly fail a write, and
nothing will be written anywhere.  But if a write starts during the
window where a node is failed but we don't know it yet, then it will
return TimedOutException.

This is commonly called a "failed write" but that is incorrect -- the
write is in progress, but we can't guarantee it's been replicated to
the desired number of replicas.

It's important to note that even in this situation, quorum reads +
writes provide strong consistency.  ("Strong consistency" is defined
as "after an update completes, any subsequent access will return the
updated value.") Quorum eads will be unable to complete as well until
enough machines come back to satisfy the quorum, which is the same
number as needed to finish the write.  So either the original writer
retrying, or the first reader will cause the write to be completed,
after which we're on familiar ground.

Consider the simplest non-trivial quorum, where we are replicating to
nodes X, Y, and Z.  For the case we are interested in, the original
quorum write attempt must time out, so 2 of the 3 replicas (Y and Z)
are temporarily unavailable. The write is applied to one replica (X),
and the client gets a TimedOutException. The write is not failed, it
is not succeeded, it is in progress (and the client should retry,
because it doesn't know for sure that it was applied anywhere at all).

While Y and Z stay down, quorum reads will be rejected.

When they come back up*, a read could achieve a quorum as {X, Y} or
{X, Z} or {Y, Z}.

{Y, Z} is the more interesting case because neither has the new write
yet.  The client will get the old version back, which is fine
according to our contract since the write is still in-progress.  Read
repair will see the new version on X and send it to X and Y.  As soon
as it gets to one of those, the original write is complete, and all
subsequent reads will see the new version.

{X, Y} and {X, Z} are equivalent: one node with the write, and one
without. The read will recognize that X's version needs to be sent to
Z, and the write will be complete.  This read and all subsequent ones
will see the write.  (Z will be replicated to asynchronously via read
repair.)

*If only one comes back up, then you of course only have the {X, Y} or
{X, Z} case.

The important guarantee this gives you is that once one quorum read
sees the new value, all others will too.  You can't see the newest
version, then see an older version on a subsequent write, which is the
characteristic of non-strong consistency (and which you can see in
Cassandra, temporarily, with lower ConsistencyLevels).

On Tue, Feb 22, 2011 at 10:22 PM, tijoriwala.ritesh
<ti...@gmail.com> wrote:
>
> Hi,
> I wanted to get details on how does cassandra do synchronous writes to W
> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
> failures of of nodes before it gets to write to W replicas? If the
> orchestrating node cannot write to W nodes successfully, I guess it will
> fail the write operation but what happens to the completed writes on X (W >
> X) nodes?
>
> Thanks,
> Ritesh
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com