Posted to user@cassandra.apache.org by Thorsten von Eicken <tv...@rightscale.com> on 2012/01/30 18:51:58 UTC

recovering from network partition

I'm trying to work through various failure modes to figure out the
proper operating procedure and proper client coding practices. I'm a
little unclear about what happens when a network partition gets
repaired. Take the following scenario:
 - cluster with 5 nodes: A thru E; RF = 3; read CL = ONE; write CL = ONE
 - network partition divides A-C off from D-E
 - operation continues on both sides, obviously some data is unavailable
from D-E
 - hinted handoffs accumulate
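The availability claim in this scenario can be checked with a short sketch (Python; it assumes SimpleStrategy-style placement where each token range is replicated on the next RF nodes around the ring, and the node names are just the ones from the example):

```python
# Which token ranges stay readable at CL ONE on each side of the
# partition, assuming RF = 3 with consecutive replica placement.
nodes = ["A", "B", "C", "D", "E"]
RF = 3

def replicas(i):
    """Replica set for the range owned by nodes[i]: the next RF nodes on the ring."""
    return {nodes[(i + k) % len(nodes)] for k in range(RF)}

def readable_at_one(side):
    """Ranges with at least one replica reachable from the given side."""
    return [nodes[i] for i in range(len(nodes)) if replicas(i) & side]

print(readable_at_one({"A", "B", "C"}))  # all 5 ranges reachable
print(readable_at_one({"D", "E"}))       # the {A,B,C} range is unavailable
```

The A-C side can always find some replica for every range, while the D-E side has no replica at all for the range whose copies live entirely on A, B and C, which is why "some data is unavailable from D-E".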

Now the network partition is repaired. The question I have is what is
the sequencing of events, in particular between processing HH and
forwarding read requests across the former partition. I'm hoping that
there is a time period to process HH *before* nodes forward requests.
E.g. it would be really good for A not to forward read requests to D
until D is done with HH processing. Otherwise, clients of A may see a
discontinuity: data that was available during the partition goes away
and then comes back.

Is there a manual or wiki section that discusses some of this and I just
missed it?


Re: recovering from network partition

Posted by aaron morton <aa...@thelastpickle.com>.
If you are working at CL ONE you are accepting that *any* value stored on any replica for a key+col combination is a valid response, and that includes no value.

After the nodes have detected that the others are UP they will start their HH in a staggered fashion, and will rate limit themselves to avoid overwhelming the target node. This may take some time to complete. 
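As a rough illustration of rate-limited hint replay (this is not Cassandra's actual hint-delivery code; the rate and the `send` callback are made-up parameters for the sketch):

```python
import time

def replay_hints(hints, rows_per_second=1000, send=print):
    """Replay queued hints to a recovered node, pausing between rows so
    the target is not overwhelmed. Illustrative only: real hinted handoff
    staggers its start per node and throttles by configured throughput."""
    interval = 1.0 / rows_per_second
    delivered = 0
    for hint in hints:
        send(hint)            # re-apply the missed write on the target
        delivered += 1
        time.sleep(interval)  # crude rate limit
    return delivered
```

The point of the pacing is the behaviour Aaron describes: a node coming back from a partition sees its backlog trickle in over time rather than all at once, so there is a window where reads against it can still return stale or missing data.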
  
>  Otherwise, clients of A may see a
> discontinuity: data that was available during the partition goes away
> and then comes back.
If you are concerned about reads being consistent, then use CL QUORUM.
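The reason QUORUM gives consistent reads is the overlap rule: any write quorum and any read quorum of a majority of the RF replicas must share at least one node, so a QUORUM read always reaches a replica holding the latest successfully QUORUM-written value. A quick check (Python; the node names are hypothetical):

```python
from itertools import combinations

def quorum(rf):
    # Smallest majority of RF replicas.
    return rf // 2 + 1

# For RF = 3 a quorum is 2 nodes: every possible write quorum intersects
# every possible read quorum.
replicas = ["n1", "n2", "n3"]
q = quorum(len(replicas))
for write_set in combinations(replicas, q):
    for read_set in combinations(replicas, q):
        assert set(write_set) & set(read_set), "quorums always intersect"
print(f"QUORUM for RF={len(replicas)} is {q}; all quorum pairs overlap")
```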

If you are reading at CL ONE (in 1.0* ) the read will go to one replica 90% of the time, and you will only get the result from that one replica. That may be any value the key+col has been set to, including no value. 

The other 10% of the time Read Repair will kick in (this is the configured read_repair_chance value in 1.0; you can change it). The purpose of RR is to make it so that the next time a read happens the data is consistent. So reading at CL ONE, the read will go to all replicas; you will get a response from one and only one of them. In the background the responses from the others will be checked and any inconsistency repaired. 
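The behaviour Aaron describes can be sketched like this (Python; a toy model, not Cassandra's implementation - each replica holds a (timestamp, value) pair, the newest timestamp wins, and `read_repair_chance` controls how often the background repair fires):

```python
import random

def read_at_one(replicas, read_repair_chance=0.1, rng=random):
    """Sketch of a CL ONE read: one replica answers the client; with
    probability read_repair_chance all replicas are compared in the
    background and stale ones are overwritten with the newest value."""
    responder = rng.choice(list(replicas))
    result = replicas[responder]      # the client sees only this value
    if rng.random() < read_repair_chance:
        # Background read repair: newest (timestamp, value) wins.
        newest = max(replicas.values(), key=lambda v: v[0])
        for node in replicas:
            replicas[node] = newest
    return result
```

After a repair has run, subsequent CL ONE reads return the same value no matter which replica responds, which is exactly the "next time a read happens the data is consistent" property.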

If you were working at a higher CL, the responses from the CL replicas are checked as part of the read request, synchronously with the read, and you get a consistent result across those nodes. RR may still run in the background, since CL nodes may be fewer than RF nodes.

Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 31/01/2012, at 6:51 AM, Thorsten von Eicken wrote:

> I'm trying to work through various failure modes to figure out the
> proper operating procedure and proper client coding practices. I'm a
> little unclear about what happens when a network partition gets
> repaired. Take the following scenario:
> - cluster with 5 nodes: A thru E; RF = 3; read CL = ONE; write CL = ONE
> - network partition divides A-C off from D-E
> - operation continues on both sides, obviously some data is unavailable
> from D-E
> - hinted handoffs accumulate
> 
> Now the network partition is repaired. The question I have is what is
> the sequencing of events, in particular between processing HH and
> forwarding read requests across the former partition. I'm hoping that
> there is a time period to process HH *before* nodes forward requests.
> E.g. it would be really good for A not to forward read requests to D
> until D is done with HH processing. Otherwise, clients of A may see a
> discontinuity: data that was available during the partition goes away
> and then comes back.
> 
> Is there a manual or wiki section that discusses some of this and I just
> missed it?
>