Posted to user@cassandra.apache.org by Philippe <wa...@gmail.com> on 2011/08/04 01:00:23 UTC

Write everywhere, read anywhere

Hello,
I have a 3-node, RF=3, cluster configured to write at CL.ALL and read at
CL.ONE. When I take one of the nodes down, writes fail which is what I
expect.
When I run a repair, I see data being streamed from those column families...
that I didn't expect. How can the nodes diverge ? Does this mean that
reading at CL.ONE may return inconsistent data ?
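
(A rough sketch of this setup, using today's Python cassandra-driver for
illustration; the driver postdates this thread, and the keyspace and table
names are made up:)

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["node1", "node2", "node3"])  # the 3-node cluster
    session = cluster.connect("my_keyspace")        # hypothetical keyspace

    # Every write must be acknowledged by all three replicas (RF=3, CL.ALL)
    write = SimpleStatement(
        "UPDATE data SET value = %s WHERE key = %s",
        consistency_level=ConsistencyLevel.ALL)
    session.execute(write, ("v1", "k1"))

    # Reads are answered by a single replica (CL.ONE)
    read = SimpleStatement(
        "SELECT value FROM data WHERE key = %s",
        consistency_level=ConsistencyLevel.ONE)
    row = session.execute(read, ("k1",)).one()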

Question 2 : I've been doing this rather than CL.QUORUM because I've been
expecting CL.ONE to return data faster than CL.QUORUM. Is that a good
assumption ? Yes, it's ok for writes to be down for a while.

Thanks

Re: Write everywhere, read anywhere

Posted by Mike Malone <mi...@simplegeo.com>.
On Thu, Aug 4, 2011 at 10:25 AM, Jeremiah Jordan <
JEREMIAH.JORDAN@morningstar.com> wrote:

>  If you have RF=3, quorum won't fail with one node down, so read/write
> quorum will be consistent in the case of one node down. If two nodes go
> down at the same time, you can get inconsistent data from a quorum
> write/read if the write fails with a timeout, the nodes come back up, and
> then one read asks the two nodes that were down what the value is, while
> another read asks the node that was up and a node that was down. Those two
> reads will get different answers.
>

So the short answer is: yeah, the same thing can happen with quorum...

It's true that the failure scenarios are slightly different, but it's not
entirely true that two nodes need to fail to trigger inconsistencies with
quorum. A single node could be partitioned and produce the same result.

If a network event occurs on a single host, then any writes that come in
before the event and are processed before the phi accrual failure detector
kicks in and marks the rest of the cluster unavailable will be written
locally. From the rest of the cluster's perspective only one node "failed,"
but from that node's perspective the entire rest of the cluster failed.
Obviously, similar things could happen with DC_QUORUM if a datacenter went
offline.

Mike
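
(A toy model of the window described above, assuming the partitioned node
keeps applying writes locally until its failure detector fires; plain
Python, not Cassandra code:)

    # Three replicas all start with the same (value, timestamp) pair.
    replicas = {"n1": ("old", 1), "n2": ("old", 1), "n3": ("old", 1)}

    # n3 gets partitioned away. A write it processes before the failure
    # detector marks the other nodes down is applied locally only.
    replicas["n3"] = ("new", 2)

    def read_one(node):
        # CL.ONE returns whatever the single chosen replica holds.
        return replicas[node][0]

    print(read_one("n1"))  # "old"
    print(read_one("n3"))  # "new": divergent until repair runs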

RE: Write everywhere, read anywhere

Posted by Jeremiah Jordan <JE...@morningstar.com>.
If you have RF=3, quorum won't fail with one node down, so read/write quorum will be consistent in the case of one node down. If two nodes go down at the same time, you can get inconsistent data from a quorum write/read if the write fails with a timeout, the nodes come back up, and then one read asks the two nodes that were down what the value is, while another read asks the node that was up and a node that was down. Those two reads will get different answers.
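
(A step-by-step sketch of that sequence in plain Python, with each replica
holding a (value, timestamp) pair:)

    # RF=3: a quorum write of "new" reaches only n1 before n2 and n3 go
    # down mid-write, so the write fails with a timeout. n2 and n3 then
    # come back up with the old value.
    n1, n2, n3 = ("new", 2), ("old", 1), ("old", 1)

    def quorum_read(a, b):
        # The coordinator reconciles the two replies by highest timestamp.
        return max(a, b, key=lambda v: v[1])[0]

    print(quorum_read(n2, n3))  # "old": both formerly-down nodes reply
    print(quorum_read(n1, n2))  # "new": the node that stayed up replies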

Re: Write everywhere, read anywhere

Posted by Mike Malone <mi...@simplegeo.com>.
2011/8/3 Patricio Echagüe <pa...@gmail.com>

>
>
> On Wed, Aug 3, 2011 at 4:00 PM, Philippe <wa...@gmail.com> wrote:
>
>> Hello,
>> I have a 3-node, RF=3, cluster configured to write at CL.ALL and read at
>> CL.ONE. When I take one of the nodes down, writes fail which is what I
>> expect.
>> When I run a repair, I see data being streamed from those column
>> families... that I didn't expect. How can the nodes diverge ? Does this mean
>> that reading at CL.ONE may return inconsistent data ?
>>
>
> We abort the mutation beforehand when there are not enough replicas
> alive. If a mutation goes through and a replica goes down in the middle of
> it, the write can reach some nodes while the overall request times out.
> In that case CL.ONE may return inconsistent data.
>

Doesn't CL.QUORUM suffer from the same problem? There's no isolation or
rollback with CL.QUORUM either. So if I do a quorum write with RF=3 and it
fails after hitting a single node, a subsequent quorum read could return the
old data (if it hits the two nodes that didn't receive the write) or the new
data that failed mid-write (if it hits the node that did receive the write).

Basically, the scenarios where CL.ALL + CL.ONE results in a read of
inconsistent data could also cause a CL.QUORUM write followed by a CL.QUORUM
read to return inconsistent data. Right? The problem (if there is one) is
that even in the quorum case columns with the most recent timestamp win
during repair resolution, not columns that have quorum consensus.

Mike
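
(The resolution rule being described, reduced to a sketch: replica replies
are reconciled purely by timestamp, so a value that reached only one
replica still beats a value the other two agree on. Illustrative Python,
not Cassandra's actual code:)

    def resolve(versions):
        # versions: (value, timestamp) pairs from the replicas that replied.
        # The highest timestamp wins outright; there is no vote counting.
        return max(versions, key=lambda v: v[1])[0]

    # Two replicas hold the old value; one holds a newer, failed write.
    print(resolve([("old", 1), ("old", 1), ("new", 2)]))  # "new"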

Re: Write everywhere, read anywhere

Posted by Patricio Echagüe <pa...@gmail.com>.
On Wed, Aug 3, 2011 at 4:00 PM, Philippe <wa...@gmail.com> wrote:

> Hello,
> I have a 3-node, RF=3, cluster configured to write at CL.ALL and read at
> CL.ONE. When I take one of the nodes down, writes fail which is what I
> expect.
> When I run a repair, I see data being streamed from those column
> families... that I didn't expect. How can the nodes diverge ? Does this mean
> that reading at CL.ONE may return inconsistent data ?
>

We abort the mutation beforehand when there are not enough replicas alive.
If a mutation goes through and a replica goes down in the middle of it, the
write can reach some nodes while the overall request times out.
In that case CL.ONE may return inconsistent data.
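
(With today's Python driver, shown only for illustration since it postdates
this thread, the two outcomes surface as different exceptions:)

    from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["node1"]).connect("my_keyspace")  # hypothetical names
    stmt = SimpleStatement("UPDATE data SET value = 'v1' WHERE key = 'k1'",
                           consistency_level=ConsistencyLevel.ALL)
    try:
        session.execute(stmt)
    except Unavailable:
        # Aborted up front: too few live replicas, nothing was written.
        pass
    except WriteTimeout:
        # The write started but wasn't acknowledged in time; it may have
        # been applied on some replicas and not others.
        pass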

>
> Question 2 : I've been doing this rather than CL.QUORUM because I've been
> expecting CL.ONE to return data faster than CL.QUORUM. Is that a good
> assumption ? Yes, it's ok for writes to be down for a while.
>

When you hit a node that owns the piece of data, CL.ONE will be faster, as
you don't have to wait for a read across the network to reach another node.
For CL.QUORUM we fire reads in parallel to the replicas and wait until a
quorum of them has responded. If I'm not wrong, in some cases the difference
between CL.ONE and CL.QUORUM may be negligible when you hit a coordinator
that doesn't own the data, since you are going over the network anyway
(assuming all nodes take the same time to reply).
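
(This can be modeled with order statistics: with RF=3, CL.ONE waits for the
fastest reply and CL.QUORUM for the second-fastest. A toy simulation with a
made-up latency distribution:)

    import random

    def coordinator_wait(k, rf=3, mean_ms=5.0):
        # Sample one reply time per replica; the coordinator returns once
        # the k-th fastest reply has arrived.
        replies = sorted(random.expovariate(1 / mean_ms) for _ in range(rf))
        return replies[k - 1]

    trials = 100_000
    one = sum(coordinator_wait(1) for _ in range(trials)) / trials
    quorum = sum(coordinator_wait(2) for _ in range(trials)) / trials
    print(f"CL.ONE ~= {one:.2f} ms, CL.QUORUM ~= {quorum:.2f} ms")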

>
> Thanks
>