Posted to user@cassandra.apache.org by Vasileios Vlachos <va...@gmail.com> on 2012/12/19 17:07:00 UTC

Replication Factor and Consistency Level Confusion

Hello All,

We have a 3-node cluster and we created a keyspace (say Test_1) with
Replication Factor set to 3. I know it is not great, but we wanted to test
different behaviors. So, we created a Column Family (say cf_1) and we
tried writing something with Consistency Level ANY, ONE, TWO, THREE,
QUORUM and ALL. We did that while all nodes were in UP state, so we
had no problems at all. No matter what the Consistency Level was, we
were able to insert a value.

Same cluster, different keyspace (say Test_2) with Replication Factor
set to 2 this time and one of the 3 nodes deliberately DOWN. Again, we
created a Column Family (say cf_1) and we tried writing something with
different Consistency Levels. Here is what we got:
ANY: worked (expected...)
ONE: worked (expected...)
TWO: did not work (WHAAAAT???)
THREE: did not work (expected...)
QUORUM: worked (expected...)
ALL: did not work (expected I guess...)

Now, we know that QUORUM derives from (RF/2)+1, so we were expecting
that to work, after all only 1 node was DOWN. Why did Consistency
Level TWO not work then???

Third test... Same cluster again, different keyspace (say Test_3) with
Replication Factor set to 3 this time and 1 of the 3 nodes
deliberately DOWN again. Same approach again: we created a Column
Family (say cf_1), and different Consistency Level settings resulted in
the following:
ANY: worked (whaaaaat???)
ONE: worked (whaaaaat???)
TWO: did not work (whaaaaat???)
THREE: did not work (expected...)
QUORUM: worked (whaaaaat???)
ALL: worked (whaaaaat???)

We thought that if the Replication Factor is greater than the number
of nodes in the cluster, writes are blocked.

Apparently we are completely missing a level of understanding
here, so we would appreciate any help!

Thank you in advance!

Vasilis

Re: Replication Factor and Consistency Level Confusion

Posted by aaron morton <aa...@thelastpickle.com>.

>> if this is actually what is happening, how is it possible to ever have a
>> node-failure resilient Cassandra cluster?
Background: http://thelastpickle.com/2011/06/13/Down-For-Me/

> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.
Take a look at the nodetool getendpoints command. It will tell you which nodes a key is stored on. Though for RF=3 and N=3 it's all of them :)

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

Re: Replication Factor and Consistency Level Confusion

Posted by Tristan Seligmann <mi...@mithrandi.net>.

On Thu, Dec 20, 2012 at 11:26 AM, Vasileios Vlachos
<va...@gmail.com> wrote:
> Initially we were thinking the same thing, that an explanation would
> be that the "wrong" node could be down, but then isn't this something
> that hinted handoff sorts out?

If a node is partitioned from the rest of the cluster (i.e. the node
goes down, but later comes back with the same data it had), it will
obviously be out of date with regard to any writes that happened while
it was down. Anti-entropy (nodetool repair) and read repair will
repair this inconsistency over time, but not right away; hinted
handoff is an optimization that will allow the node to become mostly
consistent right away on rejoining the cluster, as the other nodes will
have stored hints for it while it was down, and will send them to it
once the node is back up.

However, the important thing to note is that this is an
/optimization/. If a replica is down, then it will not be able to
satisfy any consistency level requirements, except for the special
case of CL=ANY. If you use another CL like TWO, then two actual
replica nodes must be up for the ranges you are writing to; a node
that is not a replica but will write a hint does not count.
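
The rule described above can be sketched in a few lines of Python. This is
only a simplified model for reasoning about the tests in this thread, not
Cassandra's actual code; the names required_acks and write_can_succeed are
made up for illustration, and the model assumes that a hint counts only at
CL=ANY.

```python
# Simplified model: how many real replicas must acknowledge a write.
def required_acks(cl: str, rf: int) -> int:
    table = {
        "ANY": 0,               # a hint stored on any live node is enough
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": rf // 2 + 1,  # integer division
        "ALL": rf,
    }
    return table[cl]


def write_can_succeed(cl: str, rf: int, live_replicas: int, live_nodes: int) -> bool:
    """Hypothetical helper: True if the model says the write can be acknowledged."""
    if cl == "ANY":
        # Any live node can accept the write and keep a hint.
        return live_nodes >= 1
    # For every other level only live replicas of the key count.
    return live_replicas >= required_acks(cl, rf)


levels = ["ANY", "ONE", "TWO", "THREE", "QUORUM", "ALL"]

# RF=3 on a 3-node cluster with one node down: 2 live replicas for every key.
rf3 = {cl: write_can_succeed(cl, rf=3, live_replicas=2, live_nodes=2) for cl in levels}

# RF=2 on the same cluster, for a key whose replica set includes the downed node.
rf2_unlucky = {cl: write_can_succeed(cl, rf=2, live_replicas=1, live_nodes=2) for cl in levels}

print("RF=3, one replica down:", rf3)
print("RF=2, one replica down:", rf2_unlucky)
```

Under this model TWO and QUORUM always agree when RF is 2 or 3, which is why
the reported results, where QUORUM worked and TWO did not, look inconsistent.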

> Test 2 (2/3 Nodes UP):
> CL  :    ANY    ONE    TWO    THREE    QUORUM    ALL
> RF 2:    OK     OK     x      x        OK        x

For this test, QUORUM = RF/2+1 = 2/2+1 = 2. A write at QUORUM should
have succeeded if both of the replicas for the range were up, but if
one of the replicas for the range was the downed node, then it would
have failed. I think you can use the 'nodetool getendpoints' command
to list the nodes that are replicas for the given row key.

I am unable to explain how a write at QUORUM could succeed if a write
at TWO for the same key failed.

> Test 3 (2/3 Nodes UP):
> CL  :    ANY    ONE    TWO    THREE    QUORUM    ALL
> RF 3:    OK     OK     x      x        OK        OK

For this test, QUORUM = RF/2+1 = 3/2+1 = 2 (integer division). Again, I am unable to
explain why a write at QUORUM would succeed if a write at TWO failed,
and I am also unable to explain how a write at ALL could succeed, for
any key, if one of the nodes is down.

I would suggest double-checking your test setup; also, make sure you
use the same row keys every time (if this is not already the case) so
that you have repeatable results.

> Furthermore, with regards to being "unlucky" with the "wrong node", if
> this is actually what is happening, how is it possible to ever have a
> node-failure resilient Cassandra cluster? My understanding of this
> implies that even with 100 nodes, 1 in every 100 writes would fail until
> the node is replaced/repaired.

RF is the important number when considering fault-tolerance in your
cluster, not the number of nodes. If RF=3, and you read and write at
quorum, then you can tolerate one node being down in the range you are
operating on. If you need to be able to tolerate two nodes being down,
RF=5 and QUORUM would work. In other words, if you need better fault
tolerance, RF is what you need to increase; if you need better
performance, or you need to store more data, then N (number of nodes
in cluster) is what you need to increase. Of course, N must be at
least as big as RF...
--
mithrandi, i Ainil en-Balandor, a faer Ambar
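
The arithmetic behind the RF=3 and RF=5 figures in the message above can be
checked with a short Python sketch. The function names are made up for
illustration; the only assumption is the formula QUORUM = RF/2 + 1 with
integer division, as used in this thread.

```python
def quorum(rf: int) -> int:
    # QUORUM = RF/2 + 1, using integer division.
    return rf // 2 + 1


def tolerated_replica_failures(rf: int) -> int:
    # Replicas of a single range that may be down while QUORUM still succeeds.
    return rf - quorum(rf)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, "
          f"tolerates {tolerated_replica_failures(rf)} replica(s) down")
```

Note that RF=2 tolerates no failed replica at QUORUM, because the quorum of 2
is both replicas.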

Re: Replication Factor and Consistency Level Confusion

Posted by Henrik Schröder <sk...@gmail.com>.

Don't run with a replication factor of 2, use 3 instead, and do all reads
and writes using quorum consistency.

That way, if a single node is down, all your operations will complete. In
fact, if every third node is down, you'll still be fine and able to handle
all requests.

However, if two adjacent nodes are down at the same time, operations
against keys that are stored on both those servers will fail because quorum
can't be satisfied.

To gain a better understanding, repeat your tests, but with multiple random
keys, and keep track of how many operations fail in each case.

/Henrik
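
The cases in the message above can be explored with a small Python simulation.
This is a rough sketch under stated assumptions, not a model of a real
cluster: each node owns one equally sized range, the replicas of a range are
the owning node and the next RF-1 nodes around the ring, and the function
names are made up for illustration.

```python
def replicas_for(range_index: int, n_nodes: int, rf: int) -> list:
    # Assumed placement: the owner of the range plus the next rf-1 nodes on the ring.
    return [(range_index + k) % n_nodes for k in range(rf)]


def failing_fraction(n_nodes: int, rf: int, down: set, needed: int) -> float:
    """Fraction of key ranges with fewer than `needed` live replicas."""
    failing = 0
    for r in range(n_nodes):
        live = sum(1 for node in replicas_for(r, n_nodes, rf) if node not in down)
        if live < needed:
            failing += 1
    return failing / n_nodes


QUORUM_RF3 = 3 // 2 + 1  # 2

# 9 nodes, RF=3, QUORUM: a single node down.
print(failing_fraction(9, 3, {0}, QUORUM_RF3))
# 9 nodes, RF=3, QUORUM: every third node down.
print(failing_fraction(9, 3, {0, 3, 6}, QUORUM_RF3))
# 9 nodes, RF=3, QUORUM: two adjacent nodes down.
print(failing_fraction(9, 3, {0, 1}, QUORUM_RF3))
# 100 nodes, RF=2, CL=TWO: a single node down.
print(failing_fraction(100, 2, {0}, 2))
# 100 nodes, RF=3, QUORUM: a single node down.
print(failing_fraction(100, 3, {0}, QUORUM_RF3))
```

In this model a single downed node only causes failures when the consistency
level needs every replica of a range, as CL=TWO does with RF=2. With RF=3 and
QUORUM, no range fails until two replicas of the same range are down.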

On Thu, Dec 20, 2012 at 10:26 AM, Vasileios Vlachos <
vasileiosvlachos@gmail.com> wrote:

> Furthermore, with regards to being "unlucky" with the "wrong node", if
> this is actually what is happening, how is it possible to ever have a
> node-failure resilient Cassandra cluster? My understanding of this
> implies that even with 100 nodes, 1 in every 100 writes would fail until
> the node is replaced/repaired.
>

Re: Replication Factor and Consistency Level Confusion

Posted by Vasileios Vlachos <va...@gmail.com>.

Hello,

Thank you very much for your quick responses.

Initially we were thinking the same thing, that an explanation would
be that the "wrong" node could be down, but then isn't this something
that hinted handoff sorts out? So actually, Consistency Level refers
to the number of replicas, not the total number of nodes in a cluster.
Keeping that in mind and assuming that hinted handoff has nothing to
do with that as I thought, I could explain some results but not all.
Let me explain:

Test 1 (3/3 Nodes UP):
CL  :    ANY     ONE    TWO    THREE    QUORUM   ALL
RF 3:    OK      OK     OK     OK       OK       OK

Test 2 (2/3 Nodes UP):
CL  :    ANY    ONE    TWO    THREE    QUORUM    ALL
RF 2:    OK     OK     x      x        OK        x

Test 3 (2/3 Nodes UP):
CL  :    ANY    ONE    TWO    THREE    QUORUM    ALL
RF 3:    OK     OK     x      x        OK        OK

Test 1:
Everything was fine because all nodes were up and the RF does not
exceed the total number of nodes (if it did, writes would be
blocked).

Test 2:
CL=TWO did not work because we were "unlucky" and the "wrong" node,
responsible for the key range we were trying to insert, was DOWN (I
can accept that for now, however I do not quite understand why this
isn't sorted out by the hinted handoff). My explanation might be wrong
again, but CL=THREE should fail because we have only set RF=2, so
there isn't a 3rd replica anywhere anyway. Why did CL=QUORUM not fail
then? Since QUORUM=(RF/2)+1=2 in this case, the write operation should
try to write to 2 replicas, one of which, the one responsible for that
range as we said, is DOWN. I would expect CL=TWO and CL=QUORUM to have
the same outcome in this case. Why is that not the case? CL=ALL fails
for the same reason as CL=TWO, I presume.

Test 3:
I was expecting only CL=ANY and CL=ONE to work in this case. CL=TWO
does not work because, just like with Test 2, the node responsible
for that particular key range is DOWN. If that's the case, why was
CL=QUORUM successful??? The only explanation I can think of at the
moment is that QUORUM explicitly refers to the total number of nodes
in the cluster rather than the number of replicas determined by the
RF. CL=THREE seems easy: it fails because one of the three replicas
is DOWN. CL=ALL is confusing as well. If my understanding is correct
and ALL means all replicas, 3 in this case, then the operation should
fail because one replica is DOWN, and I cannot be "lucky" to have the
right node DOWN, because RF=3. So, every node should have a copy of
the data.

Furthermore, with regards to being "unlucky" with the "wrong node", if
this is actually what is happening, how is it possible to ever have a
node-failure resilient Cassandra cluster? My understanding of this
implies that even with 100 nodes, 1 in every 100 writes would fail until
the node is replaced/repaired.

Thank you very much in advance.

Vasilis

Re: Replication Factor and Consistency Level Confusion

Posted by Roland Gude <ro...@ez.no>.

Hi

RF 2 means that 2 nodes are responsible for any given row (no matter how many nodes are in the cluster)
For your cluster with three nodes let's just assume the following responsibilities

Node            A               B               C
Primary keys    0-5             6-10            11-15
Replica keys    11-15           0-5             6-10

Assume node 'C' is down
Writing any key in range 0-5 with consistency TWO is possible (A and B are up)
Writing any key in range 11-15 with consistency TWO will fail (C is down and 11-15 is its primary range)
Writing any key in range 6-10 with consistency TWO will fail (C is down and it is the replica for this range)

I hope this explains it.
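
The example above can be replayed with a short Python sketch. The placement is
copied from the table (RF=2, nodes A, B and C, with C down); the variable and
function names are made up for illustration, and the model assumes that only
live replicas of a range can acknowledge a write.

```python
# Replica placement from the table above: primary node first, then the replica.
replicas = {
    "0-5": ["A", "B"],
    "6-10": ["B", "C"],
    "11-15": ["C", "A"],
}
down = {"C"}


def live_replicas(key_range: str) -> list:
    # Replicas of the range that are currently up.
    return [node for node in replicas[key_range] if node not in down]


# CL=TWO needs two live replicas; CL=ONE needs one.
can_write_two = {r: len(live_replicas(r)) >= 2 for r in replicas}
can_write_one = {r: len(live_replicas(r)) >= 1 for r in replicas}

for key_range in replicas:
    print(f"keys {key_range}: live={live_replicas(key_range)} "
          f"TWO={'OK' if can_write_two[key_range] else 'fail'} "
          f"ONE={'OK' if can_write_one[key_range] else 'fail'}")
```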

Re: Replication Factor and Consistency Level Confusion

Posted by "Hiller, Dean" <De...@nrel.gov>.

ANY: worked (expected...)
ONE: worked (expected...)
TWO: did not work (WHAAAAT???)

This works sometimes and fails at other times.  It depends on which 2 of
the 3 nodes hold the data.  Since you have one node down, that might be
one of the nodes where that data goes ;).

THREE: did not work (expected...)
QUORUM: worked (expected...)
ALL: did not work (expected I guess...)

Re: Replication Factor and Consistency Level Confusion

Posted by "Hiller, Dean" <De...@nrel.gov>.

PS: you may be getting a bit confused, by the way.  Just think: if you have
a 10-node cluster, one node is down, and you write at CL=TWO... if the node
that is down is where your data goes, yes, you will fail.  If you do CL=QUORUM
and RF=3, you can tolerate one node being down... If you use Astyanax, I think
they have an implementation that will switch the CL to a lower level when a
node is out so the writes will still work, which is quite nice (i.e. the write
fails, switch to a lower CL and write again; same with reads).

Dean
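
The retry-at-a-lower-level idea mentioned above can be sketched in Python as
follows. This is not Astyanax's API; WriteFailed, write_with_downgrade and
fake_write are hypothetical names, and the write callable stands in for
whatever client call performs the write at a given consistency level.

```python
from typing import Callable, Sequence


class WriteFailed(Exception):
    """Hypothetical error raised when a write cannot meet its consistency level."""


def write_with_downgrade(write: Callable[[str], None],
                         levels: Sequence[str] = ("QUORUM", "ONE")) -> str:
    """Try the write at each level in order; return the level that succeeded."""
    if not levels:
        raise ValueError("at least one consistency level is required")
    last_error = None
    for cl in levels:
        try:
            write(cl)
            return cl
        except WriteFailed as err:
            last_error = err  # remember the failure and try the next, lower level
    raise last_error


# Stand-in for a real client call: pretend QUORUM cannot be met right now.
def fake_write(cl: str) -> None:
    if cl == "QUORUM":
        raise WriteFailed("not enough live replicas for QUORUM")


used = write_with_downgrade(fake_write)
print("write accepted at", used)
```

Keep in mind that a write accepted at a lower level gives a weaker guarantee:
a later read at QUORUM is no longer certain to see it until the replicas are
repaired.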
