Posted to user@cassandra.apache.org by Kyrylo Lebediev <Ky...@epam.com> on 2018/01/15 15:04:44 UTC

vnodes: high availability

Hi,


Let's say we have a C* cluster with following parameters:

 - 50 nodes in the cluster

 - RF=3

 - vnodes=256 per node

 - CL for some queries = QUORUM

 - endpoint_snitch = SimpleSnitch


Is it correct that any 2 nodes being down will cause unavailability of some key range at CL=QUORUM?


Regards,

Kyrill

Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
Kurt, thanks for your recommendations.

Makes sense.


Yes, we're planning to migrate the cluster and change the endpoint_snitch to an "AZ-aware" one.

Unfortunately, I'm not strong enough in math and have to think about how to calculate probabilities for the vnodes case (whereas the "without vnodes" case should be easy to calculate: just a bit of combinatorics). Not an easy task for me, but I'll try to get at least some estimates.

I still believe that with formulas in hand (the results of doing the math) we could come up with better best practices than those currently stated in the C* documentation.


--

In particular, as far as I understand, the probability of losing a key range [for CL=QUORUM] for a cluster with vnodes=256, SimpleSnitch, and a total number of physical nodes not much more than 256 [there aren't many clusters that large...] equals:

P1 = C(Nnodes, 2) * p^2 = Nnodes * (Nnodes - 1) / 2 * p^2


where:

p - failure probability for a single server,

C(Nnodes, 2) - the number of ways to choose any 2 of the Nnodes nodes [https://en.wikipedia.org/wiki/Combination]


Whereas the probability of losing a key range for a non-vnode cluster is:

P2 = 2*Nnodes*p^2


So, the "good old" non-vnodes cluster is more reliable than the "new-style" vnodes cluster.

Correct?


I'd like to get similar results for more realistic cases, and I'll be back here once I get them (hoping I do...).
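
For illustration, here is a quick back-of-the-envelope script (my own sketch; the per-server failure probability p=0.01 is just an assumed number) that plugs values into the two formulas above:

from math import comb

n_nodes = 50   # physical nodes in the cluster
p = 0.01       # assumed per-server failure probability

# P1: vnodes=256 with SimpleSnitch, where effectively every pair of nodes
# shares at least one range, so any 2 simultaneous failures lose QUORUM somewhere.
p1 = comb(n_nodes, 2) * p ** 2      # C(50, 2) * p^2 = 1225 * 0.0001

# P2: single-token ring, where only the RF-1 neighbours on either side of a
# node share its ranges, so roughly 2 * Nnodes "dangerous" pairs exist.
p2 = 2 * n_nodes * p ** 2           # 100 * 0.0001

print(f"P1 (vnodes=256): {p1:.4f}")   # ~0.1225
print(f"P2 (no vnodes):  {p2:.4f}")   # ~0.0100

With these made-up numbers the vnodes cluster comes out roughly 12x more likely to have some key range unavailable at QUORUM.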


Regards,

Kyrill








________________________________
From: kurt greaves <ku...@instaclustr.com>
Sent: Wednesday, January 17, 2018 5:43:06 AM
To: User
Subject: Re: vnodes: high availability

Even with a low amount of vnodes you're asking for a bad time. Even if you managed to get down to 2 vnodes per node, you're still likely to include double the amount of nodes in any streaming/repair operation which will likely be very problematic for incremental repairs, and you still won't be able to easily reason about which nodes are responsible for which token ranges. It's still quite likely that a loss of 2 nodes would mean some portion of the ring is down (at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens if you can; a lot of work still needs to be done to ensure smooth operation of C* while using vnodes, and they are much more difficult to reason about (which is probably the reason no one has bothered to do the math). If you're really keen on the math your best bet is to do it yourself, because it's not a point of interest for many C* devs plus probably a lot of us wouldn't remember enough math to know how to approach it.

If you want to get out of this situation you'll need to do a DC migration to a new DC with a better configuration of snitch/replication strategy/racks/tokens.


On 16 January 2018 at 21:54, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 3.0.

Based on your info and the comments in the ticket, it's really a bad idea to have a small number of vnodes for versions using the old allocation method because of hot-spots, so it's not an option for my particular case (v2.1) :(

[As far as I can see from the source code this new method wasn't backported to 2.1.]



Regards,

Kyrill



________________________________
From: Jon Haddad <jo...@gmail.com>> on behalf of Jon Haddad <jo...@jonhaddad.com>>
Sent: Tuesday, January 16, 2018 8:21:33 PM

To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  There’s going to be some imbalance, the amount of imbalance depends on luck, unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the ML know your experience when you’ve done it?

Jon

On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've never met :)
We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax client used.

Below in the thread Alex mentioned that it's recommended to set vnodes to a value lower than 256 only for C* versions > 3.0 (the token allocation algorithm was improved in C* 3.0).

Jon,
Do you have positive experience setting up a cluster with vnodes < 256 for C* 2.1?

vnodes=32 is also too high, in my opinion (we would need many more than 32 servers per AZ in order to get a "reliable" cluster).
vnodes=4 seems better from an HA + balancing trade-off perspective.

Thanks,
Kyrill
________________________________
From: Jon Haddad <jo...@gmail.com>> on behalf of Jon Haddad <jo...@jonhaddad.com>>
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we recommend 32.  I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.
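
For reference, a minimal sketch of what that setup involves (the snitch choice, keyspace name, DC name, and contact point below are illustrative assumptions, not your cluster's actual values): set an AZ-aware snitch in cassandra.yaml, then define the keyspace with NetworkTopologyStrategy, e.g. from the Python driver:

# Prerequisite on every node (cassandra.yaml): an AZ-aware snitch, e.g.
#   endpoint_snitch: Ec2Snitch        # reports each AZ as a rack
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])       # hypothetical contact point
session = cluster.connect()

# With RF=3 spread over 3 racks (AZs), NetworkTopologyStrategy places one
# replica per AZ, so a whole AZ can fail and QUORUM (2 of 3) is still served.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
""")
cluster.shutdown()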



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's declared'.
More nodes in a cluster means higher probability of simultaneous node failures.
And from high-availability standpoint, looks like situation is made even worse by recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but right now the situation doesn't seem promising.
In case somebody from the C* developers/architects is reading this message, I'd be grateful for some links to the calculations of C* reliability on which these decisions were based.

Regards,
Kyrill
________________________________
From: kurt greaves <ku...@instaclustr.com>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: If on each node, any of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.
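
If you want to check that empirically rather than prove it, a rough Monte Carlo sketch (assuming purely random token assignment and SimpleStrategy-style placement on the next RF-1 distinct nodes) goes something like:

import random
from itertools import combinations

N_NODES, VNODES, RF = 50, 256, 3

# Build the ring: VNODES random tokens per node, sorted by token.
ring = sorted((random.random(), node)
              for node in range(N_NODES) for _ in range(VNODES))
owners = [node for _, node in ring]

# SimpleStrategy-style placement: each range lives on the owning node plus
# the next RF-1 *distinct* nodes encountered walking the ring.
replica_sets = []
for i in range(len(owners)):
    replicas, j = [], i
    while len(replicas) < RF:
        node = owners[j % len(owners)]
        if node not in replicas:
            replicas.append(node)
        j += 1
    replica_sets.append(frozenset(replicas))

# A pair of down nodes loses QUORUM if some range has 2 of its 3 replicas
# within that pair.
bad = sum(any(len(rs & {a, b}) >= 2 for rs in replica_sets)
          for a, b in combinations(range(N_NODES), 2))
total = N_NODES * (N_NODES - 1) // 2
print(f"{bad} of {total} node pairs lose QUORUM for at least one range")

For 50 nodes with 256 vnodes each this reports essentially every pair of nodes, which matches the intuition above.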

On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>> wrote:
Thanks Alexander!

I don't have an MS in math either, unfortunately.

Not sure, but it seems to me that the 2/49 probability in your explanation doesn't take into account that vnode endpoints are almost evenly distributed across all nodes (at least that's what I can see from the "nodetool ring" output).

http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
Of course this vnodes illustration is a theoretical one, but there are no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).

That's because vnodes_per_node=8 > Nnodes=6.
As far as I understand, the situation gets worse as the vnodes_per_node/Nnodes ratio increases.
Please, correct me if I'm wrong.

How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node?

Regards,
Kyrill

________________________________
From: Alexander Dejanovski <al...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

I was corrected off list that the odds of losing data when 2 nodes are down don't depend on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.
I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <al...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)

How many queries that will affect is a different question, as it depends on which partitions currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>> wrote:
Thanks, Rahul.
But in your example, the simultaneous loss of Node3 and Node6 leads to the loss of ranges N, C, J at consistency level QUORUM.

As far as I understand, in the case where vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)
2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in the case of SimpleSnitch,

we get all physical nodes (servers) having mutually adjacent token ranges.
Is it correct?

At least in the case of my real-world ~50-node cluster with vnodes=256 and RF=3, this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned a number equal to Nnodes - 1, which means that I can't switch off any 2 nodes at the same time without losing some key range at CL=QUORUM.
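
For what it's worth, here is a rough Python equivalent of that pipeline (my own sketch: the target address is hypothetical and the "nodetool ring" column layout may vary between versions):

import subprocess

TARGET = "10.0.1.17"   # hypothetical address of the node being checked
RF = 3

out = subprocess.run(["nodetool", "ring"], capture_output=True, text=True)
# Keep only data rows (those starting with an IP address), in token order.
owners = [line.split()[0] for line in out.stdout.splitlines()
          if line and line[0].isdigit()]

# Collect every distinct node that appears within RF-1 positions of TARGET
# anywhere on the ring: a rough approximation of the nodes that replicate
# some of TARGET's ranges.
neighbours = set()
for i, addr in enumerate(owners):
    if addr == TARGET:
        for off in range(-(RF - 1), RF):
            other = owners[(i + off) % len(owners)]
            if other != TARGET:
                neighbours.add(other)

print(f"{len(neighbours)} distinct nodes share a range with {TARGET}")
# If this prints Nnodes-1 (49 here), no second node can be taken down
# together with TARGET without some range dropping below QUORUM.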

Thanks,
Kyrill
________________________________
From: Rahul Neelakantan <ra...@rahul.be>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that the loss of Node 3 and Node 6 will still not have any effect on Token Range A. But yes, if you lose two nodes that both have Token Range A assigned to them (say Node 1 and Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
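
The same idea with the DataStax Python driver would look roughly like this (an illustrative sketch only; the contact point, keyspace, and table names are made up):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DowngradingConsistencyRetryPolicy
from cassandra.query import SimpleStatement

# The policy retries a failed request at progressively lower consistency
# levels, so a query sent at QUORUM may be served at ONE during an outage.
cluster = Cluster(["10.0.0.1"],
                  default_retry_policy=DowngradingConsistencyRetryPolicy())
session = cluster.connect("my_ks")     # hypothetical keyspace

stmt = SimpleStatement("SELECT * FROM my_table WHERE id = %s",
                       consistency_level=ConsistencyLevel.QUORUM)
rows = session.execute(stmt, (42,))    # may be downgraded below QUORUM
cluster.shutdown()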

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:
Hi,

Let's say we have a C* cluster with following parameters:
 - 50 nodes in the cluster
 - RF=3
 - vnodes=256 per node
 - CL for some queries = QUORUM
 - endpoint_snitch = SimpleSnitch

Is it correct that any 2 nodes being down will cause unavailability of some key range at CL=QUORUM?

Regards,
Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com



Re: vnodes: high availability

Posted by Hannu Kröger <hk...@gmail.com>.
If this is a universal recommendation, then should that actually be the default in Cassandra?

Hannu

> On 18 Jan 2018, at 00:49, Jon Haddad <jo...@jonhaddad.com> wrote:
> 
> I *strongly* recommend disabling dynamic snitch.  I’ve seen it make latency jump 10x.  
> 
> dynamic_snitch: false is your friend.
> 
> 
> 
>> On Jan 17, 2018, at 2:00 PM, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com> wrote:
>> 
>> Avi, 
>> If we prefer to have better balancing [like absence of hotspots during a node down event etc], large number of vnodes is a good solution.
>> Personally, I wouldn't prefer any balancing over overall resiliency  (and in case of non-optimal setup, larger number of nodes in a cluster decreases overall resiliency, as far as I understand.) 
>> 
>> Talking about hotspots, there is a number of features helping to mitigate the issue, for example:
>>   - dynamic snitch [if a node overloaded it won't be queried]
>>   - throttling of streaming operations
>> 
>> Thanks, 
>> Kyrill
>> 
>> From: Avi Kivity <avi@scylladb.com>
>> Sent: Wednesday, January 17, 2018 2:50 PM
>> To: user@cassandra.apache.org; kurt greaves
>> Subject: Re: vnodes: high availability
>>  
>> On the flip side, a large number of vnodes is also beneficial. For example, if you add a node to a 20-node cluster with many vnodes, each existing node will contribute 5% of the data towards the new node, and all nodes will participate in streaming (meaning the impact on any single node will be limited, and completion time will be faster).
>> 
>> With a low number of vnodes, only a few nodes participate in streaming, which means that the cluster is left unbalanced and the impact on each streaming node is greater (or that completion time is slower).
>> 
>> Similarly, with a high number of vnodes, if a node is down its work is distributed equally among all nodes. With a low number of vnodes the cluster becomes unbalanced.
>> 
>> Overall I recommend high vnode count, and to limit the impact of failures in other ways (smaller number of large nodes vs. larger number of small nodes).
>> 
>> btw, rack-aware topology improves the multi-failure problem but at the cost of causing imbalance during maintenance operations. I recommend using rack-aware topology only if you really have racks with single-points-of-failure, not for other reasons.


Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
What's the reason behind this negative effect when dynamic_snitch is enabled?

Is this true for all C* versions in which this feature is implemented?

Is it because node latencies change too dynamically/sporadically, while the dynamic_snitch scores are tuned more slowly "than required" and can't keep up with these changes?

If dynamic_snitch is disabled, what algorithm is used to determine which replica should be read (for data read requests, not digest requests)?


Regards,

Kyrill

________________________________
From: Jon Haddad <jo...@gmail.com> on behalf of Jon Haddad <jo...@jonhaddad.com>
Sent: Thursday, January 18, 2018 12:49:02 AM
To: user
Subject: Re: vnodes: high availability

I *strongly* recommend disabling dynamic snitch.  I’ve seen it make latency jump 10x.

dynamic_snitch: false is your friend.





Re: vnodes: high availability

Posted by Jon Haddad <jo...@jonhaddad.com>.
I *strongly* recommend disabling dynamic snitch.  I’ve seen it make latency jump 10x.  

dynamic_snitch: false is your friend.



> On Jan 17, 2018, at 2:00 PM, Kyrylo Lebediev <Ky...@epam.com> wrote:
> 
> Avi, 
> If we prefer to have better balancing [like absence of hotspots during a node down event etc], large number of vnodes is a good solution.
> Personally, I wouldn't prefer any balancing over overall resiliency  (and in case of non-optimal setup, larger number of nodes in a cluster decreases overall resiliency, as far as I understand.) 
> 
> Talking about hotspots, there is a number of features helping to mitigate the issue, for example:
>   - dynamic snitch [if a node overloaded it won't be queried]
>   - throttling of streaming operations
> 
> Thanks, 
> Kyrill
> 
> From: Avi Kivity <av...@scylladb.com>
> Sent: Wednesday, January 17, 2018 2:50 PM
> To: user@cassandra.apache.org; kurt greaves
> Subject: Re: vnodes: high availability
>  
> On the flip side, a large number of vnodes is also beneficial. For example, if you add a node to a 20-node cluster with many vnodes, each existing node will contribute 5% of the data towards the new node, and all nodes will participate in streaming (meaning the impact on any single node will be limited, and completion time will be faster).
> 
> With a low number of vnodes, only a few nodes participate in streaming, which means that the cluster is left unbalanced and the impact on each streaming node is greater (or that completion time is slower).
> 
> Similarly, with a high number of vnodes, if a node is down its work is distributed equally among all nodes. With a low number of vnodes the cluster becomes unbalanced.
> 
> Overall I recommend high vnode count, and to limit the impact of failures in other ways (smaller number of large nodes vs. larger number of small nodes).
> 
> btw, rack-aware topology improves the multi-failure problem but at the cost of causing imbalance during maintenance operations. I recommend using rack-aware topology only if you really have racks with single-points-of-failure, not for other reasons.
> 
> On 01/17/2018 05:43 AM, kurt greaves wrote:
>> Even with a low amount of vnodes you're asking for a bad time. Even if you managed to get down to 2 vnodes per node, you're still likely to include double the amount of nodes in any streaming/repair operation which will likely be very problematic for incremental repairs, and you still won't be able to easily reason about which nodes are responsible for which token ranges. It's still quite likely that a loss of 2 nodes would mean some portion of the ring is down (at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens if you can; a lot of work still needs to be done to ensure smooth operation of C* while using vnodes, and they are much more difficult to reason about (which is probably the reason no one has bothered to do the math). If you're really keen on the math your best bet is to do it yourself, because it's not a point of interest for many C* devs plus probably a lot of us wouldn't remember enough math to know how to approach it.
>> 
>> If you want to get out of this situation you'll need to do a DC migration to a new DC with a better configuration of snitch/replication strategy/racks/tokens.
>> 
>> 
>> On 16 January 2018 at 21:54, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>> Thank you for this valuable info, Jon.
>> I guess both you and Alex are referring to improved vnodes allocation method  https://issues.apache.org/jira/browse/CASSANDRA-7032 <https://issues.apache.org/jira/browse/CASSANDRA-7032> which was implemented in 3.0.
>> Based on your info and comments in the ticket it's really a bad idea to have small number of vnodes for the versions using old allocation method because of hot-spots, so it's not an option for my particular case (v.2.1) :( 
>> 
>> [As far as I can see from the source code this new method wasn't backported to 2.1.]
>> 
>> 
>> Regards, 
>> Kyrill
>> [CASSANDRA-7032] Improve vnode allocation - ASF JIRA <https://issues.apache.org/jira/browse/CASSANDRA-7032>
>> issues.apache.org <http://issues.apache.org/>
>> It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic ...
>> 
>> From: Jon Haddad <jonathan.haddad@gmail.com <ma...@gmail.com>> on behalf of Jon Haddad <jon@jonhaddad.com <ma...@jonhaddad.com>>
>> Sent: Tuesday, January 16, 2018 8:21:33 PM
>> 
>> To: user@cassandra.apache.org <ma...@cassandra.apache.org>
>> Subject: Re: vnodes: high availability
>>  
>> We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  There’s going to be some imbalance, the amount of imbalance depends on luck, unfortunately.
>> 
>> I’m interested to hear your results using 4 tokens, would you mind letting the ML know your experience when you’ve done it?
>> 
>> Jon
>> 
>>> On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>>> 
>>> Agree with you, Jon.
>>> Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've never met :)
>>> We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax client used.
>>> 
>>> Below in the thread Alex mentioned that it's recommended to set vnodes to a value lower than 256 only for C* version > 3.0 (token allocation algorithm was improved since C* 3.0) .
>>> 
>>> Jon,  
>>> Do you have positive experience setting up  cluster with vnodes < 256 for  C* 2.1? 
>>> 
>>> vnodes=32 also too high, as for me (we need to have much more than 32 servers per AZ in order to to get 'reliable' cluster)
>>> vnodes=4 seems to be better from HA + balancing trade-off
>>> 
>>> Thanks, 
>>> Kyrill
>>> From: Jon Haddad <jonathan.haddad@gmail.com <ma...@gmail.com>> on behalf of Jon Haddad <jon@jonhaddad.com <ma...@jonhaddad.com>>
>>> Sent: Tuesday, January 16, 2018 6:44:53 PM
>>> To: user
>>> Subject: Re: vnodes: high availability
>>>  
>>> While all the token math is helpful, I have to also call out the elephant in the room:
>>> 
>>> You have not correctly configured Cassandra for production.
>>> 
>>> If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE. 
>>> 
>>> You are correct about 256 tokens causing issues, it’s one of the reasons why we recommend 32.  I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.
>>> 
>>> 
>>> 
>>>> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>>>> 
>>>> ...to me it sounds like 'C* isn't that highly-available by design as it's declared'.
>>>> More nodes in a cluster means higher probability of simultaneous node failures.
>>>> And from high-availability standpoint, looks like situation is made even worse by recommended setting vnodes=256.
>>>> 
>>>> Need to do some math to get numbers/formulas, but now situation doesn't seem to be promising.
>>>> In case smb from C* developers/architects is reading this message, I'd be grateful to get some links to calculations of C* reliability based on which decisions were made.  
>>>> 
>>>> Regards, 
>>>> Kyrill
>>>> From: kurt greaves <kurt@instaclustr.com <ma...@instaclustr.com>>
>>>> Sent: Tuesday, January 16, 2018 2:16:34 AM
>>>> To: User
>>>> Subject: Re: vnodes: high availability
>>>>  
>>>> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>>>> 
>>>> If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: If on each node, any of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A those token ranges will be down at QUORUM. 
>>>> 
>>>> Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.
>>>> 
>>>> On 15 January 2018 at 19:59, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>>>> Thanks Alexander!
>>>> 
>>>> I'm not a MS in math too) Unfortunately.
>>>> 
>>>> Not sure, but it seems to me that probability of 2/49 in your explanation doesn't take into account that vnodes endpoints are almost evenly distributed across all nodes (al least it's what I can see from "nodetool ring" output).
>>>> 
>>>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html <http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html>
>>>> of course this vnodes illustration is a theoretical one, but there no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM). 
>>>> 
>>>> That's because vnodes_per_node=8 > Nnodes=6.
>>>> As far as I understand, situation is getting worse with increase of vnodes_per_node/Nnode ratio.
>>>> Please, correct me if I'm wrong.
>>>> 
>>>> How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node? 
>>>> 
>>>> Regards, 
>>>> Kyrill
>>>> 
>>>> From: Alexander Dejanovski <alex@thelastpickle.com <ma...@thelastpickle.com>>
>>>> Sent: Monday, January 15, 2018 8:14:21 PM
>>>> 
>>>> To: user@cassandra.apache.org <ma...@cassandra.apache.org>
>>>> Subject: Re: vnodes: high availability
>>>>  
>>>> I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
>>>> The more vnodes, the smaller the chunks of data you may lose, and vice versa.
>>>> I officially suck at statistics, as expected :)
>>>> 
>>>> Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <alex@thelastpickle.com <ma...@thelastpickle.com>> a écrit :
>>>> Hi Kyrylo,
>>>> 
>>>> the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
>>>> If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.
>>>> 
>>>> I kinda suck at maths but I'm going to risk making a fool of myself :)
>>>> 
>>>> The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
>>>> Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
>>>> Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.
>>>> 
>>>> Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
>>>> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)
>>>> 
>>>> How many queries this will affect is a different question, as it depends on which partitions currently exist and are queried in the unavailable token ranges.
>>>> 
>>>> Then you have rack awareness that comes with NetworkTopologyStrategy : 
>>>> If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
>>>> In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
>>>> If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).
>>>> 
>>>> That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.
>>>> 
>>>> Feel free to correct my numbers if I'm wrong.
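Taking up that invitation, and keeping the same simplifying assumption (each of a node's 256 ranges independently lands its 2 extra replicas among the other 49 nodes): 256 * (2/49) is really the expected number of shared ranges for a pair of nodes rather than a probability, and the chance that a given pair shares at least one range comes out at essentially 1:

nodes, vnodes, rf = 50, 256, 3
p_per_range = (rf - 1) / (nodes - 1)               # ~2/49: one range of node A is replicated on node B
expected_shared = vnodes * p_per_range             # ~10.4 shared ranges on average, not 10.4%
p_at_least_one = 1 - (1 - p_per_range) ** vnodes   # chance A and B share any range at all
print(f"expected shared ranges per pair: {expected_shared:.1f}")
print(f"P(pair shares >= 1 range):      {p_at_least_one:.5f}")   # ~0.99998

So under these assumptions, with 256 vnodes almost every pair of nodes is a "dangerous" pair, which lines up with the nodetool observation further down in this thread.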
>>>> 
>>>> Cheers,
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>>>> Thanks, Rahul.
>>>> But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.
>>>> 
>>>> As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:
>>>> 
>>>> 1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)
>>>> 2) there are a lot of vnodes on each node
>>>> 3) ranges are evenly distributed between vnodes in case of SimpleSnitch,
>>>> 
>>>> we get all physical nodes (servers) having mutually adjacent token ranges.
>>>> Is it correct?
>>>> 
>>>> At least in case of my real-world ~50-nodes cluster with vnodes=256, RF=3 for this command:
>>>> 
>>>> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l
>>>> 
>>>> returned a number equal to Nnodes - 1, which means that I can't switch off 2 nodes at the same time w/o losing some keyrange for CL=QUORUM.
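For reference, here is a rough Python rendering of the same check. It assumes the `nodetool ring` output format the pipeline above relies on (one line per token, with the owning node's address in the first column), and like the grep -B2/-A2 trick it only approximates "shares a replica set":

import subprocess
import sys

def ring_neighbours(target_ip, ip_prefix, rf=3):
    """Count distinct nodes that sit within rf-1 positions of target_ip
    anywhere on the ring, i.e. nodes likely to co-own a replica set with it."""
    out = subprocess.run(["nodetool", "ring"], capture_output=True, text=True).stdout
    owners = []
    for line in out.splitlines():
        if line.startswith(ip_prefix):
            addr = line.split()[0]
            if not owners or owners[-1] != addr:      # same effect as `uniq`
                owners.append(addr)
    neighbours = set()
    for i, addr in enumerate(owners):
        if addr == target_ip:
            for off in range(-(rf - 1), rf):          # the rf-1 owners on each side
                neighbours.add(owners[(i + off) % len(owners)])
    neighbours.discard(target_ip)
    return neighbours

if __name__ == "__main__":
    peers = ring_neighbours(sys.argv[1], sys.argv[2])
    print(f"{len(peers)} nodes share at least one replica set with {sys.argv[1]}")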
>>>> 
>>>> Thanks,
>>>> Kyrill
>>>> From: Rahul Neelakantan <rahul@rahul.be <ma...@rahul.be>>
>>>> Sent: Monday, January 15, 2018 5:20:20 PM
>>>> To: user@cassandra.apache.org <ma...@cassandra.apache.org>
>>>> Subject: Re: vnodes: high availability
>>>>  
>>>> Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram 
>>>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html <http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html>
>>>> 
>>>> In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.
>>>> 
>>>> You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode
>>>> 
>>>> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html <http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html>
>>>> 
>>>> - Rahul
>>>> 
>>>> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>>>> Hi, 
>>>> 
>>>> Let's say we have a C* cluster with following parameters:
>>>>  - 50 nodes in the cluster
>>>>  - RF=3 
>>>>  - vnodes=256 per node
>>>>  - CL for some queries = QUORUM
>>>>  - endpoint_snitch = SimpleSnitch
>>>> 
>>>> Is it correct that any 2 nodes down will cause unavailability of a keyrange at CL=QUORUM?
>>>> 
>>>> Regards, 
>>>> Kyrill
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> -----------------
>>>> Alexander Dejanovski
>>>> France
>>>> @alexanderdeja
>>>> 
>>>> Consultant
>>>> Apache Cassandra Consulting
>>>> http://www.thelastpickle.com <http://www.thelastpickle.com/>
>>>> -- 
>>>> -----------------
>>>> Alexander Dejanovski
>>>> France
>>>> @alexanderdeja
>>>> 
>>>> Consultant
>>>> Apache Cassandra Consulting
>>>> http://www.thelastpickle.com <http://www.thelastpickle.com/>

Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
Avi,

If we prefer to have better balancing [like the absence of hotspots during a node-down event, etc.], a large number of vnodes is a good solution.

Personally, I wouldn't prefer any balancing over overall resiliency (and in the case of a non-optimal setup, a larger number of nodes in a cluster decreases overall resiliency, as far as I understand).


Talking about hotspots, there are a number of features that help mitigate the issue, for example:

  - dynamic snitch [if a node is overloaded it won't be queried]

  - throttling of streaming operations

Thanks,
Kyrill

________________________________
From: Avi Kivity <av...@scylladb.com>
Sent: Wednesday, January 17, 2018 2:50 PM
To: user@cassandra.apache.org; kurt greaves
Subject: Re: vnodes: high availability


On the flip side, a large number of vnodes is also beneficial. For example, if you add a node to a 20-node cluster with many vnodes, each existing node will contribute 5% of the data towards the new node, and all nodes will participate in streaming (meaning the impact on any single node will be limited, and completion time will be faster).


With a low number of vnodes, only a few nodes participate in streaming, which means that the cluster is left unbalanced and the impact on each streaming node is greater (or that completion time is slower).


Similarly, with a high number of vnodes, if a node is down its work is distributed equally among all nodes. With a low number of vnodes the cluster becomes unbalanced.


Overall I recommend high vnode count, and to limit the impact of failures in other ways (smaller number of large nodes vs. larger number of small nodes).


btw, rack-aware topology improves the multi-failure problem but at the cost of causing imbalance during maintenance operations. I recommend using rack-aware topology only if you really have racks with single-points-of-failure, not for other reasons.

On 01/17/2018 05:43 AM, kurt greaves wrote:
Even with a low amount of vnodes you're asking for a bad time. Even if you managed to get down to 2 vnodes per node, you're still likely to include double the amount of nodes in any streaming/repair operation which will likely be very problematic for incremental repairs, and you still won't be able to easily reason about which nodes are responsible for which token ranges. It's still quite likely that a loss of 2 nodes would mean some portion of the ring is down (at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens if you can; a lot of work still needs to be done to ensure smooth operation of C* while using vnodes, and they are much more difficult to reason about (which is probably the reason no one has bothered to do the math). If you're really keen on the math your best bet is to do it yourself, because it's not a point of interest for many C* devs plus probably a lot of us wouldn't remember enough math to know how to approach it.

If you want to get out of this situation you'll need to do a DC migration to a new DC with a better configuration of snitch/replication strategy/racks/tokens.


On 16 January 2018 at 21:54, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 3.0.

Based on your info and comments in the ticket it's really a bad idea to have small number of vnodes for the versions using old allocation method because of hot-spots, so it's not an option for my particular case (v.2.1) :(

[As far as I can see from the source code this new method wasn't backported to 2.1.]



Regards,

Kyrill



________________________________
From: Jon Haddad <jo...@gmail.com>> on behalf of Jon Haddad <jo...@jonhaddad.com>>
Sent: Tuesday, January 16, 2018 8:21:33 PM

To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  There’s going to be some imbalance, the amount of imbalance depends on luck, unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the ML know your experience when you’ve done it?

Jon

On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've never met :)
We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax client used.

Below in the thread Alex mentioned that it's recommended to set vnodes to a value lower than 256 only for C* version > 3.0 (token allocation algorithm was improved since C* 3.0) .

Jon,
Do you have positive experience setting up  cluster with vnodes < 256 for  C* 2.1?

vnodes=32 is also too high for me (we would need many more than 32 servers per AZ in order to get a 'reliable' cluster)
vnodes=4 seems to be better from the HA + balancing trade-off standpoint

Thanks,
Kyrill
________________________________
From: Jon Haddad <jo...@gmail.com>> on behalf of Jon Haddad <jo...@jonhaddad.com>>
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we recommend 32.  I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

...to me it sounds like 'C* isn't as highly available by design as it's declared to be'.
More nodes in a cluster means a higher probability of simultaneous node failures.
And from a high-availability standpoint, it looks like the situation is made even worse by the recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but for now the situation doesn't seem promising.
In case somebody from the C* developers/architects is reading this message, I'd be grateful for links to the calculations of C* reliability on which these decisions were based.

Regards,
Kyrill
________________________________
From: kurt greaves <ku...@instaclustr.com>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: If on each node, any of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>> wrote:
Thanks Alexander!

I'm not a MS in math too) Unfortunately.

Not sure, but it seems to me that the probability of 2/49 in your explanation doesn't take into account that vnodes endpoints are almost evenly distributed across all nodes (at least it's what I can see from "nodetool ring" output).

http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).

That's because vnodes_per_node=8 > Nnodes=6.
As far as I understand, situation is getting worse with increase of vnodes_per_node/Nnode ratio.
Please, correct me if I'm wrong.

How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node?

Regards,
Kyrill

________________________________
From: Alexander Dejanovski <al...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.
I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <al...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)

How many queries this will affect is a different question, as it depends on which partitions currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>> wrote:
Thanks, Rahul.
But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.

As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)
2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,

we get all physical nodes (servers) having mutually adjacent token ranges.
Is it correct?

At least in case of my real-world ~50-nodes cluster with vnodes=256, RF=3 for this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned a number equal to Nnodes - 1, which means that I can't switch off 2 nodes at the same time w/o losing some keyrange for CL=QUORUM.

Thanks,
Kyrill
________________________________
From: Rahul Neelakantan <ra...@rahul.be>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:
Hi,

Let's say we have a C* cluster with following parameters:
 - 50 nodes in the cluster
 - RF=3
 - vnodes=256 per node
 - CL for some queries = QUORUM
 - endpoint_snitch = SimpleSnitch

Is it correct that any 2 nodes down will cause unavailability of a keyrange at CL=QUORUM?

Regards,
Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>




Re: vnodes: high availability

Posted by Avi Kivity <av...@scylladb.com>.
On the flip side, a large number of vnodes is also beneficial. For 
example, if you add a node to a 20-node cluster with many vnodes, each 
existing node will contribute 5% of the data towards the new node, and 
all nodes will participate in streaming (meaning the impact on any 
single node will be limited, and completion time will be faster).


With a low number of vnodes, only a few nodes participate in streaming, 
which means that the cluster is left unbalanced and the impact on each 
streaming node is greater (or that completion time is slower).
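As a back-of-the-envelope illustration of those two cases (assuming perfectly even ownership and, for the single-token case, that only roughly RF neighbouring nodes donate data to the new node):

def streaming_share(existing_nodes, donors):
    """Fraction of its *own* data each donor streams when one node joins:
    the new node will own ~1/(N+1) of the total, drawn evenly from `donors`
    existing nodes, each of which currently holds ~1/N of the total."""
    n = existing_nodes
    new_node_share = 1 / (n + 1)
    per_donor_of_total = new_node_share / donors
    return per_donor_of_total / (1 / n)

n = 20
print(f"many vnodes : {n} donors, each streams ~{streaming_share(n, donors=n):.1%} of its data")
print(f"single token: 3 donors, each streams ~{streaming_share(n, donors=3):.1%} of its data")
# prints ~4.8% per node vs ~31.7% per node for this 20-node example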


Similarly, with a high number of vnodes, if a node is down its work is 
distributed equally among all nodes. With a low number of vnodes the 
cluster becomes unbalanced.


Overall I recommend high vnode count, and to limit the impact of 
failures in other ways (smaller number of large nodes vs. larger number 
of small nodes).


btw, rack-aware topology improves the multi-failure problem but at the 
cost of causing imbalance during maintenance operations. I recommend 
using rack-aware topology only if you really have racks with 
single-points-of-failure, not for other reasons.
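To make the rack point concrete, here is a toy model of rack-aware placement, one replica per rack with racks == RF (a simplification of what NetworkTopologyStrategy actually does): losing an entire rack never breaks QUORUM, while two nodes from different racks still can.

import random

def rack_aware_sets(nodes_per_rack=17, racks=3, vnodes=256, rf=3, seed=0):
    """Toy placement: walk the ring from each token and pick the next rf
    owners that all sit in different racks (one replica per rack)."""
    rng = random.Random(seed)
    ring = sorted((rng.getrandbits(63), (rack, i))
                  for rack in range(racks)
                  for i in range(nodes_per_rack)
                  for _ in range(vnodes))
    owners = [node for _, node in ring]
    sets = []
    for i in range(len(owners)):
        replicas, seen_racks, j = [], set(), i
        while len(replicas) < rf:
            rack, _ = node = owners[j % len(owners)]
            if rack not in seen_racks:
                replicas.append(node)
                seen_racks.add(rack)
            j += 1
        sets.append(frozenset(replicas))
    return sets

sets = rack_aware_sets()
rack0 = {node for s in sets for node in s if node[0] == 0}
# Every replica set loses exactly one member when rack 0 goes away, so
# QUORUM (2 of 3) survives everywhere:
print(all(len(s - rack0) >= 2 for s in sets))        # True
# Two nodes from different racks that share a replica set still break QUORUM:
a, b = sorted(next(iter(sets)))[:2]
print(any(len(s - {a, b}) < 2 for s in sets))        # True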


On 01/17/2018 05:43 AM, kurt greaves wrote:
> Even with a low amount of vnodes you're asking for a bad time. Even if 
> you managed to get down to 2 vnodes per node, you're still likely to 
> include double the amount of nodes in any streaming/repair operation 
> which will likely be very problematic for incremental repairs, and you 
> still won't be able to easily reason about which nodes are responsible 
> for which token ranges. It's still quite likely that a loss of 2 nodes 
> would mean some portion of the ring is down (at QUORUM). At the moment 
> I'd say steer clear of vnodes and use single tokens if you can; a lot 
> of work still needs to be done to ensure smooth operation of C* while 
> using vnodes, and they are much more difficult to reason about (which 
> is probably the reason no one has bothered to do the math). If you're 
> really keen on the math your best bet is to do it yourself, because 
> it's not a point of interest for many C* devs plus probably a lot of 
> us wouldn't remember enough math to know how to approach it.
>
> If you want to get out of this situation you'll need to do a DC 
> migration to a new DC with a better configuration of 
> snitch/replication strategy/racks/tokens.
>
>
> On 16 January 2018 at 21:54, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com 
> <ma...@epam.com>> wrote:
>
>     Thank you for this valuable info, Jon.
>     I guess both you and Alex are referring to improved vnodes
>     allocation method
>     https://issues.apache.org/jira/browse/CASSANDRA-7032
>     <https://issues.apache.org/jira/browse/CASSANDRA-7032> which was
>     implemented in 3.0.
>
>     Based on your info and comments in the ticket it's really a bad
>     idea to have small number of vnodes for the versions using old
>     allocation method because of hot-spots, so it's not an option for
>     my particular case (v.2.1) :(
>
>     [As far as I can see from the source code this new method
>     wasn't backported to 2.1.]
>
>
>
>     Regards,
>
>     Kyrill
>
>
>
>     ------------------------------------------------------------------------
>     *From:* Jon Haddad <jonathan.haddad@gmail.com
>     <ma...@gmail.com>> on behalf of Jon Haddad
>     <jon@jonhaddad.com <ma...@jonhaddad.com>>
>     *Sent:* Tuesday, January 16, 2018 8:21:33 PM
>
>     *To:* user@cassandra.apache.org <ma...@cassandra.apache.org>
>     *Subject:* Re: vnodes: high availability
>     We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the
>     randomness.  There’s going to be some imbalance, the amount of
>     imbalance depends on luck, unfortunately.
>
>     I’m interested to hear your results using 4 tokens, would you mind
>     letting the ML know your experience when you’ve done it?
>
>     Jon
>
>>     On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev
>>     <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>>
>>     Agree with you, Jon.
>>     Actually, this cluster was configured by my 'predecessor' and
>>     [fortunately for him] we've never met :)
>>     We're using version 2.1.15 and can't upgrade because of legacy
>>     Netflix Astyanax client used.
>>
>>     Below in the thread Alex mentioned that it's recommended to set
>>     vnodes to a value lower than 256 only for C* version > 3.0 (token
>>     allocation algorithm was improved since C* 3.0) .
>>
>>     Jon,
>>     Do you have positive experience setting up cluster with vnodes <
>>     256 for  C* 2.1?
>>
>>     vnodes=32 is also too high for me (we would need many more
>>     than 32 servers per AZ in order to get a 'reliable' cluster)
>>     vnodes=4 seems to be better from the HA + balancing trade-off standpoint
>>
>>     Thanks,
>>     Kyrill
>>     ------------------------------------------------------------------------
>>     *From:*Jon Haddad <jonathan.haddad@gmail.com
>>     <ma...@gmail.com>> on behalf of Jon Haddad
>>     <jon@jonhaddad.com <ma...@jonhaddad.com>>
>>     *Sent:*Tuesday, January 16, 2018 6:44:53 PM
>>     *To:*user
>>     *Subject:*Re: vnodes: high availability
>>     While all the token math is helpful, I have to also call out the
>>     elephant in the room:
>>
>>     You have not correctly configured Cassandra for production.
>>
>>     If you had used the correct endpoint snitch & network topology
>>     strategy, you would be able to withstand the complete failure of
>>     an entire availability zone at QUORUM, or two if you queried at
>>     CL=ONE.
>>
>>     You are correct about 256 tokens causing issues, it’s one of the
>>     reasons why we recommend 32.  I’m curious how things behave going
>>     as low as 4, personally, but I haven’t done the math / tested it yet.
>>
>>
>>
>>>     On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev
>>>     <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>>>
>>>     ...to me it sounds like 'C* isn't that highly-available by
>>>     design as it's declared'.
>>>     More nodes in a cluster means higher probability of simultaneous
>>>     node failures.
>>>     And from high-availability standpoint, looks like situation is
>>>     made even worse by the recommended setting vnodes=256.
>>>
>>>     Need to do some math to get numbers/formulas, but now situation
>>>     doesn't seem to be promising.
>>>     In case smb from C* developers/architects is reading this
>>>     message, I'd be grateful to get some links to calculations of C*
>>>     reliability based on which decisions were made.
>>>
>>>     Regards,
>>>     Kyrill
>>>     ------------------------------------------------------------------------
>>>     *From:*kurt greaves <kurt@instaclustr.com
>>>     <ma...@instaclustr.com>>
>>>     *Sent:*Tuesday, January 16, 2018 2:16:34 AM
>>>     *To:*User
>>>     *Subject:*Re: vnodes: high availability
>>>     Yeah it's very unlikely that you will have 2 nodes in the
>>>     cluster with NO intersecting token ranges (vnodes) for an RF of
>>>     3 (probably even 2).
>>>
>>>     If node A goes down all 256 ranges will go down, and considering
>>>     there are only 49 other nodes all with 256 vnodes each, it's
>>>     very likely that every node will be responsible for some range A
>>>     was also responsible for. I'm not sure what the exact math is,
>>>     but think of it this way: If on each node, any of its 256 token
>>>     ranges overlap (it's within the next RF-1 or previous RF-1 token
>>>     ranges) on the ring with a token range on node A those token
>>>     ranges will be down at QUORUM.
>>>
>>>     Because token range assignment just uses rand() under the hood,
>>>     I'm sure you could prove that it's always going to be the case
>>>     that any 2 nodes going down result in a loss of QUORUM for some
>>>     token range.
>>>
>>>     On 15 January 2018 at 19:59, Kyrylo
>>>     Lebediev<Kyrylo_Lebediev@epam.com
>>>     <ma...@epam.com>>wrote:
>>>
>>>         Thanks Alexander!
>>>
>>>         I'm not a MS in math too) Unfortunately.
>>>
>>>         Not sure, but it seems to me that probability of 2/49 in
>>>         your explanation doesn't take into account that vnodes
>>>         endpoints are almost evenly distributed across all nodes (at
>>>         least it's what I can see from "nodetool ring" output).
>>>
>>>         http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>>         <http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html>
>>>         of course this vnodes illustration is a theoretical one, but
>>>         there are no 2 nodes on that diagram that can be switched off
>>>         without losing a key range (at CL=QUORUM).
>>>
>>>         That's because vnodes_per_node=8 > Nnodes=6.
>>>         As far as I understand, situation is getting worse with
>>>         increase of vnodes_per_node/Nnode ratio.
>>>         Please, correct me if I'm wrong.
>>>
>>>         How would the situation differ from this example by
>>>         DataStax, if we had a real-life 6-nodes cluster with 8
>>>         vnodes on each node?
>>>
>>>         Regards,
>>>         Kyrill
>>>
>>>         ------------------------------------------------------------------------
>>>         *From:*Alexander Dejanovski <alex@thelastpickle.com
>>>         <ma...@thelastpickle.com>>
>>>         *Sent:*Monday, January 15, 2018 8:14:21 PM
>>>
>>>         *To:*user@cassandra.apache.org
>>>         <ma...@cassandra.apache.org>
>>>         *Subject:*Re: vnodes: high availability
>>>         I was corrected off list that the odds of losing data when 2
>>>         nodes are down isn't dependent on the number of vnodes, but
>>>         only on the number of nodes.
>>>         The more vnodes, the smaller the chunks of data you may
>>>         lose, and vice versa.
>>>         I officially suck at statistics, as expected :)
>>>
>>>         Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski
>>>         <alex@thelastpickle.com <ma...@thelastpickle.com>> a
>>>         écrit :
>>>
>>>             Hi Kyrylo,
>>>
>>>             the situation is a bit more nuanced than shown by the
>>>             Datastax diagram, which is fairly theoretical.
>>>             If you're using SimpleStrategy, there is no rack
>>>             awareness. Since vnode distribution is purely random,
>>>             and the replica for a vnode will be placed on the node
>>>             that owns the next vnode in token order (yeah, that's
>>>             not easy to formulate), you end up with statistics only.
>>>
>>>             I kinda suck at maths but I'm going to risk making a
>>>             fool of myself :)
>>>
>>>             The odds for one vnode to be replicated on another node
>>>             are, in your case, 2/49 (out of 49 remaining nodes, 2
>>>             replicas need to be placed).
>>>             Given you have 256 vnodes, the odds for at least one
>>>             vnode of a single node to exist on another one is
>>>             256*(2/49) = 10.4%
>>>             Since the relationship is bi-directional (there are the
>>>             same odds for node B to have a vnode replicated on node
>>>             A than the opposite), that doubles the odds of 2 nodes
>>>             being both replica for at least one vnode : 20.8%.
>>>
>>>             Having a smaller number of vnodes will decrease the
>>>             odds, just as having more nodes in the cluster.
>>>             (now once again, I hope my maths aren't fully wrong, I'm
>>>             pretty rusty in that area...)
>>>
>>>             How many queries that will affect is a different
>>>             question as it depends on which partition currently
>>>             exist and are queried in the unavailable token ranges.
>>>
>>>             Then you have rack awareness that comes with
>>>             NetworkTopologyStrategy :
>>>             If the number of replicas (3 in your case) is
>>>             proportional to the number of racks, Cassandra will
>>>             spread replicas in different ones.
>>>             In that situation, you can theoretically lose as many
>>>             nodes as you want in a single rack, you will still have
>>>             two other replicas available to satisfy quorum in the
>>>             remaining racks.
>>>             If you start losing nodes in different racks, we're back
>>>             to doing maths (but the odds will get slightly different).
>>>
>>>             That makes maintenance predictable because you can shut
>>>             down as many nodes as you want in a single rack without
>>>             losing QUORUM.
>>>
>>>             Feel free to correct my numbers if I'm wrong.
>>>
>>>             Cheers,
>>>
>>>
>>>
>>>
>>>
>>>             On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev
>>>             <Kyrylo_Lebediev@epam.com
>>>             <ma...@epam.com>> wrote:
>>>
>>>                 Thanks, Rahul.
>>>                 But in your example, at the same time loss of Node3
>>>                 and Node6 leads to loss of ranges N, C, J at
>>>                 consistency level QUORUM.
>>>
>>>                 As far as I understand in case vnodes >
>>>                 N_nodes_in_cluster and endpoint_snitch=SimpleSnitch,
>>>                 since:
>>>
>>>                 1) "secondary" replicas are placed on two nodes
>>>                 'next' to the node responsible for a range (in case
>>>                 of RF=3)
>>>                 2) there are a lot of vnodes on each node
>>>                 3) ranges are evenly distributed between vnodes in
>>>                 case of SimpleSnitch,
>>>
>>>                 we get all physical nodes (servers) having mutually
>>>                 adjacent token ranges.
>>>                 Is it correct?
>>>
>>>                 At least in case of my real-world ~50-nodes cluster
>>>                 with vnodes=256, RF=3 for this command:
>>>
>>>                 nodetool ring | grep '^<ip-prefix>' | awk '{print
>>>                 $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep
>>>                 -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq |
>>>                 wc -l
>>>
>>>                 returned a number equal to Nnodes - 1, which
>>>                 means that I can't switch off 2 nodes at the same
>>>                 time w/o losing some keyrange for CL=QUORUM.
>>>
>>>                 Thanks,
>>>                 Kyrill
>>>                 ------------------------------------------------------------------------
>>>                 *From:*Rahul Neelakantan <rahul@rahul.be
>>>                 <ma...@rahul.be>>
>>>                 *Sent:*Monday, January 15, 2018 5:20:20 PM
>>>                 *To:*user@cassandra.apache.org
>>>                 <ma...@cassandra.apache.org>
>>>                 *Subject:*Re: vnodes: high availability
>>>                 Not necessarily. It depends on how the token ranges
>>>                 for the vNodes are assigned to them. For example
>>>                 take a look at this diagram
>>>                 http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>>                 <http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html>
>>>
>>>                 In the vNode part of the diagram, you will see that
>>>                 Loss of Node 3 and Node 6, will still not have any
>>>                 effect on Token Range A. But yes if you lose two
>>>                 nodes that both have Token Range A assigned to them
>>>                 (Say Node 1 and Node 2), you will have
>>>                 unavailability with your specified configuration.
>>>
>>>                 You can sort of circumvent this by using the
>>>                 DataStax Java Driver and having the client recognize
>>>                 a degraded cluster and operate temporarily in
>>>                 downgraded consistency mode
>>>
>>>                 http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>>>                 <http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html>
>>>
>>>                 - Rahul
>>>
>>>                 On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo
>>>                 Lebediev<Kyrylo_Lebediev@epam.com
>>>                 <ma...@epam.com>>wrote:
>>>
>>>                     Hi,
>>>
>>>                     Let's say we have a C* cluster with following
>>>                     parameters:
>>>                      - 50 nodes in the cluster
>>>                      - RF=3
>>>                      - vnodes=256 per node
>>>                      - CL for some queries = QUORUM
>>>                      - endpoint_snitch = SimpleSnitch
>>>
>>>                     Is it correct that any 2 nodes down will cause
>>>                     unavailability of a keyrange at CL=QUORUM?
>>>
>>>                     Regards,
>>>                     Kyrill
>>>
>>>
>>>
>>>
>>>             --
>>>             -----------------
>>>             Alexander Dejanovski
>>>             France
>>>             @alexanderdeja
>>>
>>>             Consultant
>>>             Apache Cassandra Consulting
>>>             http://www.thelastpickle.com <http://www.thelastpickle.com/>
>>>
>>>         --
>>>         -----------------
>>>         Alexander Dejanovski
>>>         France
>>>         @alexanderdeja
>>>
>>>         Consultant
>>>         Apache Cassandra Consulting
>>>         http://www.thelastpickle.com <http://www.thelastpickle.com/>
>>>
>
>


Re: vnodes: high availability

Posted by kurt greaves <ku...@instaclustr.com>.
Even with a low amount of vnodes you're asking for a bad time. Even if you
managed to get down to 2 vnodes per node, you're still likely to include
double the amount of nodes in any streaming/repair operation which will
likely be very problematic for incremental repairs, and you still won't be
able to easily reason about which nodes are responsible for which token
ranges. It's still quite likely that a loss of 2 nodes would mean some
portion of the ring is down (at QUORUM). At the moment I'd say steer clear
of vnodes and use single tokens if you can; a lot of work still needs to be
done to ensure smooth operation of C* while using vnodes, and they are much
more difficult to reason about (which is probably the reason no one has
bothered to do the math). If you're really keen on the math your best bet
is to do it yourself, because it's not a point of interest for many C* devs
plus probably a lot of us wouldn't remember enough math to know how to
approach it.
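For anyone who does want to attempt that math, a crude simulation already shows the trend. The sketch below uses the same assumptions as the earlier sketch in this thread (random tokens, SimpleStrategy-style placement on the next RF-1 distinct nodes, no racks) and counts, for each token count, how many node pairs would lose QUORUM for some range if both went down:

import random
from itertools import combinations

def risky_pairs(num_nodes=50, num_tokens=256, rf=3, seed=0):
    """Pairs of nodes that appear together in at least one replica set:
    losing both drops that range below QUORUM at RF=3."""
    rng = random.Random(seed)
    ring = sorted((rng.getrandbits(63), n)
                  for n in range(num_nodes) for _ in range(num_tokens))
    owners = [n for _, n in ring]
    pairs = set()
    for i in range(len(owners)):
        replicas, j = [], i
        while len(replicas) < rf:
            if owners[j % len(owners)] not in replicas:
                replicas.append(owners[j % len(owners)])
            j += 1
        pairs.update(combinations(sorted(replicas), 2))
    return pairs

total = 50 * 49 // 2
for num_tokens in (1, 4, 16, 32, 256):
    n = len(risky_pairs(num_tokens=num_tokens))
    print(f"num_tokens={num_tokens:3}: {n:4}/{total} pairs would lose QUORUM somewhere")

On these rough assumptions the count climbs from a handful of pairs per node at num_tokens=1 towards essentially all 1225 pairs at 256, which is the trade-off being discussed here.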

If you want to get out of this situation you'll need to do a DC migration
to a new DC with a better configuration of snitch/replication
strategy/racks/tokens.


On 16 January 2018 at 21:54, Kyrylo Lebediev <Ky...@epam.com>
wrote:

> Thank you for this valuable info, Jon.
> I guess both you and Alex are referring to improved vnodes allocation
> method  https://issues.apache.org/jira/browse/CASSANDRA-7032 which was
> implemented in 3.0.
>
> Based on your info and comments in the ticket it's really a bad idea to
> have small number of vnodes for the versions using old allocation method
> because of hot-spots, so it's not an option for my particular case (v.2.1)
> :(
>
> [As far as I can see from the source code this new method
> wasn't backported to 2.1.]
>
>
>
> Regards,
>
> Kyrill
>
> ------------------------------
> *From:* Jon Haddad <jo...@gmail.com> on behalf of Jon Haddad <
> jon@jonhaddad.com>
> *Sent:* Tuesday, January 16, 2018 8:21:33 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
> We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the
> randomness.  There’s going to be some imbalance, the amount of imbalance
> depends on luck, unfortunately.
>
> I’m interested to hear your results using 4 tokens, would you mind letting
> the ML know your experience when you’ve done it?
>
> Jon
>
> On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <Ky...@epam.com>
> wrote:
>
> Agree with you, Jon.
> Actually, this cluster was configured by my 'predecessor' and [fortunately
> for him] we've never met :)
> We're using version 2.1.15 and can't upgrade because of legacy Netflix
> Astyanax client used.
>
> Below in the thread Alex mentioned that it's recommended to set vnodes to
> a value lower than 256 only for C* version > 3.0 (token allocation
> algorithm was improved since C* 3.0) .
>
> Jon,
> Do you have positive experience setting up  cluster with vnodes < 256 for
> C* 2.1?
>
> vnodes=32 is also too high for me (we would need many more than 32
> servers per AZ in order to get a 'reliable' cluster)
> vnodes=4 seems to be better from the HA + balancing trade-off standpoint
>
> Thanks,
> Kyrill
> ------------------------------
> *From:* Jon Haddad <jo...@gmail.com> on behalf of Jon Haddad <
> jon@jonhaddad.com>
> *Sent:* Tuesday, January 16, 2018 6:44:53 PM
> *To:* user
> *Subject:* Re: vnodes: high availability
>
> While all the token math is helpful, I have to also call out the elephant
> in the room:
>
> You have not correctly configured Cassandra for production.
>
> If you had used the correct endpoint snitch & network topology strategy,
> you would be able to withstand the complete failure of an entire
> availability zone at QUORUM, or two if you queried at CL=ONE.
>
> You are correct about 256 tokens causing issues, it’s one of the reasons
> why we recommend 32.  I’m curious how things behave going as low as 4,
> personally, but I haven’t done the math / tested it yet.
>
>
>
> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <Ky...@epam.com>
> wrote:
>
> ...to me it sounds like 'C* isn't that highly-available by design as it's
> declared'.
> More nodes in a cluster means higher probability of simultaneous node
> failures.
> And from high-availability standpoint, looks like situation is made even
> worse by recommended setting vnodes=256.
>
> Need to do some math to get numbers/formulas, but now situation doesn't
> seem to be promising.
> In case smb from C* developers/architects is reading this message, I'd be
> grateful to get some links to calculations of C* reliability based on which
> decisions were made.
>
> Regards,
> Kyrill
> ------------------------------
> *From:* kurt greaves <ku...@instaclustr.com>
> *Sent:* Tuesday, January 16, 2018 2:16:34 AM
> *To:* User
> *Subject:* Re: vnodes: high availability
>
> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO
> intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>
> If node A goes down all 256 ranges will go down, and considering there are
> only 49 other nodes all with 256 vnodes each, it's very likely that every
> node will be responsible for some range A was also responsible for. I'm not
> sure what the exact math is, but think of it this way: If on each node, any
> of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1
> token ranges) on the ring with a token range on node A those token ranges
> will be down at QUORUM.
>
> Because token range assignment just uses rand() under the hood, I'm sure
> you could prove that it's always going to be the case that any 2 nodes
> going down result in a loss of QUORUM for some token range.
>
> On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>
> wrote:
>
> Thanks Alexander!
>
> I'm not a MS in math too) Unfortunately.
>
> Not sure, but it seems to me that probability of 2/49 in your explanation
> doesn't take into account that vnodes endpoints are almost evenly
> distributed across all nodes (at least it's what I can see from "nodetool
> ring" output).
>
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
> of course this vnodes illustration is a theoretical one, but there are no 2
> nodes on that diagram that can be switched off without losing a key range
> (at CL=QUORUM).
>
> That's because vnodes_per_node=8 > Nnodes=6.
> As far as I understand, situation is getting worse with increase of
> vnodes_per_node/Nnode ratio.
> Please, correct me if I'm wrong.
>
> How would the situation differ from this example by DataStax, if we had a
> real-life 6-nodes cluster with 8 vnodes on each node?
>
> Regards,
> Kyrill
>
> ------------------------------
> *From:* Alexander Dejanovski <al...@thelastpickle.com>
> *Sent:* Monday, January 15, 2018 8:14:21 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
> I was corrected off list that the odds of losing data when 2 nodes are
> down isn't dependent on the number of vnodes, but only on the number of
> nodes.
> The more vnodes, the smaller the chunks of data you may lose, and vice
> versa.
> I officially suck at statistics, as expected :)
>
> Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <
> alex@thelastpickle.com> a écrit :
>
> Hi Kyrylo,
>
> the situation is a bit more nuanced than shown by the Datastax diagram,
> which is fairly theoretical.
> If you're using SimpleStrategy, there is no rack awareness. Since vnode
> distribution is purely random, and the replica for a vnode will be placed
> on the node that owns the next vnode in token order (yeah, that's not easy
> to formulate), you end up with statistics only.
>
> I kinda suck at maths but I'm going to risk making a fool of myself :)
>
> The odds for one vnode to be replicated on another node are, in your case,
> 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
> Given you have 256 vnodes, the odds for at least one vnode of a single
> node to exist on another one is 256*(2/49) = 10.4%
> Since the relationship is bi-directional (there are the same odds for node
> B to have a vnode replicated on node A than the opposite), that doubles the
> odds of 2 nodes being both replica for at least one vnode : 20.8%.
>
> Having a smaller number of vnodes will decrease the odds, just as having
> more nodes in the cluster.
> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in
> that area...)
>
> How many queries that will affect is a different question as it depends on
> which partition currently exist and are queried in the unavailable token
> ranges.
>
> Then you have rack awareness that comes with NetworkTopologyStrategy :
> If the number of replicas (3 in your case) is proportional to the number
> of racks, Cassandra will spread replicas in different ones.
> In that situation, you can theoretically lose as many nodes as you want in
> a single rack, you will still have two other replicas available to satisfy
> quorum in the remaining racks.
> If you start losing nodes in different racks, we're back to doing maths
> (but the odds will get slightly different).
>
> That makes maintenance predictable because you can shut down as many nodes
> as you want in a single rack without losing QUORUM.
>
> Feel free to correct my numbers if I'm wrong.
>
> Cheers,
>
>
>
>
>
> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>
> wrote:
>
> Thanks, Rahul.
> But in your example, at the same time loss of Node3 and Node6 leads to
> loss of ranges N, C, J at consistency level QUORUM.
>
> As far as I understand in case vnodes > N_nodes_in_cluster and
> endpoint_snitch=SimpleSnitch, since:
>
> 1) "secondary" replicas are placed on two nodes 'next' to the node
> responsible for a range (in case of RF=3)
> 2) there are a lot of vnodes on each node
> 3) ranges are evenly distributed between vnodes in case of SimpleSnitch,
>
> we get all physical nodes (servers) having mutually adjacent token ranges.
> Is it correct?
>
> At least in case of my real-world ~50-nodes cluster with vnodes=256, RF=3
> for this command:
>
> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2
> -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort |
> uniq | wc -l
>
> returned a number equal to Nnodes - 1, which means that I can't switch
> off 2 nodes at the same time w/o losing some keyrange for CL=QUORUM.
>
> Thanks,
> Kyrill
> ------------------------------
> *From:* Rahul Neelakantan <ra...@rahul.be>
> *Sent:* Monday, January 15, 2018 5:20:20 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
> Not necessarily. It depends on how the token ranges for the vNodes are
> assigned to them. For example take a look at this diagram
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>
> In the vNode part of the diagram, you will see that Loss of Node 3 and
> Node 6, will still not have any effect on Token Range A. But yes if you
> lose two nodes that both have Token Range A assigned to them (Say Node 1
> and Node 2), you will have unavailability with your specified configuration.
>
> You can sort of circumvent this by using the DataStax Java Driver and
> having the client recognize a degraded cluster and operate temporarily in
> downgraded consistency mode
>
> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>
> - Rahul
>
> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Kyrylo_Lebediev@
> epam.com> wrote:
>
> Hi,
>
> Let's say we have a C* cluster with following parameters:
>  - 50 nodes in the cluster
>  - RF=3
>  - vnodes=256 per node
>  - CL for some queries = QUORUM
>  - endpoint_snitch = SimpleSnitch
>
> Is it correct that any 2 nodes down will cause unavailability of a
> keyrange at CL=QUORUM?
>
> Regards,
> Kyrill
>
>
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>

Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 3.0.

Based on your info and comments in the ticket it's really a bad idea to have small number of vnodes for the versions using old allocation method because of hot-spots, so it's not an option for my particular case (v.2.1) :(

[As far as I can see from the source code this new method wasn't backported to 2.1.]



Regards,

Kyrill



________________________________
From: Jon Haddad <jo...@gmail.com> on behalf of Jon Haddad <jo...@jonhaddad.com>
Sent: Tuesday, January 16, 2018 8:21:33 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  There’s going to be some imbalance, the amount of imbalance depends on luck, unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the ML know your experience when you’ve done it?

Jon

On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've never met :)
We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax client used.

Below in the thread Alex mentioned that it's recommended to set vnodes to a value lower than 256 only for C* version > 3.0 (token allocation algorithm was improved since C* 3.0) .

Jon,
Do you have positive experience setting up  cluster with vnodes < 256 for  C* 2.1?

vnodes=32 is also too high for me (we would need many more than 32 servers per AZ in order to get a 'reliable' cluster)
vnodes=4 seems to be better from the HA + balancing trade-off standpoint

Thanks,
Kyrill
________________________________
From: Jon Haddad <jo...@gmail.com>> on behalf of Jon Haddad <jo...@jonhaddad.com>>
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we recommend 32.  I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

...to me it sounds like 'C* isn't as highly available by design as it's declared to be'.
More nodes in a cluster means a higher probability of simultaneous node failures.
And from a high-availability standpoint, it looks like the situation is made even worse by the recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but for now the situation doesn't seem promising.
In case somebody from the C* developers/architects is reading this message, I'd be grateful for links to the calculations of C* reliability on which these decisions were based.

Regards,
Kyrill
________________________________
From: kurt greaves <ku...@instaclustr.com>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: If on each node, any of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>> wrote:
Thanks Alexander!

I'm not a MS in math too) Unfortunately.

Not sure, but it seems to me that the probability of 2/49 in your explanation doesn't take into account that vnodes endpoints are almost evenly distributed across all nodes (at least it's what I can see from "nodetool ring" output).

http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).

That's because vnodes_per_node=8 > Nnodes=6.
As far as I understand, situation is getting worse with increase of vnodes_per_node/Nnode ratio.
Please, correct me if I'm wrong.

How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node?

Regards,
Kyrill

________________________________
From: Alexander Dejanovski <al...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.
I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <al...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)

How many queries this will affect is a different question, as it depends on which partitions currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>> wrote:
Thanks, Rahul.
But in your example, the simultaneous loss of Node3 and Node6 leads to the loss of ranges N, C, J at consistency level QUORUM.

As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)
2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,

we get all physical nodes (servers) having mutually adjacent token ranges.
Is it correct?

At least in the case of my real-world ~50-node cluster with vnodes=256 and RF=3, this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned a number equal to Nnodes - 1, which means that I can't switch off 2 nodes at the same time without losing some keyrange at CL=QUORUM.
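
Roughly the same check can be scripted; this is only a rendition of the idea behind the pipeline above (the helper name and the assumption that the owner address is the first column of each nodetool ring token line are mine), not a byte-for-byte equivalent:

import subprocess

def quorum_neighbours(target_ip, ip_prefix, rf=3):
    """Distinct other nodes owning tokens within RF-1 positions of any token
    owned by target_ip, i.e. nodes that may share a replica set with it."""
    out = subprocess.run(["nodetool", "ring"], capture_output=True, text=True).stdout
    owners = [line.split()[0] for line in out.splitlines() if line.startswith(ip_prefix)]
    neighbours, n = set(), len(owners)
    for i, ip in enumerate(owners):
        if ip != target_ip:
            continue
        for d in range(1, rf):
            neighbours.add(owners[(i - d) % n])
            neighbours.add(owners[(i + d) % n])
    neighbours.discard(target_ip)
    return neighbours

# If this returns Nnodes - 1 addresses, there is no second node that can be taken
# down together with target_ip without losing QUORUM for some range.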

Thanks,
Kyrill
________________________________
From: Rahul Neelakantan <ra...@rahul.be>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:
Hi,

Let's say we have a C* cluster with following parameters:
 - 50 nodes in the cluster
 - RF=3
 - vnodes=256 per node
 - CL for some queries = QUORUM
 - endpoint_snitch = SimpleSnitch

Is it correct that any 2 nodes down will cause unavailability of a keyrange at CL=QUORUM?

Regards,
Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>


Re: vnodes: high availability

Posted by Jon Haddad <jo...@jonhaddad.com>.
We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  There’s going to be some imbalance; the amount of imbalance depends on luck, unfortunately.

I’m interested to hear your results using 4 tokens; would you mind letting the ML know your experience when you’ve done it?

Jon
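
A rough way to get a feel for the imbalance caused by purely random allocation is to simulate it; a small sketch (my own model of pre-3.0-style random token assignment for a 50-node cluster, not measured data):

import random

def worst_ownership_ratio(n_nodes, num_tokens, trials=20):
    """Simulate purely random token allocation and report the worst max/min
    primary-ownership ratio seen across trials (1.0 = perfectly balanced)."""
    ring_size = 2 ** 64
    worst = 1.0
    for _ in range(trials):
        tokens = sorted((random.getrandbits(64), n)
                        for n in range(n_nodes) for _ in range(num_tokens))
        owned = [0] * n_nodes
        for i, (tok, node) in enumerate(tokens):
            prev = tokens[i - 1][0] if i else tokens[-1][0] - ring_size
            owned[node] += tok - prev           # range (prev, tok] belongs to this node
        worst = max(worst, max(owned) / min(owned))
    return worst

random.seed(1)
for nt in (4, 32, 256):
    print(f"num_tokens={nt:>3}: worst max/min ownership ratio over 20 rings "
          f"~ {worst_ownership_ratio(50, nt):.2f}")

With only 4 tokens the spread tends to be noticeably wider than with 32 or 256, which is the luck factor mentioned above (and what the post-3.0 allocation algorithm tries to fix).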

> On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <Ky...@epam.com> wrote:
> 
> Agree with you, Jon.
> Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've never met :)
> We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax client used.
> 
> Below in the thread Alex mentioned that it's recommended to set vnodes to a value lower than 256 only for C* version > 3.0 (token allocation algorithm was improved since C* 3.0) .
> 
> Jon,  
> Do you have positive experience setting up  cluster with vnodes < 256 for  C* 2.1? 
> 
> vnodes=32 also too high, as for me (we need to have much more than 32 servers per AZ in order to to get 'reliable' cluster)
> vnodes=4 seems to be better from HA + balancing trade-off
> 
> Thanks, 
> Kyrill
> From: Jon Haddad <jo...@gmail.com> on behalf of Jon Haddad <jo...@jonhaddad.com>
> Sent: Tuesday, January 16, 2018 6:44:53 PM
> To: user
> Subject: Re: vnodes: high availability
>  
> While all the token math is helpful, I have to also call out the elephant in the room:
> 
> You have not correctly configured Cassandra for production.
> 
> If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE. 
> 
> You are correct about 256 tokens causing issues, it’s one of the reasons why we recommend 32.  I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.
> 
> 
> 
>> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>> 
>> ...to me it sounds like 'C* isn't that highly-available by design as it's declared'.
>> More nodes in a cluster means higher probability of simultaneous node failures.
>> And from high-availability standpoint, looks like situation is made even worse by recommended setting vnodes=256.
>> 
>> Need to do some math to get numbers/formulas, but now situation doesn't seem to be promising.
>> In case smb from C* developers/architects is reading this message, I'd be grateful to get some links to calculations of C* reliability based on which decisions were made.  
>> 
>> Regards, 
>> Kyrill
>> From: kurt greaves <kurt@instaclustr.com <ma...@instaclustr.com>>
>> Sent: Tuesday, January 16, 2018 2:16:34 AM
>> To: User
>> Subject: Re: vnodes: high availability
>>  
>> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>> 
>> If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: If on each node, any of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A those token ranges will be down at QUORUM. 
>> 
>> Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.
>> 
>> On 15 January 2018 at 19:59, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>> Thanks Alexander!
>> 
>> I'm not a MS in math too) Unfortunately.
>> 
>> Not sure, but it seems to me that probability of 2/49 in your explanation doesn't take into account that vnodes endpoints are almost evenly distributed across all nodes (al least it's what I can see from "nodetool ring" output).
>> 
>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html <http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html>
>> of course this vnodes illustration is a theoretical one, but there no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM). 
>> 
>> That's because vnodes_per_node=8 > Nnodes=6.
>> As far as I understand, situation is getting worse with increase of vnodes_per_node/Nnode ratio.
>> Please, correct me if I'm wrong.
>> 
>> How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node? 
>> 
>> Regards, 
>> Kyrill
>> 
>> From: Alexander Dejanovski <alex@thelastpickle.com <ma...@thelastpickle.com>>
>> Sent: Monday, January 15, 2018 8:14:21 PM
>> 
>> To: user@cassandra.apache.org <ma...@cassandra.apache.org>
>> Subject: Re: vnodes: high availability
>>  
>> I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
>> The more vnodes, the smaller the chunks of data you may lose, and vice versa.
>> I officially suck at statistics, as expected :)
>> 
>> Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <alex@thelastpickle.com <ma...@thelastpickle.com>> a écrit :
>> Hi Kyrylo,
>> 
>> the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
>> If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.
>> 
>> I kinda suck at maths but I'm going to risk making a fool of myself :)
>> 
>> The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
>> Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
>> Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.
>> 
>> Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
>> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)
>> 
>> How many queries that will affect is a different question as it depends on which partition currently exist and are queried in the unavailable token ranges.
>> 
>> Then you have rack awareness that comes with NetworkTopologyStrategy : 
>> If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
>> In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
>> If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).
>> 
>> That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.
>> 
>> Feel free to correct my numbers if I'm wrong.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> 
>> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>> Thanks, Rahul.
>> But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.
>> 
>> As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:
>> 
>> 1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)
>> 2) there are a lot of vnodes on each node
>> 3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,
>> 
>> we get all physical nodes (servers) having mutually adjacent  token rages.
>> Is it correct?
>> 
>> At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3 for this command:
>> 
>> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l
>> 
>> returned number which equals to Nnodes -1, what means that I can't switch off 2 nodes at the same time w/o losing of some keyrange for CL=QUORUM.
>> 
>> Thanks,
>> Kyrill
>> From: Rahul Neelakantan <rahul@rahul.be <ma...@rahul.be>>
>> Sent: Monday, January 15, 2018 5:20:20 PM
>> To: user@cassandra.apache.org <ma...@cassandra.apache.org>
>> Subject: Re: vnodes: high availability
>>  
>> Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram 
>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html <http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html>
>> 
>> In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.
>> 
>> You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode
>> 
>> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html <http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html>
>> 
>> - Rahul
>> 
>> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
>> Hi, 
>> 
>> Let's say we have a C* cluster with following parameters:
>>  - 50 nodes in the cluster
>>  - RF=3 
>>  - vnodes=256 per node
>>  - CL for some queries = QUORUM
>>  - endpoint_snitch = SimpleSnitch
>> 
>> Is it correct that 2 any nodes down will cause unavailability of a keyrange at CL=QUORUM?
>> 
>> Regards, 
>> Kyrill
>> 
>> 
>> 
>> -- 
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>> 
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com <http://www.thelastpickle.com/>
>> -- 
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>> 
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com <http://www.thelastpickle.com/>

Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
Agree with you, Jon.

Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've never met :)

We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax client used.


Below in the thread Alex mentioned that it's recommended to set vnodes to a value lower than 256 only for C* versions 3.0+ (the token allocation algorithm was improved in C* 3.0).


Jon,

Do you have positive experience setting up a cluster with vnodes < 256 for C* 2.1?


vnodes=32 is also too high in my opinion (we'd need many more than 32 servers per AZ in order to get a 'reliable' cluster)

vnodes=4 seems to be a better HA + balancing trade-off


Thanks,

Kyrill

________________________________
From: Jon Haddad <jo...@gmail.com> on behalf of Jon Haddad <jo...@jonhaddad.com>
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we recommend 32.  I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's declared'.
More nodes in a cluster means higher probability of simultaneous node failures.
And from high-availability standpoint, looks like situation is made even worse by recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but now situation doesn't seem to be promising.
In case smb from C* developers/architects is reading this message, I'd be grateful to get some links to calculations of C* reliability based on which decisions were made.

Regards,
Kyrill
________________________________
From: kurt greaves <ku...@instaclustr.com>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: If on each node, any of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>> wrote:
Thanks Alexander!

I'm not a MS in math too) Unfortunately.

Not sure, but it seems to me that probability of 2/49 in your explanation doesn't take into account that vnodes endpoints are almost evenly distributed across all nodes (al least it's what I can see from "nodetool ring" output).

http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).

That's because vnodes_per_node=8 > Nnodes=6.
As far as I understand, situation is getting worse with increase of vnodes_per_node/Nnode ratio.
Please, correct me if I'm wrong.

How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node?

Regards,
Kyrill

________________________________
From: Alexander Dejanovski <al...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.
I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <al...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)

How many queries that will affect is a different question as it depends on which partition currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>> wrote:
Thanks, Rahul.
But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.

As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)
2) there are a lot of vnodes on each node
3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,

we get all physical nodes (servers) having mutually adjacent  token rages.
Is it correct?

At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3 for this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned number which equals to Nnodes -1, what means that I can't switch off 2 nodes at the same time w/o losing of some keyrange for CL=QUORUM.

Thanks,
Kyrill
________________________________
From: Rahul Neelakantan <ra...@rahul.be>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:
Hi,

Let's say we have a C* cluster with following parameters:
 - 50 nodes in the cluster
 - RF=3
 - vnodes=256 per node
 - CL for some queries = QUORUM
 - endpoint_snitch = SimpleSnitch

Is it correct that 2 any nodes down will cause unavailability of a keyrange at CL=QUORUM?

Regards,
Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>


Re: vnodes: high availability

Posted by Jon Haddad <jo...@jonhaddad.com>.
While all the token math is helpful, I have to also call out the elephant in the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE. 

You are correct about 256 tokens causing issues; it’s one of the reasons why we recommend 32.  I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.



> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <Ky...@epam.com> wrote:
> 
> ...to me it sounds like 'C* isn't that highly-available by design as it's declared'.
> More nodes in a cluster means higher probability of simultaneous node failures.
> And from high-availability standpoint, looks like situation is made even worse by recommended setting vnodes=256.
> 
> Need to do some math to get numbers/formulas, but now situation doesn't seem to be promising.
> In case smb from C* developers/architects is reading this message, I'd be grateful to get some links to calculations of C* reliability based on which decisions were made.  
> 
> Regards, 
> Kyrill
> From: kurt greaves <ku...@instaclustr.com>
> Sent: Tuesday, January 16, 2018 2:16:34 AM
> To: User
> Subject: Re: vnodes: high availability
>  
> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
> 
> If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: If on each node, any of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A those token ranges will be down at QUORUM. 
> 
> Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.
> 
> On 15 January 2018 at 19:59, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
> Thanks Alexander!
> 
> I'm not a MS in math too) Unfortunately.
> 
> Not sure, but it seems to me that probability of 2/49 in your explanation doesn't take into account that vnodes endpoints are almost evenly distributed across all nodes (al least it's what I can see from "nodetool ring" output).
> 
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html <http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html>
> of course this vnodes illustration is a theoretical one, but there no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM). 
> 
> That's because vnodes_per_node=8 > Nnodes=6.
> As far as I understand, situation is getting worse with increase of vnodes_per_node/Nnode ratio.
> Please, correct me if I'm wrong.
> 
> How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node? 
> 
> Regards, 
> Kyrill
> 
> From: Alexander Dejanovski <alex@thelastpickle.com <ma...@thelastpickle.com>>
> Sent: Monday, January 15, 2018 8:14:21 PM
> 
> To: user@cassandra.apache.org <ma...@cassandra.apache.org>
> Subject: Re: vnodes: high availability
>  
> I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
> The more vnodes, the smaller the chunks of data you may lose, and vice versa.
> I officially suck at statistics, as expected :)
> 
> Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <alex@thelastpickle.com <ma...@thelastpickle.com>> a écrit :
> Hi Kyrylo,
> 
> the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
> If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.
> 
> I kinda suck at maths but I'm going to risk making a fool of myself :)
> 
> The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
> Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
> Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.
> 
> Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)
> 
> How many queries that will affect is a different question as it depends on which partition currently exist and are queried in the unavailable token ranges.
> 
> Then you have rack awareness that comes with NetworkTopologyStrategy : 
> If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
> In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
> If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).
> 
> That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.
> 
> Feel free to correct my numbers if I'm wrong.
> 
> Cheers,
> 
> 
> 
> 
> 
> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
> Thanks, Rahul.
> But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.
> 
> As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:
> 
> 1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)
> 2) there are a lot of vnodes on each node
> 3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,
> 
> we get all physical nodes (servers) having mutually adjacent  token rages.
> Is it correct?
> 
> At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3 for this command:
> 
> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l
> 
> returned number which equals to Nnodes -1, what means that I can't switch off 2 nodes at the same time w/o losing of some keyrange for CL=QUORUM.
> 
> Thanks,
> Kyrill
> From: Rahul Neelakantan <rahul@rahul.be <ma...@rahul.be>>
> Sent: Monday, January 15, 2018 5:20:20 PM
> To: user@cassandra.apache.org <ma...@cassandra.apache.org>
> Subject: Re: vnodes: high availability
>  
> Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram 
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html <http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html>
> 
> In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.
> 
> You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode
> 
> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html <http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html>
> 
> - Rahul
> 
> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Kyrylo_Lebediev@epam.com <ma...@epam.com>> wrote:
> Hi, 
> 
> Let's say we have a C* cluster with following parameters:
>  - 50 nodes in the cluster
>  - RF=3 
>  - vnodes=256 per node
>  - CL for some queries = QUORUM
>  - endpoint_snitch = SimpleSnitch
> 
> Is it correct that 2 any nodes down will cause unavailability of a keyrange at CL=QUORUM?
> 
> Regards, 
> Kyrill
> 
> 
> 
> -- 
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
> 
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com <http://www.thelastpickle.com/>
> -- 
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
> 
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com <http://www.thelastpickle.com/>

Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
Alex, thanks for detailed explanation.


> The more nodes you have, the smaller will be the subset of data that cannot achieve quorum (so your outage is not as bad as when you have a small number of nodes

Okay, let's say we lose 0.5% of the keyrange [for the specified CL]. Critically important data chunks may fall into this range.


>If you want more availability but don't want to sacrifice consistency, you can raise your replication factor (if you can afford the extra disk space usage)


IMO, ~1.67x more disk space (5/3) isn't the most important drawback in this case. For example, increasing RF affects latency for queries with CL=QUORUM (we'd have to wait for responses from 3 nodes at RF=5 instead of 2 nodes at RF=3). Plus, the overhead of most maintenance operations like anti-entropy repairs should increase with RF.
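
Just to make those figures explicit, the quorum arithmetic as a tiny sketch:

# Quorum size and failure tolerance as a function of RF (simple arithmetic,
# matching the RF=3 -> 2 and RF=5 -> 3 figures above).
for rf in (3, 5):
    quorum = rf // 2 + 1
    print(f"RF={rf}: QUORUM waits on {quorum} replicas and tolerates "
          f"{rf - quorum} replica(s) down per range")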


>Datacenter and rack awareness built in Cassandra can help with availability guarantees : 1 full rack down out of 3 will still allow QUORUM at RF=3

At the same time, if any 2 nodes from different racks go down at the same time (in case of high vnodes values), a keyrange becomes unavailable, as far as I understand. [I stick to CL=QUORUM just for brevity's sake. For CL=ONE the issue is similar]


[In the worst-case scenario of any C* setup (with or without vnodes), if we lose two 'neighboring' nodes, we get an outage for a keyrange at CL=QUORUM: http://thelastpickle.com/blog/2011/06/13/Down-For-Me.html ]


--

I'm not trying to criticize the C* architecture, which gives us good features like linear scalability, but I feel that some math should be done in order to elaborate setup best practices on how to maximize availability for C* clusters (there are a number of best practices, including those mentioned by Alex, but personally I sometimes can't see which math is behind them)


If anybody on the DL has some math prepared, please share it with us. I guess I'm not the only one interested in getting these valuable formulas/graphs.
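
To start somewhere, here is a rough counting sketch for the rack-aware case above (my own simplifying assumptions: 48 nodes split evenly across 3 racks, RF=3 with one replica per rack, and enough vnodes that any two nodes in different racks almost surely share at least one range). Under those assumptions a double failure only threatens QUORUM when the two failed nodes sit in different racks:

from math import comb

nodes, racks = 48, 3                    # assumption: 48 nodes, 16 per rack
per_rack = nodes // racks
p_same_rack = racks * comb(per_rack, 2) / comb(nodes, 2)
print(f"P(two failed nodes share a rack, QUORUM survives)   ~ {p_same_rack:.2f}")
print(f"P(two failed nodes in different racks, QUORUM risk) ~ {1 - p_same_rack:.2f}")

So with many vnodes, rack awareness roughly turns "any 2 nodes down is bad" into "about 2 out of 3 random 2-node failures are bad", while single-token clusters reduce it much further because only ring neighbours share ranges. Treat these as rough figures under the stated assumptions, not a proper derivation.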


Thanks,

Kyrill




________________________________
From: Alexander Dejanovski <al...@thelastpickle.com>
Sent: Tuesday, January 16, 2018 12:50:13 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

Hi Kyrylo,

high availability can be interpreted in many ways, and comes with some tradeoffs with consistency when things go wrong.
A few considerations here :

  *   The more nodes you have, the smaller will be the subset of data that cannot achieve quorum (so your outage is not as bad as when you have a small number of nodes)
  *   If you want more availability but don't want to sacrifice consistency, you can raise your replication factor (if you can afford the extra disk space usage)
  *   Datacenter and rack awareness built in Cassandra can help with availability guarantees : 1 full rack down out of 3 will still allow QUORUM at RF=3 and 2 racks down out of 5 at RF=5. Having one datacenter down (when using LOCAL_QUORUM) allows you to switch to another one and still have a working cluster.
  *   As mentioned in this thread, you can use downgrading retry policies to improve availability at the transient expense of consistency (check if your use case allows it)

Now about vnodes, the recommendation of using 256 is based on statistical analysis of data balance across clusters. Since the token allocation is fully random, it's been observed that 256 vnodes always gave a good balance.
If you're using a version of Cassandra >= 3.0, you can lower that to a value such as 16 or 32 and use the new token allocation algorithm. It will perform several attempts in order to balance a specific keyspace during bootstrap.
Using smaller numbers of vnodes will also improve repair time.
I won't go into statistics again (yikes) and leave it to people that are better at doing maths on how the number of vnodes per node could affect availability.

That brings us to the fact that you can fully disable vnodes and use a single token per node. In that case, you can be sure which nodes are replicas of the same tokens as it follows the ring order : With RF=3, node A tokens are replicated on nodes B and C, and node B tokens are replicated on nodes C and D, and so on.
You get more predictability as to which nodes can be taken down at the same time without losing QUORUM.
But you must afford the operational burden of handling tokens manually, and accept that growing the cluster means doubling the size each time.

The thing to consider is how your apps/services will react in case of transient loss of QUORUM : can you afford eventual consistency ? Is it better to endure full downtime or just on a subset of your partitions ?
And can you design your cluster with racks/datacenters so that you can better predict how to run maintenance operations or if you may be losing QUORUM ?

The way Cassandra is designed also allows linear scalability, which master/slave based databases cannot handle (and master/slave architectures come with their set of challenges, especially during network partitions).

So, while the high availability isn't as transparent as one might think (and I understand why you may be disappointed), you have a lot of options on how to react to partial downtime, and that's something you must consider both when designing your cluster (how it is segmented, how operations are performed), and when designing your apps (how you will use the driver, how your apps will react to failure).

Cheers,


On Tue, Jan 16, 2018 at 11:03 AM Kyrylo Lebediev <Ky...@epam.com>> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's declared'.

More nodes in a cluster means higher probability of simultaneous node failures.

And from high-availability standpoint, looks like situation is made even worse by recommended setting vnodes=256.


Need to do some math to get numbers/formulas, but now situation doesn't seem to be promising.

In case smb from C* developers/architects is reading this message, I'd be grateful to get some links to calculations of C* reliability based on which decisions were made.


Regards,

Kyrill

________________________________
From: kurt greaves <ku...@instaclustr.com>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User

Subject: Re: vnodes: high availability
Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 49 other nodes all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: If on each node, any of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Thanks Alexander!


I'm not a MS in math too) Unfortunately.


Not sure, but it seems to me that probability of 2/49 in your explanation doesn't take into account that vnodes endpoints are almost evenly distributed across all nodes (al least it's what I can see from "nodetool ring" output).


http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).


That's because vnodes_per_node=8 > Nnodes=6.

As far as I understand, situation is getting worse with increase of vnodes_per_node/Nnode ratio.

Please, correct me if I'm wrong.


How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node?


Regards,

Kyrill


________________________________
From: Alexander Dejanovski <al...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability


I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.

I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <al...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)

How many queries that will affect is a different question as it depends on which partition currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>> wrote:

Thanks, Rahul.

But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.


As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)

2) there are a lot of vnodes on each node
3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,


we get all physical nodes (servers) having mutually adjacent  token rages.
Is it correct?

At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3 for this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned number which equals to Nnodes -1, what means that I can't switch off 2 nodes at the same time w/o losing of some keyrange for CL=QUORUM.


Thanks,

Kyrill

________________________________
From: Rahul Neelakantan <ra...@rahul.be>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Hi,


Let's say we have a C* cluster with following parameters:

 - 50 nodes in the cluster

 - RF=3

 - vnodes=256 per node

 - CL for some queries = QUORUM

 - endpoint_snitch = SimpleSnitch


Is it correct that 2 any nodes down will cause unavailability of a keyrange at CL=QUORUM?


Regards,

Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>

Re: vnodes: high availability

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Hi Kyrylo,

high availability can be interpreted in many ways, and comes with some
tradeoffs with consistency when things go wrong.
A few considerations here :

   - The more nodes you have, the smaller will be the subset of data that
   cannot achieve quorum (so your outage is not as bad as when you have a
   small number of nodes)
   - If you want more availability but don't want to sacrifice consistency,
   you can raise your replication factor (if you can afford the extra disk
   space usage)
   - Datacenter and rack awareness built in Cassandra can help with
   availability guarantees : 1 full rack down out of 3 will still allow QUORUM
   at RF=3 and 2 racks down out of 5 at RF=5. Having one datacenter down (when
   using LOCAL_QUORUM) allows you to switch to another one and still have a
   working cluster.
   - As mentioned in this thread, you can use downgrading retry policies to
   improve availability at the transient expense of consistency (check if your
   use case allows it; a rough client-side sketch follows just after this list)
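
For the retry-policy bullet, a rough client-side sketch. The thread links the Java driver's DowngradingConsistencyRetryPolicy; this sketch uses the DataStax Python driver's policy of the same name instead, and the configuration style shown (the default_retry_policy keyword, driver 3.x legacy config) is an assumption on my part, so treat it as illustrative only:

from cassandra.cluster import Cluster
from cassandra.policies import DowngradingConsistencyRetryPolicy
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

# Assumed driver 3.x "legacy" configuration style (newer drivers use execution profiles).
cluster = Cluster(
    contact_points=["10.0.0.1"],                        # hypothetical contact point
    default_retry_policy=DowngradingConsistencyRetryPolicy(),
)
session = cluster.connect("my_keyspace")                # hypothetical keyspace

# Ask for QUORUM; if not enough replicas are alive, the policy retries the request
# at a lower consistency level instead of failing outright.
stmt = SimpleStatement("SELECT * FROM my_table WHERE id = %s",
                       consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(stmt, ("some-key",)).one()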

Now about vnodes, the recommendation of using 256 is based on statistical
analysis of data balance across clusters. Since the token allocation is
fully random, it's been observed that 256 vnodes always gave a good balance.
If you're using a version of Cassandra >= 3.0, you can lower that to a
value such as 16 or 32 and use the new token allocation algorithm.
It will perform several attempts in order to balance a specific keyspace
during bootstrap.
Using smaller numbers of vnodes will also improve repair time.
I won't go into statistics again (yikes) and leave it to people that are
better at doing maths on how the number of vnodes per node could affect
availability.

That brings us to the fact that you can fully disable vnodes and use a
single token per node. In that case, you can be sure which nodes are
replicas of the same tokens as it follows the ring order : With RF=3, node
A tokens are replicated on nodes B and C, and node B tokens are replicated
on nodes C and D, and so on.
You get more predictability as to which nodes can be taken down at the same
time without losing QUORUM.
But you must afford the operational burden of handling tokens manually, and
accept that growing the cluster means doubling the size each time.
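
A tiny illustration of that single-token placement (my own toy example of the "next RF-1 nodes clockwise" rule described above):

RF = 3
nodes = ["A", "B", "C", "D", "E", "F"]          # assume single tokens in this ring order

for i, node in enumerate(nodes):
    replicas = [nodes[(i + k) % len(nodes)] for k in range(RF)]
    print(f"primary range of {node} -> stored on {replicas}")

# primary range of A -> stored on ['A', 'B', 'C']
# primary range of B -> stored on ['B', 'C', 'D']
# ... so only ring neighbours ever share a range, and safe pairs are easy to spot.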

The thing to consider is how your apps/services will react in case of
transient loss of QUORUM : can you afford eventual consistency ? Is it
better to endure full downtime or just on a subset of your partitions ?
And can you design your cluster with racks/datacenters so that you can
better predict how to run maintenance operations or if you may be losing
QUORUM ?

The way Cassandra is designed also allows linear scalability, which
master/slave based databases cannot handle (and master/slave architectures
come with their set of challenges, especially during network partitions).

So, while the high availability isn't as transparent as one might think
(and I understand why you may be disappointed), you have a lot of options
on how to react to partial downtime, and that's something you must consider
both when designing your cluster (how it is segmented, how operations are
performed), and when designing your apps (how you will use the driver, how
your apps will react to failure).

Cheers,


On Tue, Jan 16, 2018 at 11:03 AM Kyrylo Lebediev <Ky...@epam.com>
wrote:

> ...to me it sounds like 'C* isn't that highly-available by design as it's
> declared'.
>
> More nodes in a cluster means higher probability of simultaneous node
> failures.
>
> And from high-availability standpoint, looks like situation is made even
> worse by recommended setting vnodes=256.
>
>
> Need to do some math to get numbers/formulas, but now situation doesn't
> seem to be promising.
>
> In case smb from C* developers/architects is reading this message, I'd be
> grateful to get some links to calculations of C* reliability based on which
> decisions were made.
>
>
> Regards,
>
> Kyrill
> ------------------------------
> *From:* kurt greaves <ku...@instaclustr.com>
> *Sent:* Tuesday, January 16, 2018 2:16:34 AM
> *To:* User
>
> *Subject:* Re: vnodes: high availability
> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO
> intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>
> If node A goes down all 256 ranges will go down, and considering there are
> only 49 other nodes all with 256 vnodes each, it's very likely that every
> node will be responsible for some range A was also responsible for. I'm not
> sure what the exact math is, but think of it this way: If on each node, any
> of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1
> token ranges) on the ring with a token range on node A those token ranges
> will be down at QUORUM.
>
> Because token range assignment just uses rand() under the hood, I'm sure
> you could prove that it's always going to be the case that any 2 nodes
> going down result in a loss of QUORUM for some token range.
>
> On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>
> wrote:
>
> Thanks Alexander!
>
>
> I'm not a MS in math too) Unfortunately.
>
>
> Not sure, but it seems to me that probability of 2/49 in your explanation
> doesn't take into account that vnodes endpoints are almost evenly
> distributed across all nodes (al least it's what I can see from "nodetool
> ring" output).
>
>
>
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
> of course this vnodes illustration is a theoretical one, but there no 2
> nodes on that diagram that can be switched off without losing a key range
> (at CL=QUORUM).
>
>
> That's because vnodes_per_node=8 > Nnodes=6.
>
> As far as I understand, situation is getting worse with increase of
> vnodes_per_node/Nnode ratio.
>
> Please, correct me if I'm wrong.
>
>
> How would the situation differ from this example by DataStax, if we had a
> real-life 6-nodes cluster with 8 vnodes on each node?
>
>
> Regards,
>
> Kyrill
>
>
> ------------------------------
> *From:* Alexander Dejanovski <al...@thelastpickle.com>
> *Sent:* Monday, January 15, 2018 8:14:21 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
>
> I was corrected off list that the odds of losing data when 2 nodes are
> down isn't dependent on the number of vnodes, but only on the number of
> nodes.
> The more vnodes, the smaller the chunks of data you may lose, and vice
> versa.
>
> I officially suck at statistics, as expected :)
>
> Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <
> alex@thelastpickle.com> a écrit :
>
> Hi Kyrylo,
>
> the situation is a bit more nuanced than shown by the Datastax diagram,
> which is fairly theoretical.
> If you're using SimpleStrategy, there is no rack awareness. Since vnode
> distribution is purely random, and the replica for a vnode will be placed
> on the node that owns the next vnode in token order (yeah, that's not easy
> to formulate), you end up with statistics only.
>
> I kinda suck at maths but I'm going to risk making a fool of myself :)
>
> The odds for one vnode to be replicated on another node are, in your case,
> 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
> Given you have 256 vnodes, the odds for at least one vnode of a single
> node to exist on another one is 256*(2/49) = 10.4%
> Since the relationship is bi-directional (there are the same odds for node
> B to have a vnode replicated on node A than the opposite), that doubles the
> odds of 2 nodes being both replica for at least one vnode : 20.8%.
>
> Having a smaller number of vnodes will decrease the odds, just as having
> more nodes in the cluster.
> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in
> that area...)
>
> How many queries that will affect is a different question as it depends on
> which partition currently exist and are queried in the unavailable token
> ranges.
>
> Then you have rack awareness that comes with NetworkTopologyStrategy :
> If the number of replicas (3 in your case) is proportional to the number
> of racks, Cassandra will spread replicas in different ones.
> In that situation, you can theoretically lose as many nodes as you want in
> a single rack, you will still have two other replicas available to satisfy
> quorum in the remaining racks.
> If you start losing nodes in different racks, we're back to doing maths
> (but the odds will get slightly different).
>
> That makes maintenance predictable because you can shut down as many nodes
> as you want in a single rack without losing QUORUM.
>
> Feel free to correct my numbers if I'm wrong.
>
> Cheers,
>
>
>
>
>
> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>
> wrote:
>
> Thanks, Rahul.
>
> But in your example, at the same time loss of Node3 and Node6 leads to
> loss of ranges N, C, J at consistency level QUORUM.
>
>
> As far as I understand in case vnodes > N_nodes_in_cluster and
> endpoint_snitch=SimpleSnitch, since:
>
>
> 1) "secondary" replicas are placed on two nodes 'next' to the node
> responsible for a range (in case of RF=3)
>
> 2) there are a lot of vnodes on each node
> 3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,
>
>
> we get all physical nodes (servers) having mutually adjacent  token rages.
> Is it correct?
>
> At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3
> for this command:
>
> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2
> -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort |
> uniq | wc -l
>
> returned number which equals to Nnodes -1, what means that I can't switch
> off 2 nodes at the same time w/o losing of some keyrange for CL=QUORUM.
>
>
> Thanks,
>
> Kyrill
> ------------------------------
> *From:* Rahul Neelakantan <ra...@rahul.be>
> *Sent:* Monday, January 15, 2018 5:20:20 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
> Not necessarily. It depends on how the token ranges for the vNodes are
> assigned to them. For example take a look at this diagram
>
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>
> In the vNode part of the diagram, you will see that Loss of Node 3 and
> Node 6, will still not have any effect on Token Range A. But yes if you
> lose two nodes that both have Token Range A assigned to them (Say Node 1
> and Node 2), you will have unavailability with your specified configuration.
>
> You can sort of circumvent this by using the DataStax Java Driver and
> having the client recognize a degraded cluster and operate temporarily in
> downgraded consistency mode
>
>
> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>
> - Rahul
>
> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <
> Kyrylo_Lebediev@epam.com> wrote:
>
> Hi,
>
>
> Let's say we have a C* cluster with following parameters:
>
>  - 50 nodes in the cluster
>
>  - RF=3
>
>  - vnodes=256 per node
>
>  - CL for some queries = QUORUM
>
>  - endpoint_snitch = SimpleSnitch
>
>
> Is it correct that 2 any nodes down will cause unavailability of a
> keyrange at CL=QUORUM?
>
>
> Regards,
>
> Kyrill
>
>
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>

-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
...to me it sounds like 'C* isn't as highly available by design as it's declared to be'.

More nodes in a cluster means higher probability of simultaneous node failures.

And from a high-availability standpoint, it looks like the situation is made even worse by the recommended setting of vnodes=256.


I need to do some math to get numbers/formulas, but for now the situation doesn't look promising.
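
As a first rough cut (a sketch only, assuming independent node failures and an illustrative per-node failure probability p that I'm picking just for the example), here is how the chance of having at least two nodes down at the same time grows with cluster size:

# Sketch: probability of >= 2 simultaneous node failures, assuming
# independent failures and an illustrative per-node probability p.
def p_at_least_two_down(n_nodes, p):
    # 1 - P(no node down) - P(exactly one node down)
    return 1 - (1 - p) ** n_nodes - n_nodes * p * (1 - p) ** (n_nodes - 1)

for n in (10, 50, 100):
    print(n, round(p_at_least_two_down(n, 0.01), 4))

The number only grows with n_nodes, which is what worries me.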

In case somebody from the C* developers/architects is reading this message, I'd be grateful for links to the reliability calculations on which these decisions were based.


Regards,

Kyrill

________________________________
From: kurt greaves <ku...@instaclustr.com>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down, all 256 ranges will go down, and considering there are only 49 other nodes, all with 256 vnodes each, it's very likely that every node will be responsible for some range that A was also responsible for. I'm not sure what the exact math is, but think of it this way: on each node, if any of its 256 token ranges overlaps with a token range on node A (i.e. it falls within the next RF-1 or previous RF-1 token ranges on the ring), those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Thanks Alexander!


I don't have an MS in math either :) Unfortunately.


Not sure, but it seems to me that the probability of 2/49 in your explanation doesn't take into account that vnode endpoints are almost evenly distributed across all nodes (at least that's what I can see from "nodetool ring" output).


http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).


That's because vnodes_per_node=8 > Nnodes=6.

As far as I understand, the situation gets worse as the vnodes_per_node/Nnode ratio increases.

Please, correct me if I'm wrong.


How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node?


Regards,

Kyrill


________________________________
From: Alexander Dejanovski <al...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability


I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.

I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <al...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)

How many queries that will affect is a different question as it depends on which partition currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>> wrote:

Thanks, Rahul.

But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.


As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)

2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,


we get all physical nodes (servers) having mutually adjacent token ranges.
Is it correct?

At least in the case of my real-world ~50-node cluster with vnodes=256, RF=3, this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned a number equal to Nnodes - 1, which means that I can't switch off any 2 nodes at the same time without losing some keyrange at CL=QUORUM.


Thanks,

Kyrill

________________________________
From: Rahul Neelakantan <ra...@rahul.be>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Hi,


Let's say we have a C* cluster with following parameters:

 - 50 nodes in the cluster

 - RF=3

 - vnodes=256 per node

 - CL for some queries = QUORUM

 - endpoint_snitch = SimpleSnitch


Is it correct that 2 any nodes down will cause unavailability of a keyrange at CL=QUORUM?


Regards,

Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>


Re: vnodes: high availability

Posted by kurt greaves <ku...@instaclustr.com>.
Yeah it's very unlikely that you will have 2 nodes in the cluster with NO
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down, all 256 ranges will go down, and considering there are
only 49 other nodes, all with 256 vnodes each, it's very likely that every
node will be responsible for some range that A was also responsible for. I'm
not sure what the exact math is, but think of it this way: on each node, if
any of its 256 token ranges overlaps with a token range on node A (i.e. it
falls within the next RF-1 or previous RF-1 token ranges on the ring), those
token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure
you could prove that it's always going to be the case that any 2 nodes
going down result in a loss of QUORUM for some token range.
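
A quick Monte Carlo sketch of that argument (my own rough approximation of
SimpleStrategy placement over randomly assigned tokens, not the actual
Cassandra code):

import random

def quorum_lost_for_pair(n_nodes=50, vnodes=256, rf=3, trials=20):
    """Fraction of trials where losing 2 random nodes breaks QUORUM for some range."""
    lost = 0
    for _ in range(trials):
        # random token -> owning node, sorted into ring order
        ring = sorted((random.random(), n) for n in range(n_nodes) for _ in range(vnodes))
        owners = [n for _, n in ring]
        down = set(random.sample(range(n_nodes), 2))
        for i in range(len(owners)):
            # SimpleStrategy-style placement: range owner plus next rf-1 distinct nodes
            replicas, j = [], i
            while len(replicas) < rf:
                cand = owners[j % len(owners)]
                if cand not in replicas:
                    replicas.append(cand)
                j += 1
            if len(down & set(replicas)) >= 2:  # 2 of 3 replicas down => no QUORUM
                lost += 1
                break
    return lost / trials

print(quorum_lost_for_pair())

With 50 nodes and 256 vnodes this should come out at (or extremely close to)
1.0, i.e. any 2 nodes down lose QUORUM for some range.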

On 15 January 2018 at 19:59, Kyrylo Lebediev <Ky...@epam.com>
wrote:

> Thanks Alexander!
>
>
> I'm not a MS in math too) Unfortunately.
>
>
> Not sure, but it seems to me that probability of 2/49 in your explanation
> doesn't take into account that vnodes endpoints are almost evenly
> distributed across all nodes (al least it's what I can see from "nodetool
> ring" output).
>
>
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/
> architectureDataDistributeDistribute_c.html
> of course this vnodes illustration is a theoretical one, but there no 2
> nodes on that diagram that can be switched off without losing a key range
> (at CL=QUORUM).
>
>
> That's because vnodes_per_node=8 > Nnodes=6.
>
> As far as I understand, situation is getting worse with increase of
> vnodes_per_node/Nnode ratio.
>
> Please, correct me if I'm wrong.
>
>
> How would the situation differ from this example by DataStax, if we had a
> real-life 6-nodes cluster with 8 vnodes on each node?
>
>
> Regards,
>
> Kyrill
>
>
> ------------------------------
> *From:* Alexander Dejanovski <al...@thelastpickle.com>
> *Sent:* Monday, January 15, 2018 8:14:21 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
>
> I was corrected off list that the odds of losing data when 2 nodes are
> down isn't dependent on the number of vnodes, but only on the number of
> nodes.
> The more vnodes, the smaller the chunks of data you may lose, and vice
> versa.
>
> I officially suck at statistics, as expected :)
>
> Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <
> alex@thelastpickle.com> a écrit :
>
> Hi Kyrylo,
>
> the situation is a bit more nuanced than shown by the Datastax diagram,
> which is fairly theoretical.
> If you're using SimpleStrategy, there is no rack awareness. Since vnode
> distribution is purely random, and the replica for a vnode will be placed
> on the node that owns the next vnode in token order (yeah, that's not easy
> to formulate), you end up with statistics only.
>
> I kinda suck at maths but I'm going to risk making a fool of myself :)
>
> The odds for one vnode to be replicated on another node are, in your case,
> 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
> Given you have 256 vnodes, the odds for at least one vnode of a single
> node to exist on another one is 256*(2/49) = 10.4%
> Since the relationship is bi-directional (there are the same odds for node
> B to have a vnode replicated on node A than the opposite), that doubles the
> odds of 2 nodes being both replica for at least one vnode : 20.8%.
>
> Having a smaller number of vnodes will decrease the odds, just as having
> more nodes in the cluster.
> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in
> that area...)
>
> How many queries that will affect is a different question as it depends on
> which partition currently exist and are queried in the unavailable token
> ranges.
>
> Then you have rack awareness that comes with NetworkTopologyStrategy :
> If the number of replicas (3 in your case) is proportional to the number
> of racks, Cassandra will spread replicas in different ones.
> In that situation, you can theoretically lose as many nodes as you want in
> a single rack, you will still have two other replicas available to satisfy
> quorum in the remaining racks.
> If you start losing nodes in different racks, we're back to doing maths
> (but the odds will get slightly different).
>
> That makes maintenance predictable because you can shut down as many nodes
> as you want in a single rack without losing QUORUM.
>
> Feel free to correct my numbers if I'm wrong.
>
> Cheers,
>
>
>
>
>
> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>
> wrote:
>
> Thanks, Rahul.
>
> But in your example, at the same time loss of Node3 and Node6 leads to
> loss of ranges N, C, J at consistency level QUORUM.
>
>
> As far as I understand in case vnodes > N_nodes_in_cluster and
> endpoint_snitch=SimpleSnitch, since:
>
>
> 1) "secondary" replicas are placed on two nodes 'next' to the node
> responsible for a range (in case of RF=3)
>
> 2) there are a lot of vnodes on each node
> 3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,
>
>
> we get all physical nodes (servers) having mutually adjacent  token rages.
> Is it correct?
>
> At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3
> for this command:
>
> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2
> -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort |
> uniq | wc -l
>
> returned number which equals to Nnodes -1, what means that I can't switch
> off 2 nodes at the same time w/o losing of some keyrange for CL=QUORUM.
>
>
> Thanks,
>
> Kyrill
> ------------------------------
> *From:* Rahul Neelakantan <ra...@rahul.be>
> *Sent:* Monday, January 15, 2018 5:20:20 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
> Not necessarily. It depends on how the token ranges for the vNodes are
> assigned to them. For example take a look at this diagram
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/
> architectureDataDistributeDistribute_c.html
>
> In the vNode part of the diagram, you will see that Loss of Node 3 and
> Node 6, will still not have any effect on Token Range A. But yes if you
> lose two nodes that both have Token Range A assigned to them (Say Node 1
> and Node 2), you will have unavailability with your specified configuration.
>
> You can sort of circumvent this by using the DataStax Java Driver and
> having the client recognize a degraded cluster and operate temporarily in
> downgraded consistency mode
>
> http://docs.datastax.com/en/latest-java-driver-api/com/
> datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>
> - Rahul
>
> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <
> Kyrylo_Lebediev@epam.com> wrote:
>
> Hi,
>
>
> Let's say we have a C* cluster with following parameters:
>
>  - 50 nodes in the cluster
>
>  - RF=3
>
>  - vnodes=256 per node
>
>  - CL for some queries = QUORUM
>
>  - endpoint_snitch = SimpleSnitch
>
>
> Is it correct that 2 any nodes down will cause unavailability of a
> keyrange at CL=QUORUM?
>
>
> Regards,
>
> Kyrill
>
>
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>

Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
Thanks Alexander!


I don't have an MS in math either :) Unfortunately.


Not sure, but it seems to me that the probability of 2/49 in your explanation doesn't take into account that vnode endpoints are almost evenly distributed across all nodes (at least that's what I can see from "nodetool ring" output).


http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).


That's because vnodes_per_node=8 > Nnodes=6.

As far as I understand, the situation gets worse as the vnodes_per_node/Nnode ratio increases.

Please, correct me if I'm wrong.


How would the situation differ from this example by DataStax, if we had a real-life 6-nodes cluster with 8 vnodes on each node?


Regards,

Kyrill


________________________________
From: Alexander Dejanovski <al...@thelastpickle.com>
Sent: Monday, January 15, 2018 8:14:21 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability


I was corrected off list that the odds of losing data when 2 nodes are down isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.

I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <al...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to have a vnode replicated on node A than the opposite), that doubles the odds of 2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that area...)

How many queries that will affect is a different question as it depends on which partition currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a single rack, you will still have two other replicas available to satisfy quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>> wrote:

Thanks, Rahul.

But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.


As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)

2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,


we get all physical nodes (servers) having mutually adjacent token ranges.
Is it correct?

At least in the case of my real-world ~50-node cluster with vnodes=256, RF=3, this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned a number equal to Nnodes - 1, which means that I can't switch off any 2 nodes at the same time without losing some keyrange at CL=QUORUM.


Thanks,

Kyrill

________________________________
From: Rahul Neelakantan <ra...@rahul.be>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Hi,


Let's say we have a C* cluster with following parameters:

 - 50 nodes in the cluster

 - RF=3

 - vnodes=256 per node

 - CL for some queries = QUORUM

 - endpoint_snitch = SimpleSnitch


Is it correct that 2 any nodes down will cause unavailability of a keyrange at CL=QUORUM?


Regards,

Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>

Re: vnodes: high availability

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
I was corrected off list that the odds of losing data when 2 nodes are down
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice
versa.

I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski <al...@thelastpickle.com>
a écrit :

> Hi Kyrylo,
>
> the situation is a bit more nuanced than shown by the Datastax diagram,
> which is fairly theoretical.
> If you're using SimpleStrategy, there is no rack awareness. Since vnode
> distribution is purely random, and the replica for a vnode will be placed
> on the node that owns the next vnode in token order (yeah, that's not easy
> to formulate), you end up with statistics only.
>
> I kinda suck at maths but I'm going to risk making a fool of myself :)
>
> The odds for one vnode to be replicated on another node are, in your case,
> 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
> Given you have 256 vnodes, the odds for at least one vnode of a single
> node to exist on another one is 256*(2/49) = 10.4%
> Since the relationship is bi-directional (there are the same odds for node
> B to have a vnode replicated on node A than the opposite), that doubles the
> odds of 2 nodes being both replica for at least one vnode : 20.8%.
>
> Having a smaller number of vnodes will decrease the odds, just as having
> more nodes in the cluster.
> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in
> that area...)
>
> How many queries that will affect is a different question as it depends on
> which partition currently exist and are queried in the unavailable token
> ranges.
>
> Then you have rack awareness that comes with NetworkTopologyStrategy :
> If the number of replicas (3 in your case) is proportional to the number
> of racks, Cassandra will spread replicas in different ones.
> In that situation, you can theoretically lose as many nodes as you want in
> a single rack, you will still have two other replicas available to satisfy
> quorum in the remaining racks.
> If you start losing nodes in different racks, we're back to doing maths
> (but the odds will get slightly different).
>
> That makes maintenance predictable because you can shut down as many nodes
> as you want in a single rack without losing QUORUM.
>
> Feel free to correct my numbers if I'm wrong.
>
> Cheers,
>
>
>
>
>
> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>
> wrote:
>
>> Thanks, Rahul.
>>
>> But in your example, at the same time loss of Node3 and Node6 leads to
>> loss of ranges N, C, J at consistency level QUORUM.
>>
>>
>> As far as I understand in case vnodes > N_nodes_in_cluster and
>> endpoint_snitch=SimpleSnitch, since:
>>
>>
>> 1) "secondary" replicas are placed on two nodes 'next' to the node
>> responsible for a range (in case of RF=3)
>>
>> 2) there are a lot of vnodes on each node
>> 3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,
>>
>>
>> we get all physical nodes (servers) having mutually adjacent  token rages.
>> Is it correct?
>>
>> At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3
>> for this command:
>>
>> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2
>> -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort |
>> uniq | wc -l
>>
>> returned number which equals to Nnodes -1, what means that I can't switch
>> off 2 nodes at the same time w/o losing of some keyrange for CL=QUORUM.
>>
>>
>> Thanks,
>>
>> Kyrill
>> ------------------------------
>> *From:* Rahul Neelakantan <ra...@rahul.be>
>> *Sent:* Monday, January 15, 2018 5:20:20 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: vnodes: high availability
>>
>> Not necessarily. It depends on how the token ranges for the vNodes are
>> assigned to them. For example take a look at this diagram
>>
>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>
>> In the vNode part of the diagram, you will see that Loss of Node 3 and
>> Node 6, will still not have any effect on Token Range A. But yes if you
>> lose two nodes that both have Token Range A assigned to them (Say Node 1
>> and Node 2), you will have unavailability with your specified configuration.
>>
>> You can sort of circumvent this by using the DataStax Java Driver and
>> having the client recognize a degraded cluster and operate temporarily in
>> downgraded consistency mode
>>
>>
>> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>>
>> - Rahul
>>
>> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <
>> Kyrylo_Lebediev@epam.com> wrote:
>>
>> Hi,
>>
>>
>> Let's say we have a C* cluster with following parameters:
>>
>>  - 50 nodes in the cluster
>>
>>  - RF=3
>>
>>  - vnodes=256 per node
>>
>>  - CL for some queries = QUORUM
>>
>>  - endpoint_snitch = SimpleSnitch
>>
>>
>> Is it correct that 2 any nodes down will cause unavailability of a
>> keyrange at CL=QUORUM?
>>
>>
>> Regards,
>>
>> Kyrill
>>
>>
>>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: vnodes: high availability

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram,
which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode
distribution is purely random, and the replica for a vnode will be placed
on the node that owns the next vnode in token order (yeah, that's not easy
to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case,
2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single
node to exist on another one are 256*(2/49) = 10.4%.
Since the relationship is bi-directional (the odds for node B to have a
vnode replicated on node A are the same as the opposite), that doubles the
odds of 2 nodes both being replicas for at least one vnode: 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having
more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in
that area...)

How many queries this will affect is a different question, as it depends on
which partitions currently exist and are queried in the unavailable token
ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of
racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in
a single rack, you will still have two other replicas available to satisfy
quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths
(but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes
as you want in a single rack without losing QUORUM.
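
For reference, getting there is mostly a matter of the keyspace replication
options plus a rack-aware snitch. A minimal sketch with the DataStax Python
driver (the contact point, keyspace name and DC name are made up for the
example):

from cassandra.cluster import Cluster

session = Cluster(['10.0.0.1']).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

With 3 racks defined in the snitch, the 3 replicas of every range should end
up in 3 different racks.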

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <Ky...@epam.com>
wrote:

> Thanks, Rahul.
>
> But in your example, at the same time loss of Node3 and Node6 leads to
> loss of ranges N, C, J at consistency level QUORUM.
>
>
> As far as I understand in case vnodes > N_nodes_in_cluster and
> endpoint_snitch=SimpleSnitch, since:
>
>
> 1) "secondary" replicas are placed on two nodes 'next' to the node
> responsible for a range (in case of RF=3)
>
> 2) there are a lot of vnodes on each node
> 3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,
>
>
> we get all physical nodes (servers) having mutually adjacent  token rages.
> Is it correct?
>
> At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3
> for this command:
>
> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2
> -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort |
> uniq | wc -l
>
> returned number which equals to Nnodes -1, what means that I can't switch
> off 2 nodes at the same time w/o losing of some keyrange for CL=QUORUM.
>
>
> Thanks,
>
> Kyrill
> ------------------------------
> *From:* Rahul Neelakantan <ra...@rahul.be>
> *Sent:* Monday, January 15, 2018 5:20:20 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
> Not necessarily. It depends on how the token ranges for the vNodes are
> assigned to them. For example take a look at this diagram
>
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>
> In the vNode part of the diagram, you will see that Loss of Node 3 and
> Node 6, will still not have any effect on Token Range A. But yes if you
> lose two nodes that both have Token Range A assigned to them (Say Node 1
> and Node 2), you will have unavailability with your specified configuration.
>
> You can sort of circumvent this by using the DataStax Java Driver and
> having the client recognize a degraded cluster and operate temporarily in
> downgraded consistency mode
>
>
> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>
> - Rahul
>
> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <
> Kyrylo_Lebediev@epam.com> wrote:
>
> Hi,
>
>
> Let's say we have a C* cluster with following parameters:
>
>  - 50 nodes in the cluster
>
>  - RF=3
>
>  - vnodes=256 per node
>
>  - CL for some queries = QUORUM
>
>  - endpoint_snitch = SimpleSnitch
>
>
> Is it correct that 2 any nodes down will cause unavailability of a
> keyrange at CL=QUORUM?
>
>
> Regards,
>
> Kyrill
>
>
>

-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: vnodes: high availability

Posted by Kyrylo Lebediev <Ky...@epam.com>.
Thanks, Rahul.

But in your example, at the same time loss of Node3 and Node6 leads to loss of ranges N, C, J at consistency level QUORUM.


As far as I understand in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible for a range (in case of RF=3)

2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,


we get all physical nodes (servers) having mutually adjacent token ranges.
Is it correct?

At least in the case of my real-world ~50-node cluster with vnodes=256, RF=3, this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l

returned a number equal to Nnodes - 1, which means that I can't switch off any 2 nodes at the same time without losing some keyrange at CL=QUORUM.
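
For what it's worth, the same check as a small Python sketch (my assumptions: nodetool is on PATH, the first whitespace-separated field of each data line of "nodetool ring" is the endpoint address, and the IP in the usage comment is made up):

import subprocess

def neighbours(target_ip, rf=3):
    out = subprocess.run(['nodetool', 'ring'], capture_output=True, text=True).stdout
    # keep only data lines, i.e. lines whose first field looks like an address
    ring = [line.split()[0] for line in out.splitlines()
            if line.strip() and line.split()[0][0].isdigit()]
    near = set()
    for i, ip in enumerate(ring):
        if ip == target_ip:
            for d in range(1, rf):           # RF-1 neighbours on each side
                near.add(ring[(i - d) % len(ring)])
                near.add(ring[(i + d) % len(ring)])
    near.discard(target_ip)
    return near

# print(len(neighbours('10.0.0.1')))  # returns Nnodes - 1 on my cluster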


Thanks,

Kyrill

________________________________
From: Rahul Neelakantan <ra...@rahul.be>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, will still not have any effect on Token Range A. But yes if you lose two nodes that both have Token Range A assigned to them (Say Node 1 and Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>> wrote:

Hi,


Let's say we have a C* cluster with following parameters:

 - 50 nodes in the cluster

 - RF=3

 - vnodes=256 per node

 - CL for some queries = QUORUM

 - endpoint_snitch = SimpleSnitch


Is it correct that 2 any nodes down will cause unavailability of a keyrange at CL=QUORUM?


Regards,

Kyrill


Re: vnodes: high availability

Posted by Rahul Neelakantan <ra...@rahul.be>.
Not necessarily. It depends on how the token ranges for the vNodes are
assigned to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node
6, will still not have any effect on Token Range A. But yes if you lose two
nodes that both have Token Range A assigned to them (Say Node 1 and Node
2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and
having the client recognize a degraded cluster and operate temporarily in
downgraded consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
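
For illustration, roughly what that looks like with the DataStax Python
driver (a sketch only; the contact point and keyspace name are made up, and
note this sets the policy cluster-wide rather than per request):

from cassandra.cluster import Cluster
from cassandra.policies import DowngradingConsistencyRetryPolicy

cluster = Cluster(['10.0.0.1'],
                  default_retry_policy=DowngradingConsistencyRetryPolicy())
session = cluster.connect('my_ks')
# Reads/writes that can't reach QUORUM may be retried at a lower consistency level.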

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <Ky...@epam.com>
wrote:

> Hi,
>
>
> Let's say we have a C* cluster with following parameters:
>
>  - 50 nodes in the cluster
>
>  - RF=3
>
>  - vnodes=256 per node
>
>  - CL for some queries = QUORUM
>
>  - endpoint_snitch = SimpleSnitch
>
>
> Is it correct that 2 any nodes down will cause unavailability of a
> keyrange at CL=QUORUM?
>
>
> Regards,
>
> Kyrill
>