Posted to commits@cassandra.apache.org by "Branimir Lambov (JIRA)" <ji...@apache.org> on 2015/02/06 17:23:45 UTC

[jira] [Comment Edited] (CASSANDRA-7032) Improve vnode allocation

    [ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309383#comment-14309383 ] 

Branimir Lambov edited comment on CASSANDRA-7032 at 2/6/15 4:23 PM:
--------------------------------------------------------------------

The answer relies on a few mappings (a rough code sketch follows the list):
- a Cassandra data centre is a separate ring to the algorithm. Data centres have their own allocation of replicas, hence one can only balance load within them, not across multiple DCs. The algorithm can simply ignore everything outside the data centre where the new node is spawned.
- in current Cassandra terms, a node is a node to the algorithm.
- in future Cassandra terms, a disk is a node to the algorithm*.
- a Cassandra rack is a rack to the algorithm if the number of racks within the data centre is equal to or higher than the replication factor; otherwise the Cassandra node is the rack of the algorithm. Implicitly, all disks in a machine belong to the same rack; the machine level can be completely ignored if racks are defined.
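
As a rough illustration of the mapping above (a minimal sketch; all names here are invented for the example, not taken from the attached code):

{code:java}
import java.util.ArrayList;
import java.util.List;

class TopologyMapping
{
    // An algorithm "node": a Cassandra node today, a single disk in the future.
    static class Unit
    {
        final String rack;   // the algorithm "rack" this unit belongs to
        final int vnodes;    // how many vnodes this unit carries
        Unit(String rack, int vnodes) { this.rack = rack; this.vnodes = vnodes; }
    }

    static class Machine
    {
        final String name, rack;
        final int[] vnodesPerDisk;   // one entry per disk
        Machine(String name, String rack, int[] vnodesPerDisk)
        { this.name = name; this.rack = rack; this.vnodesPerDisk = vnodesPerDisk; }
    }

    // One data centre maps to one independent ring; everything outside the
    // data centre is simply ignored by the allocation.
    static List<Unit> unitsForDatacenter(List<Machine> dc, int rf)
    {
        long rackCount = dc.stream().map(m -> m.rack).distinct().count();
        List<Unit> units = new ArrayList<>();
        for (Machine m : dc)
            for (int v : m.vnodesPerDisk)
                // With fewer than RF distinct racks, the machine itself
                // plays the role of the algorithm's rack.
                units.add(new Unit(rackCount >= rf ? m.rack : m.name, v));
        return units;
    }
}
{code}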

The only fine point is starred. I know your preference is to split the token space into a single range per disk. I see little benefit in this compared to simply assigning a subset of the machine's vnodes to each disk. The latter is more flexible, and perhaps even easier to achieve when SSTables are per vnode. (What may not be so easy to see is that each vnode is always responsible for replicating a contiguous range of tokens, and it is easy to convert between a vnode and the token range it serves; see {{ReplicationStrategy.replicationStart}} in the attached.) Slower disks take a small number of vnodes (4?), faster ones take more (8/16?). Bigger machines have more disks, hence more vnodes, and automatically take a higher load. A sketch of both ideas follows.
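
A minimal sketch of handing a machine's vnodes to its disks by a per-disk budget, plus the vnode-to-range conversion (placeholder names; the real conversion, {{ReplicationStrategy.replicationStart}} in the attached, also accounts for replication, which this does not):

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;

class VnodesPerDisk
{
    // Hand each disk its share of the machine's vnode tokens according to a
    // per-disk budget, e.g. 4 for a slow disk, 8 or 16 for a fast one.
    static Map<Integer, List<Long>> assign(List<Long> tokens, int[] budget)
    {
        Map<Integer, List<Long>> byDisk = new HashMap<>();
        int disk = 0, taken = 0;
        for (long t : tokens)
        {
            byDisk.computeIfAbsent(disk, d -> new ArrayList<>()).add(t);
            if (++taken == budget[disk]) { disk = (disk + 1) % budget.length; taken = 0; }
        }
        return byDisk;
    }

    // The primary range of the vnode at token t is (predecessor(t), t],
    // wrapping around the ring; replicated ranges extend further back.
    static long rangeStart(NavigableSet<Long> ring, long t)
    {
        Long prev = ring.lower(t);
        return prev != null ? prev : ring.last();
    }
}
{code}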

The algorithm handles heterogeneous rack and node configurations without issues as long as the number of distinct racks (in algorithm terms) is higher than the replication factor.


> Improve vnode allocation
> ------------------------
>
>                 Key: CASSANDRA-7032
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Branimir Lambov
>              Labels: performance, vnodes
>             Fix For: 3.0
>
>         Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java
>
>
> It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, gradually repairs hotspots in a randomly allocated cluster as more nodes are added, and ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but these are bounded by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better.
> If, on joining the ring, a new node were to run this (or a similar) algorithm and then CAS the result into a shared table holding the canonical allocation of token ranges, we could get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696.
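
For illustration, a minimal sketch of one greedy placement in the spirit described above: each new token splits the currently widest gap on the ring. This is invented for the sketch; the attached TestVNodeAllocation.java is the actual reference.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

class GreedyAllocationSketch
{
    // Place the next vnode token at the midpoint of the widest gap between
    // existing tokens, so the largest ownership range keeps shrinking as
    // nodes join. Assumes a full 64-bit token space and at least two tokens.
    static long nextToken(SortedSet<Long> ring)
    {
        List<Long> ts = new ArrayList<>(ring);   // ascending token order
        int n = ts.size();
        long bestStart = ts.get(0), bestWidth = 0;
        for (int i = 0; i < n; i++)
        {
            long start = ts.get(i);
            long width = ts.get((i + 1) % n) - start;   // mod 2^64, handles the wrap
            if (Long.compareUnsigned(width, bestWidth) > 0)
            {
                bestWidth = width;
                bestStart = start;
            }
        }
        return bestStart + Long.divideUnsigned(bestWidth, 2);   // gap midpoint
    }
}
{code}

Calling this once per vnode of a joining node repeatedly splits the largest remaining range, which is in line with the inverse-of-cluster-size bound mentioned above.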



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)