You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "sankalp kohli (JIRA)" <ji...@apache.org> on 2013/10/09 05:05:43 UTC

[jira] [Commented] (CASSANDRA-4650) RangeStreamer should be smarter when picking endpoints for streaming in case of N >=3 in each DC.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789969#comment-13789969 ] 

sankalp kohli commented on CASSANDRA-4650:
------------------------------------------

This will be very useful for vnodes as it will even out the load of bootstrapping across the cluster. Also the bootstrap will be faster as it will spread the ranges evenly across the nodes. 

This problem is basically about mapping x ranges to y machines with constraints of replication. So it can be modeled as a max network flow problem. 
I am attaching a picture of a graph. In the picture, R1 to R6 are 6 ranges being mapped to 4 machines(M1 to m4). Here the replication is 2 as each range can be mapped to two machines. All the edges in this graph have weight 1 except the edges between M1 to M4 and t. 
The edges between M1 to M4 and t will be ceil(No of ranges/No of machines) initially(2 in the picture). If we cannot push a flow of 6(i.e no of ranges) from s to t, then we will increase the weight on edges between M1 to M4 and t by 1(in our example to 3).
We will continue to increase the weight by 1 till we have a flow of 6 from s to t.  

We also need to take care of the multi-DC deployment as we dont want to stream from across DC to optimize the load. 

> RangeStreamer should be smarter when picking endpoints for streaming in case of N >=3 in each DC.  
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4650
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4650
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.1.5
>            Reporter: sankalp kohli
>            Assignee: sankalp kohli
>            Priority: Minor
>              Labels: streaming
>         Attachments: photo-1.JPG
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> getRangeFetchMap method in RangeStreamer should pick unique nodes to stream data from when number of replicas in each DC is three or more. 
> When N>=3 in a DC, there are two options for streaming a range. Consider an example of 4 nodes in one datacenter and replication factor of 3. 
> If a node goes down, it needs to recover 3 ranges of data. With current code, two nodes could get selected as it orders the node by proximity. 
> We ideally will want to select 3 nodes for streaming the data. We can do this by selecting unique nodes for each range.  
> Advantages:
> This will increase the performance of bootstrapping a node and will also put less pressure on nodes serving the data. 
> Note: This does not affect if N < 3 in each DC as then it streams data from only 2 nodes. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)