Posted to commits@cassandra.apache.org by "Marcus Eriksson (JIRA)" <ji...@apache.org> on 2017/06/01 12:28:04 UTC

[jira] [Comment Edited] (CASSANDRA-3200) Repair: compare all trees together (for a given range/cf) instead of by pair in isolation

    [ https://issues.apache.org/jira/browse/CASSANDRA-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032897#comment-16032897 ] 

Marcus Eriksson edited comment on CASSANDRA-3200 at 6/1/17 12:27 PM:
---------------------------------------------------------------------

Branch for this here:
https://github.com/krummas/cassandra/commits/marcuse/CASSANDRA-3200
dtests:
https://github.com/krummas/cassandra-dtest/commits/marcuse/mt_calcs

This does what the ticket description asks for: if we repair three nodes A, B and C, and B has a range out of sync while A and C are equal, we only stream that range to B from one of A or C.

It does this by introducing 'asymmetric syncing': when we compare the merkle trees, each node tracks its incoming streams, and whenever we add an incoming stream we check whether that node is already streaming the same data from another node. This might increase the number of SyncRequest messages sent by the repair coordinator, since it only ever asks remote nodes to fetch ranges from other nodes and never to push any out (this could be optimised, of course, but I doubt it is a problem).
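To make the idea concrete, here is a minimal sketch of the fetch planning for a single range. It is a simplification and not the code in the branch: each node is assumed to be summarised by one digest for the range (rather than a full merkle tree), and the function name {{plan_fetches}} and the digest strings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List


def plan_fetches(digests: Dict[str, str]) -> Dict[str, List[FrozenSet[str]]]:
    """For one range, decide which nodes each node must fetch from.

    `digests` maps node name -> digest of that node's data for the range
    (an assumption standing in for a merkle tree comparison).
    Each entry in a node's list is a set of interchangeable sources:
    the node needs exactly one incoming stream per entry.
    """
    # Group nodes that hold identical data for the range.
    groups = defaultdict(set)
    for node, digest in digests.items():
        groups[digest].add(node)

    plan = {}
    for node, digest in digests.items():
        # One incoming stream per group that differs from this node.
        plan[node] = [
            frozenset(members)
            for other_digest, members in sorted(groups.items())
            if other_digest != digest
        ]
    return plan


# A and C agree, B differs: B fetches once, from either A or C.
plan = plan_fetches({"A": "x", "B": "y", "C": "x"})
```

With these inputs, {{B}} gets a single incoming stream whose source may be either {{A}} or {{C}}, while {{A}} and {{C}} each fetch from {{B}}: three streams in total instead of the four that pairwise comparison would produce.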

It does not compare the leaves in the merkle trees; instead it denormalizes the ranges as we add them. For example, say that node {{A}} has an incoming stream from {{B}} on {{[0, 100)}}, and we then add that {{A}} needs to stream {{[50, 100)}} from {{C}}. The resulting incoming streams to {{A}} would be {{[0, 50)}} from {{B}} and {{[50, 100)}} from *either* {{B}} or {{C}} (assuming {{B}} and {{C}} are equal on the range {{[50, 100)}}).
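The range splitting in that example can be sketched as follows. This is an illustration under stated assumptions, not the implementation in the branch: tokens are plain integers, ranges are half-open and never wrap around the ring, the existing ranges do not overlap each other, and any two sources that end up on the same piece are assumed to hold equal data for it. The name {{add_incoming}} is hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Range = Tuple[int, int]  # half-open [start, end), non-wrapping (assumption)


def add_incoming(
    streams: Dict[Range, FrozenSet[str]], new_range: Range, source: str
) -> Dict[Range, FrozenSet[str]]:
    """Merge `new_range` from `source` into a node's incoming streams.

    `streams` maps each range to the set of nodes it may be fetched from.
    Ranges are cut at every boundary so that each resulting piece has one
    well-defined set of candidate sources.
    """
    # Every start/end point is a potential cut.
    points = {new_range[0], new_range[1]}
    for start, end in streams:
        points.update((start, end))
    cuts = sorted(points)

    result: Dict[Range, FrozenSet[str]] = {}
    for lo, hi in zip(cuts, cuts[1:]):
        sources = set()
        # Carry over sources from any existing range covering this piece.
        for (start, end), existing in streams.items():
            if start <= lo and hi <= end:
                sources |= existing
        # Add the new source if the new range covers this piece.
        if new_range[0] <= lo and hi <= new_range[1]:
            sources.add(source)
        if sources:
            result[(lo, hi)] = frozenset(sources)
    return result


incoming_to_a = {(0, 100): frozenset({"B"})}
incoming_to_a = add_incoming(incoming_to_a, (50, 100), "C")
```

After the call, {{incoming_to_a}} holds {{(0, 50)}} with source {{B}} and {{(50, 100)}} with candidate sources {{B}} and {{C}}, matching the example above. The sketch does not merge adjacent pieces that end up with identical sources.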

When there is the option to stream from several nodes, it tries to pick the least loaded one, with a preference for nodes in the same DC.
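One possible reading of that selection rule is sketched below. How the branch actually weighs load against DC locality is not spelled out here, so the ordering used (same-DC candidates first, then fewest streams already assigned) is an assumption, as are the name {{pick_source}} and the use of a simple stream count as the load metric.

```python
from typing import Dict, Iterable


def pick_source(
    candidates: Iterable[str],
    target_dc: str,
    dc_of: Dict[str, str],
    load: Dict[str, int],
) -> str:
    """Pick one source among interchangeable candidates.

    Assumed policy: restrict to candidates in the target's DC when any
    exist, then take the one with the fewest streams already assigned.
    Ties are broken by node name so the choice is deterministic.
    """
    ordered = sorted(candidates)
    local = [node for node in ordered if dc_of[node] == target_dc]
    pool = local if local else ordered
    # min() returns the first minimal element, so name order breaks ties.
    return min(pool, key=lambda node: load.get(node, 0))


chosen = pick_source(
    {"B", "C"},
    target_dc="dc2",
    dc_of={"B": "dc1", "C": "dc2"},
    load={"B": 0, "C": 5},
)
```

Here {{chosen}} is {{C}}: it is busier than {{B}}, but it is the only candidate in the same DC as the target.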

The old symmetric syncing can be run by passing {{-ss}} to nodetool repair.



> Repair: compare all trees together (for a given range/cf) instead of by pair in isolation
> -----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3200
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3200
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Marcus Eriksson
>            Priority: Minor
>              Labels: repair
>             Fix For: 4.x
>
>
> Currently, repair compares merkle trees by pair, in isolation from any other tree. Concretely, if I have three nodes A, B and C (RF=3) with A and B in sync, but C having some range r inconsistent with both A and B (since those are consistent with each other), we will do the following transfers of r: A -> C, C -> A, B -> C, C -> B.
> The fact that we do both A -> C and C -> A is fine, because we cannot know which of A or C is more up to date. However, the transfer B -> C is useless provided we do A -> C, if A and B are in sync. Not doing that transfer will be a 25% improvement in that case. With RF=5 and only one node inconsistent with all the others, that is almost a 40% improvement, etc.
> Given that this situation of one node not in sync while the others are is probably fairly common (one node died so it is behind), this could be a fair improvement over what is transferred. In the case where we use repair to completely rebuild a node, this will be a dramatic improvement, because it will keep the rebuilt node from getting RF times the data it should get.
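
The percentages in the description can be checked with a quick count. The sketch below assumes exactly one node out of sync with RF - 1 mutually consistent replicas, and counts one transfer per direction per pair; the function name is hypothetical.

```python
from fractions import Fraction


def transfer_savings(rf: int) -> Fraction:
    """Fraction of transfers saved when one of `rf` replicas is out of sync.

    Pairwise comparison: the odd node exchanges data both ways with each
    of the other rf - 1 nodes, i.e. 2 * (rf - 1) transfers.
    Comparing all trees together: the odd node still sends to each of the
    rf - 1 others, but receives only once, i.e. rf transfers.
    """
    pairwise = 2 * (rf - 1)
    needed = (rf - 1) + 1
    return Fraction(pairwise - needed, pairwise)


rf3 = transfer_savings(3)  # Fraction(1, 4), i.e. 25%
rf5 = transfer_savings(5)  # Fraction(3, 8), i.e. 37.5%
```

For RF=3 this gives 1/4, and for RF=5 it gives 3/8 (37.5%), in line with the "25%" and "almost 40%" figures above.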



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
