Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2011/06/27 16:05:48 UTC

[jira] [Resolved] (CASSANDRA-2815) Bad timing in repair can transfer data it is not supposed to

     [ https://issues.apache.org/jira/browse/CASSANDRA-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne resolved CASSANDRA-2815.
-----------------------------------------

    Resolution: Duplicate

Marking as a duplicate of CASSANDRA-2816, as the patch there fixes this issue too.

> Bad timing in repair can transfer data it is not supposed to 
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-2815
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2815
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.2
>
>
> The core of the problem is that the sstables used to construct a merkle tree are not necessarily the same as the ones for which streaming is initiated. This is usually not a big deal: newly compacted sstables don't matter since the data hasn't changed. Newly flushed data (written between the start of the validation compaction and the start of the streaming) is a little more problematic, even though one could argue that on average this won't amount to much data.
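> To make the window concrete, here is a minimal, self-contained sketch of the mismatch (all names are made up for illustration; this is not Cassandra's actual API):
> {code}
> import java.util.HashSet;
> import java.util.Set;
>
> // All names here are made up for illustration; not Cassandra's API.
> class ValidationVsStreaming {
>     public static void main(String[] args) {
>         Set<String> liveSSTables = new HashSet<>();
>         liveSSTables.add("sstable-1");
>
>         // Validation pass: the merkle tree is built from the sstables live at T0.
>         Set<String> validated = new HashSet<>(liveSSTables);
>
>         // A memtable flush lands between validation and streaming.
>         liveSSTables.add("sstable-2");
>
>         // Streaming pass: the sstable set is resolved again at T1 > T0, so
>         // sstable-2 is streamed even though it was never validated.
>         Set<String> streamed = new HashSet<>(liveSSTables);
>
>         streamed.removeAll(validated);
>         System.out.println("streamed but never validated: " + streamed);
>     }
> }
> {code}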
> But there is a more problematic scenario: suppose a 3-node cluster with RF=3, where all the data on node3 is removed and repair is then started on node3. Also suppose the cluster has two CFs, cf1 and cf2, sharing the data evenly.
> Node3 will request the usual merkle trees; to simplify, let's pretend validation compaction is single-threaded.
> What can happen is the following (sketched in code after the list):
>   # node3 computes its merkle trees for all requests very quickly, having no data. It will thus wait on the other nodes' trees (its own trees saying "I have no data").
>   # node1 starts computing its tree for, say, cf1 (queuing the computation for cf2). In the meantime, node2 may start computing the tree for cf2 (queuing the one for cf1).
>   # when node1 completes its first tree, it sends it to node3. Node3 receives it, compares it to its own tree and initiates the transfer of all the data for cf1 from node1 to itself.
>   # not too long after that, node2 completes its first tree, the one for cf2, and sends it to node3. Based on it, the transfer of all the data for cf2 from node2 to node3 starts.
>   # An arbitrarily long time after that (validation compaction can take time), node2 will finish its second tree (the one for cf1) and send it back to node3. Node3 will compare it to its own (empty) tree and decide that all the ranges should be repaired with node2 for cf1. The problem is that by the time that happens, the transfer of cf1 from node1 may already be done, or at least partly done. As a result, node3 will start streaming all this data to node2.
>   # For the same reasons, node3 may end up transferring all or part of cf2 to node1.
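> In code, the racy behaviour above boils down to node3 handling each tree in isolation, the moment it arrives (a sketch only; the names are made up, not Cassandra's classes):
> {code}
> import java.util.List;
>
> // Sketch of the racy coordinator logic on node3; illustrative names only.
> class EagerTreeHandling {
>     static final List<String> LOCAL_EMPTY_TREE = List.of();
>
>     // Called for each remote tree as soon as it arrives, with no knowledge
>     // of whether the other replica has computed its tree for the same cf.
>     static void onTreeReceived(String endpoint, String cf, List<String> remoteTree) {
>         if (!remoteTree.equals(LOCAL_EMPTY_TREE)) {
>             // Every range differs from node3's empty tree, so streaming
>             // for this cf starts immediately, in both directions.
>             System.out.println(cf + ": repairing all ranges with " + endpoint);
>         }
>     }
>
>     public static void main(String[] args) {
>         // Step 3: node1's cf1 tree arrives first; node3 pulls all of cf1.
>         onTreeReceived("node1", "cf1", List.of("range1", "range2"));
>         // Step 5: node2's cf1 tree arrives much later; node3 now holds
>         // node1's cf1 data and streams all of it back to node2.
>         onTreeReceived("node2", "cf1", List.of("range1", "range2"));
>     }
> }
> {code}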
> So the problem is that even though node1 and node2 may have been perfectly consistent to begin with, we end up streaming a potentially huge amount of data to them.
> I think this affects both 0.7 and 0.8, though differently, because compaction multi-threading and the fact that 0.8 differentiates the ranges make the timings different.
> One solution (in a way, the theoretically right solution) would be to grab references to the sstables we use for the validation compaction, and only initiate streaming on those sstables. However, in 0.8, where compaction is multi-threaded, this would mean retaining compacted sstables longer than necessary. It is also a bit complicated and in particular would require a network protocol change (because a streaming request message would have to contain enough information to decide which set of sstables to use).
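> As a rough picture of that first solution (everything below, including how references are represented, is hypothetical):
> {code}
> import java.util.HashSet;
> import java.util.Set;
>
> // Sketch of grabbing references at validation time; illustrative only.
> class PinnedRepair {
>     public static void main(String[] args) {
>         Set<String> live = new HashSet<>();
>         live.add("sstable-1");
>
>         // Pin the sstable set used for the validation compaction. While
>         // pinned, these sstables cannot be deleted even if a compaction
>         // replaces them: the "retained longer than necessary" cost.
>         Set<String> pinned = new HashSet<>(live);
>
>         // A flush and a compaction happen while trees are exchanged.
>         live.add("sstable-2");
>
>         // Streaming is initiated on the pinned set only, so exactly the
>         // validated data is streamed. A streaming request would have to
>         // name this set, hence the network protocol change.
>         System.out.println("streaming from: " + pinned);
>
>         pinned.clear();  // release the references once streaming is done
>     }
> }
> {code}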
> A possibly simpler solution would be to have the node coordinating the repair wait for the trees from all the remotes (obviously, only for a given column family and range) before starting any streaming.
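> A sketch of what that waiting could look like (illustrative names only, not the actual implementation):
> {code}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> // Buffer the trees per column family (and range) and act only when all
> // expected replicas have answered; illustrative names only.
> class WaitForAllTrees {
>     static final int EXPECTED_TREES = 3;  // node1, node2 and node3 itself
>     static final Map<String, List<String>> treesByCf = new HashMap<>();
>
>     static void onTreeReceived(String cf, String endpoint) {
>         List<String> received = treesByCf.computeIfAbsent(cf, k -> new ArrayList<>());
>         received.add(endpoint);
>         if (received.size() < EXPECTED_TREES)
>             return;  // just buffer the tree; no differencing, no streaming yet
>         // Only now are the pairwise differences computed, so node3 knows up
>         // front it must fetch cf1 from both node1 and node2, and it streams
>         // nothing back to them.
>         System.out.println(cf + ": all trees in from " + received);
>     }
>
>     public static void main(String[] args) {
>         onTreeReceived("cf1", "node3");  // the local, empty tree
>         onTreeReceived("cf1", "node1");
>         onTreeReceived("cf1", "node2");  // streaming decisions happen here
>     }
> }
> {code}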

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira