Posted to commits@cassandra.apache.org by "Stu Hood (JIRA)" <ji...@apache.org> on 2009/06/20 00:12:07 UTC

[jira] Issue Comment Edited: (CASSANDRA-193) Proactive repair

    [ https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722054#action_12722054 ] 

Stu Hood edited comment on CASSANDRA-193 at 6/19/09 3:10 PM:
-------------------------------------------------------------

The current plan is for an AntiEntropyService per table to maintain a Merkle Tree per column family.

The tree will be implemented as a randomized binary tree in memory (a ''Treap'': http://en.wikipedia.org/wiki/Treap), where every item in the tree represents a range bounded by the dht.Tokens of its left and right neighbors. By placing a bound on the total number of nodes in the tree, we can limit the memory usage. We can compact or split ranges in the tree by removing or adding Tokens. The algorithm for deciding which ranges to compact/split will be described below.
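As a rough sketch of the node layout (all names here are hypothetical, and Token is simplified to a long; in the actual tree the range bounds would come from a node's in-order neighbors rather than being stored per node):

{code}
import java.util.Random;

// Sketch of a treap node covering one token range.
class RangeNode
{
    static final Random random = new Random();
    static final int MAX_NODES = 1 << 15; // example bound on total tree size

    long leftToken, rightToken; // the range this node covers: (leftToken, rightToken]
    byte[] rangeHash;           // hash of just this range's data
    byte[] hash;                // rolled-up subtree hash; null means 'invalid'
    double priority;            // treap heap priority (smaller == closer to the root)
    RangeNode left, right;      // BST children, ordered by token

    RangeNode(long leftToken, long rightToken)
    {
        this.leftToken = leftToken;
        this.rightToken = rightToken;
        this.priority = random.nextDouble();
    }
}
{code}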

When a write comes in for a given table, we will place 'invalidation' operations in a queue for all affected column families. The ExecutorService for the table will read from the queue and perform the 'invalidations' as fast as it can. For a given Key/Token, if any column family tree is marked as 'invalid', the entire row needs to be read from disk and repaired (at some point in the future).
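Roughly, the queue/consumer side could look like this (again hypothetical names; invalidate() is sketched under the next paragraph):

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a per-CF invalidation queue; the drain loop would run on the
// table's ExecutorService.
class InvalidationTask implements Runnable
{
    final BlockingQueue<Long> queue = new LinkedBlockingQueue<Long>();
    final RangeNode root; // the CF's tree, per the sketch above

    InvalidationTask(RangeNode root)
    {
        this.root = root;
    }

    // write path: one enqueue per affected column family
    void enqueue(long token)
    {
        queue.offer(token);
    }

    public void run()
    {
        try
        {
            while (true)
                Invalidator.invalidate(root, queue.take()); // drain as fast as we can
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt(); // table shutdown
        }
    }
}
{code}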

An 'invalidation' operation does a binary search in the Merkle Tree and marks the matching range as 'invalid', deleting its hash. We will also take advantage of this step to optimize the tree: A ''Treap'' stores a random priority (P) on each node, and by generating a random P' and replacing P for a node iff P' < P as we invalidate it, more frequently invalidated ranges will shift to the top of the tree.
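Something like the following (still a sketch; the rotations needed to restore the heap property after a priority change are elided):

{code}
import java.util.Random;

// Sketch of the invalidation walk: find the range covering the token,
// drop its hash, and possibly promote the node toward the root.
class Invalidator
{
    static final Random random = new Random();

    static void invalidate(RangeNode node, long token)
    {
        if (node == null)
            return; // token not covered by the tree
        if (token <= node.leftToken)
            invalidate(node.left, token);
        else if (token > node.rightToken)
            invalidate(node.right, token);
        else
        {
            node.hash = null; // mark the range 'invalid'
            double p = random.nextDouble();
            if (p < node.priority) // replace P iff P' < P
            {
                node.priority = p;
                // rotations to restore the heap property go here, shifting
                // frequently invalidated ranges toward the top of the tree
            }
        }
    }
}
{code}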

The AEService maintaining the tree for a table will occasionally need to exchange portions of the tree with other nodes. In order to do this, the subtrees of all CF trees that both nodes are interested in will have to be locked long enough to recalculate all 'invalid' children, and then the locks can flow down the tree as progressively smaller ranges are exchanged. Doing this locking efficiently is going to be interesting (aka: I haven't thought about it).

Implementing the exchange between nodes is blocked by CASSANDRA-242 because in order to align the subtrees on different nodes, we need to be able to deterministically split two ranges.
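For numeric tokens, 'deterministically split' presumably means both nodes computing the identical midpoint from the same bounds, something like:

{code}
import java.math.BigInteger;

class TokenMidpoint
{
    // Sketch only: both nodes derive the same midpoint from the same bounds.
    // Wrapping ranges would need consistent handling on both sides as well,
    // which is part of what CASSANDRA-242 has to settle at the Token level.
    static BigInteger midpoint(BigInteger left, BigInteger right)
    {
        return left.add(right).shiftRight(1); // assumes left < right, no wrap
    }
}
{code}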

In order to fill in 'invalid' ranges in the tree, the MerkleTree will provide an operation that builds a list of invalid ranges to be fetched from disk. During this step, we can also compact/split ranges. Because of our Treap maintenance, frequently invalidated ranges will be near the top of the tree, and stable ranges will be closer to the leaves. By compacting the deepest N nodes and splitting the shallowest N, we can minimize the size of the ranges that are affected by invalidations in the future.
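The collection pass might look roughly like this (hypothetical names; the compact/split decisions are then made over the returned lists):

{code}
import java.util.List;

// Sketch of an in-order walk gathering 'invalid' nodes (to be re-read from
// disk) along with their depths, which feed the compact/split decisions.
class InvalidCollector
{
    static void collect(RangeNode node, int depth, List<RangeNode> invalid, List<Integer> depths)
    {
        if (node == null)
            return;
        collect(node.left, depth + 1, invalid, depths);
        if (node.hash == null)
        {
            invalid.add(node); // this range must be re-read from disk
            depths.add(depth); // shallow == frequently invalidated, deep == stable
        }
        collect(node.right, depth + 1, invalid, depths);
    }
}
{code}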

Given the list of 'invalid' ranges (and pointers directly to the tree nodes), the AEService will fetch the ranges from the current MemTable and SSTables for the CF, hash them, and store the hashes into the relevant nodes. After this operation, we can recursively calculate hashes for inner nodes.
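With the per-range hashes stored in the rangeHash field from the node sketch above, the roll-up for inner nodes is then a straightforward recursion over the tree (MD5 is only a placeholder digest choice here):

{code}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashRollup
{
    // Sketch: an inner node's hash covers its left subtree, its own range's
    // data hash (just fetched from the MemTable/SSTables), and its right
    // subtree. Assumes rangeHash has been filled in for every node.
    static byte[] recalculate(RangeNode node) throws NoSuchAlgorithmException
    {
        if (node == null)
            return new byte[0];
        byte[] left = recalculate(node.left);
        byte[] right = recalculate(node.right);
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(left);
        md.update(node.rangeHash); // per-range data hash
        md.update(right);
        node.hash = md.digest();   // rolled-up hash for the whole subtree
        return node.hash;
    }
}
{code}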

> Proactive repair
> ----------------
>
>                 Key: CASSANDRA-193
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-193
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>             Fix For: 0.5
>
>
> Currently cassandra supports "read repair," i.e., lazy repair when a read is done.  This is better than nothing but is not sufficient for some cases (e.g. catastrophic node failure where you need to rebuild all of a node's data on a new machine).
> Dynamo uses merkle trees here.  This is harder for Cassandra given the CF data model but I suppose we could just hash the serialized CF value.
