Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2009/06/10 06:45:07 UTC

[jira] Commented: (CASSANDRA-193) Proactive repair

    [ https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717914#action_12717914 ] 

Jonathan Ellis commented on CASSANDRA-193:
------------------------------------------

To start with the good news: one thing which may seem on the face of it to be a problem, isn't really.  That is, how do you get nodes replicating a given token range to agree where to freeze or snapshot the data set to be repaired, in the face of continuing updates?  The answer is, you don't; it doesn't matter.  If we repair a few columnfamilies that don't really need it (because one of the nodes was just a bit slower to process an update than the other), that's no big deal.  We accept that and move on.

The bad news is, I don't see a clever solution for performing broad-based repair against the Memtable/SSTable model similar to Merkle trees for Dynamo/bdb.  (Of course, that is no guarantee that no such solution exists. :)

There are several difficulties.  (In passing, it's worth noting that Bigtable sidesteps these issues by writing both commit logs and sstables to GFS, which takes care of durability.  Here we have to do more work in exchange for a simpler model and better performance on ordinary reads and writes.)

One difficulty lies in how data in one SSTable may be pre-empted by data in another.  Because of this, any hash-based "summary" of a row in one sstable may be obsoleted by rows in another sstable.  For some workloads, particularly ones in which most keys are updated infrequently, caching such a summary in the sstable or index file might still be useful, but it should be kept in mind that in the worst case these summaries will just be wasted effort.
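Here is a toy illustration of that point (the Cell class and merge logic are made up for the example, not Cassandra code): a row's logical value is the timestamp-wise merge of fragments from every sstable, so a digest cached alongside one sstable stops describing the row as soon as a newer sstable pre-empts one of its columns.

import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class StaleSummaryExample
{
    // hypothetical cell: a column value plus its write timestamp
    static class Cell
    {
        final long timestamp;
        final String value;
        Cell(long timestamp, String value) { this.timestamp = timestamp; this.value = value; }
    }

    // newest timestamp wins, column by column
    static Map<String, Cell> merge(Map<String, Cell> older, Map<String, Cell> newer)
    {
        Map<String, Cell> merged = new TreeMap<String, Cell>(older);
        for (Map.Entry<String, Cell> e : newer.entrySet())
        {
            Cell existing = merged.get(e.getKey());
            if (existing == null || e.getValue().timestamp > existing.timestamp)
                merged.put(e.getKey(), e.getValue());
        }
        return merged;
    }

    // hash of a row: column names, timestamps, and values in key order
    static byte[] digest(Map<String, Cell> row) throws Exception
    {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (Map.Entry<String, Cell> e : row.entrySet())
        {
            md.update(e.getKey().getBytes("UTF-8"));
            md.update(Long.toString(e.getValue().timestamp).getBytes("UTF-8"));
            md.update(e.getValue().value.getBytes("UTF-8"));
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception
    {
        Map<String, Cell> rowInSstable1 = new TreeMap<String, Cell>();
        rowInSstable1.put("col1", new Cell(1, "v1"));
        Map<String, Cell> rowInSstable2 = new TreeMap<String, Cell>();
        rowInSstable2.put("col1", new Cell(2, "v2"));  // newer write, landed in a later sstable

        // A hash cached with sstable 1 describes only its own fragment; the row's
        // true digest has to be computed over the merged view, so the two differ.
        System.out.println(Arrays.equals(digest(rowInSstable1),
                                         digest(merge(rowInSstable1, rowInSstable2))));  // prints false
    }
}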

(I think it would be a mistake to address this by forcing a major compaction -- combining all sstables for the columnfamily into one -- as a prerequisite to repair.  Reading and rewriting _all_ the data for _each_ repair is a significant amount of extra I/O.)

Another is that token regions do not correspond 1:1 to sstables, because each node is responsible for N token regions -- the regions for which it is the primary, secondary, tertiary, etc. repository -- all intermingled in the SSTable files.  So any precomputation would need to be done separately N times.

Finally, we can't assume that an sstable's contents, or even just its row keys, will fit into the heap, which limits the kinds of in-memory structures we can build.

So, in short, I do not think it is worth the complexity to attempt to cache per-row hashes or summaries of the sstable data in the sstable or index files.

So the approach I propose is simply to iterate through the key space on a per-CF basis, compute a hash, and repair if there is a mismatch.  The code to iterate keys is already there (for the compaction code) and so is the code to compute hashes and repair if a mismatch is found (for read repair).  I think it will be worth flushing the current memtable first to avoid having to take a read lock on it.
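To make that concrete, here is a rough sketch of the repair pass.  The interfaces (ColumnFamilyStoreLike, ReplicaLike, and their methods) are placeholders invented for the example, not Cassandra's actual classes; the real implementation would reuse the existing compaction key iteration and the read-repair machinery.

import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Iterator;

public class ProactiveRepairSketch
{
    // hypothetical local view of one column family
    interface ColumnFamilyStoreLike
    {
        void forceFlush();                 // flush the current memtable to an sstable
        Iterator<String> keyIterator();    // walk row keys, as the compaction code does
        byte[] serializedRow(String key);  // merged, serialized CF for the key
    }

    // hypothetical handle on a replica of the same range
    interface ReplicaLike
    {
        byte[] rowHash(String columnFamily, String key);      // remote hash of the same row
        void requestRepair(String columnFamily, String key);  // hand off to read repair
    }

    public static void repairColumnFamily(ColumnFamilyStoreLike store, ReplicaLike replica, String columnFamily)
        throws Exception
    {
        store.forceFlush();  // so we never have to take a read lock on the memtable
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (Iterator<String> keys = store.keyIterator(); keys.hasNext(); )
        {
            String key = keys.next();
            md.reset();
            byte[] localHash = md.digest(store.serializedRow(key));
            byte[] remoteHash = replica.rowHash(columnFamily, key);
            if (!Arrays.equals(localHash, remoteHash))
                replica.requestRepair(columnFamily, key);  // mismatch: repair just this row
        }
    }
}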

Enhancements could include building a merkle tree from each batch of hashes to minimize round trips -- although unfortunately I suspect round trips will not be the bottleneck for Cassandra compared to the hash computation itself -- and fixing the compaction and hash computation code to iterate through the columns in a CF rather than deserializing each ColumnFamily in its entirety.  These could definitely be split into separate tickets.
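For the merkle tree enhancement, the idea would be to roll each batch of per-row hashes up into a single root, exchange just the roots, and only fall back to comparing individual rows when the roots disagree.  A minimal sketch, assuming MD5 row hashes as leaves and a non-empty batch (the class and method names are made up for illustration):

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

public class MerkleBatchSketch
{
    // Build the tree bottom-up; leaves are the per-row hashes in key order.
    public static byte[] root(List<byte[]> rowHashes) throws Exception
    {
        MessageDigest md = MessageDigest.getInstance("MD5");
        List<byte[]> level = new ArrayList<byte[]>(rowHashes);
        while (level.size() > 1)
        {
            List<byte[]> next = new ArrayList<byte[]>();
            for (int i = 0; i < level.size(); i += 2)
            {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;  // odd leaf pairs with itself
                md.reset();
                md.update(left);
                md.update(right);
                next.add(md.digest());
            }
            level = next;
        }
        return level.get(0);
    }
}

Two replicas that agree on a batch's root can skip every row in it; a mismatch costs only one more exchange to narrow down which rows actually differ.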


> Proactive repair
> ----------------
>
>                 Key: CASSANDRA-193
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-193
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>             Fix For: 0.4
>
>
> Currently Cassandra supports "read repair," i.e., lazy repair when a read is performed.  This is better than nothing, but it is not sufficient for some cases (e.g. catastrophic node failure, where you need to rebuild all of a node's data on a new machine).
> Dynamo uses merkle trees here.  This is harder for Cassandra given the CF data model, but I suppose we could just hash the serialized CF value.
