Posted to commits@cassandra.apache.org by "Paulo Motta (JIRA)" <ji...@apache.org> on 2016/08/05 22:47:20 UTC

[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

    [ https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410230#comment-15410230 ] 

Paulo Motta commented on CASSANDRA-8911:
----------------------------------------

Had an initial look at the WIP branch; overall I think the approach and implementation look good and are not far from being ready to move into testing/benchmarks. Great job! Some preliminary comments below:
* I don't entirely get why we need a separate HUGE response, can't we just page back the data from the mismatching replica in the MBRResponse similar to ordinary responses?
* While it's handy to have a rows/s metric for repaired rows, a repair throttle knob in MB/s (in line with compaction/stream throughput) might be more useful for operators than rows/s: the load imposed by a repairing row can vary between tables, so capacity planning based on rows/s is a bit trickier.
* Currently we're throttling only at the coordinator, but it would probably be interesting to use the same rate limiter to also throttle participant reads/writes.
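For illustration, a bytes-based limiter shared between the coordinator and participants could look roughly like this (a minimal sketch; `BytesThrottle` and its fields are hypothetical names, not from the patch):

```java
// Sketch of a bytes-based throttle (in line with compaction/stream
// throughput) instead of rows/s. All names are illustrative.
public final class BytesThrottle
{
    private final long bytesPerSecond;
    private long availableBytes;
    private long lastRefillNanos;

    public BytesThrottle(long bytesPerSecond)
    {
        this.bytesPerSecond = bytesPerSecond;
        this.availableBytes = bytesPerSecond;   // start with one second of budget
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true and debits the budget if `bytes` may be sent now. */
    public synchronized boolean tryAcquire(long bytes)
    {
        long now = System.nanoTime();
        long refill = (now - lastRefillNanos) * bytesPerSecond / 1_000_000_000L;
        if (refill > 0)
        {
            availableBytes = Math.min(bytesPerSecond, availableBytes + refill);
            lastRefillNanos = now;
        }
        if (availableBytes < bytes)
            return false;   // caller backs off and retries, throttling the repair
        availableBytes -= bytes;
        return true;
    }
}
```

The same instance could be consulted before each page read on participants as well as before each page send on the coordinator.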
* Can we make MBROnHeapUnfilteredPartitions a lazy iterator that caches rows while it's traversed?
** Also, can we leverage MBROnHeapUnfilteredPartitions (or similar) to cache iterated rows on MBRVerbHandler?
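As a sketch of the caching idea (generic Java, not the actual MBROnHeapUnfilteredPartitions API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative lazy iterator that records rows as they are consumed,
 *  so a later replay does not re-read the underlying source. */
public final class CachingIterator<T> implements Iterator<T>
{
    private final Iterator<T> source;
    private final List<T> cache = new ArrayList<>();

    public CachingIterator(Iterator<T> source)
    {
        this.source = source;
    }

    public boolean hasNext()
    {
        return source.hasNext();
    }

    public T next()
    {
        T item = source.next();
        cache.add(item);   // kept on-heap for later replay
        return item;
    }

    /** Replays everything consumed so far without touching the source. */
    public Iterator<T> replay()
    {
        return cache.iterator();
    }
}
```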
* Try to generalize unfiltered and filtered methods/classes into common classes/methods where possible (UnfilteredPager, UnfilteredPagersIterator, getUnfilteredRangeSlice, etc.) (this is probably on your TODO, but just a friendly reminder)
* Right now the service runs a full repair for a table, which probably makes sense for triggered repairs, but when we make that continuous we will probably need to interleave subrange repairs of different tables to ensure fairness (so big tables don't starve small-table repairs).
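A simple way to picture the fairness point: round-robin one subrange per table per turn. The sketch below is purely illustrative (parallel arrays for brevity, not a proposed API):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative round-robin scheduler: each table contributes one subrange
 *  per turn, so large tables cannot starve small ones. */
public final class RoundRobinRepair
{
    /** Returns the order in which subranges would be repaired, given
     *  remaining subrange counts per table. */
    public static List<String> schedule(String[] tables, int[] subranges)
    {
        Deque<int[]> queue = new ArrayDeque<>();   // entries: [tableIndex, remaining]
        for (int i = 0; i < tables.length; i++)
            if (subranges[i] > 0)
                queue.add(new int[]{ i, subranges[i] });

        List<String> order = new ArrayList<>();
        while (!queue.isEmpty())
        {
            int[] entry = queue.poll();
            order.add(tables[entry[0]]);   // repair one subrange of this table
            if (--entry[1] > 0)
                queue.add(entry);          // rejoin the queue if work remains
        }
        return order;
    }
}
```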
* While repairing other replicas right away will probably get nodes consistent faster, I wonder if we could simplify anything by repairing only the coordinator node. That said, since the diffs are already there, I don't see any reason not to repair all replicas.
* We should probably batch read-repair mutations to other replicas, but we can maybe do that in a separate ticket.

Minor nits:
* It would probably be good to bound the number of retries and throw an exception if that bound is exceeded
* On MBRDataRangeTest.wrappingTest, I didn't get why keys (1,0)..(1,9) are not returned.
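The bounded-retry suggestion could look something like this (a sketch; `BoundedRetry` is a hypothetical helper, not from the patch):

```java
import java.util.function.Supplier;

/** Sketch of bounded retries: give up with an exception once maxAttempts
 *  is exceeded instead of looping forever. */
public final class BoundedRetry
{
    public static <T> T withRetries(Supplier<T> action, int maxAttempts)
    {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return action.get();
            }
            catch (RuntimeException e)
            {
                last = e;   // could also back off here before the next attempt
            }
        }
        throw new RuntimeException("gave up after " + maxAttempts + " attempts", last);
    }
}
```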

I think the initial dtests look good; it would be nice to extend them with tests exercising collection and/or range tombstones. It might also be good to add unit tests exercising some edge cases, like repair pages ending before a partition key is finished (probably with the help of the mocking classes introduced by CASSANDRA-12016).

For the performance tests, it would probably be good to compare execution time and repair efficiency (stored rows/mismatched rows) of full repair with few and with many mismatching rows, in scenarios with and without background traffic. We can probably decrease the write timeout and disable hinted handoff to cause dropped mutations and introduce some noise.

Nice work, please let me know if I can help in any of the remaining tasks or tests.

> Consider Mutation-based Repairs
> -------------------------------
>
>                 Key: CASSANDRA-8911
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8911
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tyler Hobbs
>            Assignee: Marcus Eriksson
>             Fix For: 3.x
>
>
> We should consider a mutation-based repair to replace the existing streaming repair.  While we're at it, we could do away with a lot of the complexity around merkle trees.
> I have not planned this out in detail, but here's roughly what I'm thinking:
>  * Instead of building an entire merkle tree up front, just send the "leaves" one-by-one.  Instead of dealing with token ranges, make the leaves primary key ranges.  The PK ranges would need to be contiguous, so that the start of each range would match the end of the previous range. (The first and last leaves would need to be open-ended on one end of the PK range.) This would be similar to doing a read with paging.
>  * Once one page of data is read, compute a hash of it and send it to the other replicas along with the PK range that it covers and a row count.
>  * When the replicas receive the hash, they perform a read over the same PK range (using a LIMIT of the row count + 1) and compare hashes (unless the row counts don't match, in which case this can be skipped).
>  * If there is a mismatch, the replica will send a mutation covering that page's worth of data (ignoring the row count this time) to the source node.
> Here are the advantages that I can think of:
>  * With the current repair behavior of streaming, vnode-enabled clusters may need to stream hundreds of small SSTables.  This results in increased compaction load on the receiving node.  With the mutation-based approach, memtables would naturally merge these.
>  * It's simple to throttle.  For example, you could give a number of rows/sec that should be repaired.
>  * It's easy to see what PK range has been repaired so far.  This could make it simpler to resume a repair that fails midway.
>  * Inconsistencies start to be repaired almost right away.
>  * Less special code (?)
>  * Wide partitions are no longer a problem.
> There are a few problems I can think of:
>  * Counters.  I don't know if this can be made safe, or if they need to be skipped.
>  * To support incremental repair, we need to be able to read from only repaired sstables.  Probably not too difficult to do.
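The paged hash exchange described in the quoted proposal can be sketched as follows (all names hypothetical; real code would hash serialized unfiltered rows rather than strings, and MD5 here is only a stand-in digest):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public final class PageHashSketch
{
    /** Hash one page of rows; the coordinator ships this digest together
     *  with the page's PK range and row count to each replica. */
    public static byte[] hashPage(List<String> rows)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String row : rows)
                md.update(row.getBytes(StandardCharsets.UTF_8));
            return md.digest();
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new AssertionError(e);   // MD5 is always available
        }
    }

    /** Replica side: re-read the same range with LIMIT rowCount + 1 and
     *  compare. A differing count or digest means the page mismatches,
     *  triggering a mutation back to the source node. */
    public static boolean pageMatches(List<String> localRows, int rowCount, byte[] digest)
    {
        if (localRows.size() != rowCount)
            return false;   // count mismatch: skip hashing, page differs
        return MessageDigest.isEqual(hashPage(localRows), digest);
    }
}
```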



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)