Posted to commits@cassandra.apache.org by "Jason Brown (JIRA)" <ji...@apache.org> on 2013/10/15 14:32:42 UTC

[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

    [ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795120#comment-13795120 ] 

Jason Brown edited comment on CASSANDRA-5351 at 10/15/13 12:31 PM:
-------------------------------------------------------------------

Interesting ideas here. However, here are some problems off the top of my head that need to be addressed (in no particular order):

* Nodes that are fully replaced (see: Netflix running in the cloud). When a node is replaced, we bootstrap it by streaming data from the closest peers, usually in the local DC. The new node would not have anti-compacted sstables, as it has never had a chance to repair. I'm not sure whether bootstrapped data can be considered anti-compacted through commutativity; it might be true, but I'd need to think about it more. Assuming not, when this new node is involved in any repair, it would generate a different merkle tree than its already-repaired peers, and thus all hell would break loose streaming already-repaired data to every node involved in the repair, worse than today's repair (think streaming TBs of data across multiple Amazon datacenters); see the sketch after this list. If we can prove that the new node's data is commutatively repaired just by bootstrap, then this is not a problem as such. Note this also affects move (to a lesser degree) and rebuild.
* Consider nodes A, B, and C. If nodes A and B successfully repair, but C fails to repair with them during the repair (due to partitioning, an app crash, etc.), C is forced to do an -ipr repair, as A and B have already anti-compacted and that is the only way C will be able to repair against them.
* If the operator chooses to cancel the repair, we are left in an indeterminate state with respect to which nodes have successfully completed repairs with one another (similar to the last point).
* Local DC repair vs. global repair is largely incompatible with this. It looks like you get one shot at repairing each sstable's range, so if you choose to do a local DC repair with an sstable, you are forced to do -ipr if you later want to repair globally.
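
To make the first bullet concrete, here is a rough sketch (hypothetical names, not the actual repair code) of why a freshly bootstrapped replacement node would build a different merkle tree than its already-repaired peers, assuming validation only covers sstables not yet marked as repaired:

{code}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BootstrapRepairSketch
{
    // Hypothetical stand-in for an sstable carrying a "repairedAt" marker; 0 means never repaired.
    static class SSTable
    {
        final String name;
        final long repairedAt;
        SSTable(String name, long repairedAt) { this.name = name; this.repairedAt = repairedAt; }
    }

    // If repair only validates sstables that are not yet marked repaired,
    // this is the set each node would build its merkle tree from.
    static List<SSTable> toValidate(List<SSTable> sstables)
    {
        return sstables.stream().filter(s -> s.repairedAt == 0).collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        // Peer that has repaired before: most of its data is already marked repaired.
        List<SSTable> peer = Arrays.asList(new SSTable("peer-old", 1381837200L),
                                           new SSTable("peer-new", 0L));
        // Replacement node: everything arrived via bootstrap streaming, nothing is marked repaired.
        List<SSTable> replacement = Arrays.asList(new SSTable("streamed-1", 0L),
                                                  new SSTable("streamed-2", 0L));

        // The two nodes validate very different amounts of data, so their merkle trees
        // mismatch and "already repaired" data gets streamed all over again.
        System.out.println("peer validates " + toValidate(peer).size() + " sstable(s)");
        System.out.println("replacement validates " + toValidate(replacement).size() + " sstable(s)");
    }
}
{code}

The replacement node's tree covers all of its data while the peers' trees only cover what has arrived since their last repair, so the trees mismatch and already-repaired data gets re-streamed.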

Note that these problems are magnified immensely when you run in multiple datacenters, especially datacenters separated by great distances.

While none of these situations is unsolvable, it seems there are many non-obvious ways to get into a non-deterministic state in which operators will either see tons of data being streamed due to differing anti-compaction points, or will be forced to run -ipr without an easily understood reason. I already see operators terminate repair jobs because "they hang" or "take too long", for better or worse (mostly worse). At that point, the operator is pretty much required to do an -ipr repair, which gets us back to the same situation we are in today, but with more confusion and possibly with -ipr as the default.

It would probably be good to run -ipr as a best practice anyway every n days/weeks/months, to help with bit rot. I worry about the very non-obvious edge cases this ticket introduces and the possibility that operators will simply fall back to using -ipr whenever something goes bump or doesn't make sense.
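
For what it's worth, the "every n days/weeks/months" part could be as simple as a timestamp check around whatever drives repair; a minimal sketch (hypothetical helper, nothing Cassandra-specific) under that assumption:

{code}
import java.util.concurrent.TimeUnit;

public class FullRepairSchedulerSketch
{
    // Run a full (-ipr style) repair at least every n days, regardless of incremental repairs.
    static final long FULL_REPAIR_INTERVAL_MILLIS = TimeUnit.DAYS.toMillis(30);

    static boolean shouldRunFullRepair(long lastFullRepairMillis, long nowMillis)
    {
        return nowMillis - lastFullRepairMillis >= FULL_REPAIR_INTERVAL_MILLIS;
    }

    public static void main(String[] args)
    {
        // 44 days since the last full repair exceeds the 30-day interval, so schedule one now.
        long lastFull = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(44);
        System.out.println("run full repair? " + shouldRunFullRepair(lastFull, System.currentTimeMillis()));
    }
}
{code}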

Thanks for listening.



> Avoid repairing already-repaired data by default
> ------------------------------------------------
>
>                 Key: CASSANDRA-5351
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
>             Project: Cassandra
>          Issue Type: Task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Lyuben Todorov
>              Labels: repair
>             Fix For: 2.1
>
>
> Repair has always built its merkle tree from all the data in a columnfamily, which is guaranteed to work but is inefficient.
> We can improve this by remembering which sstables have already been successfully repaired, and only repairing sstables new since the last repair.  (This automatically makes CASSANDRA-3362 much less of a problem too.)
> The tricky part is, compaction will (if not taught otherwise) mix repaired data together with non-repaired.  So we should segregate unrepaired sstables from the repaired ones.



--
This message was sent by Atlassian JIRA
(v6.1#6144)