Posted to commits@cassandra.apache.org by "Marcus Olsson (JIRA)" <ji...@apache.org> on 2015/12/07 10:29:11 UTC

[jira] [Commented] (CASSANDRA-10070) Automatic repair scheduling

    [ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044637#comment-15044637 ] 

Marcus Olsson commented on CASSANDRA-10070:
-------------------------------------------

[~zemeyer] I've added the ability to schedule a job remotely, so that one node can tell another node to run a certain job. Right now it is used when a node discovers that another node has been down longer than the hint window, and then tells that node to repair its ranges ASAP. The remote scheduling uses the distributed locking mechanism to prevent multiple nodes from telling the same node to run the repair at the same time.

So a simple flow could be:
Node A goes down at 12:00
Node B recognizes it and saves "Node A DOWN @ 12:00" locally
Node A comes back up at 16:00
Node B sees Node A online again at 16:00 and determines that Node A has been down since 12:00, i.e. for 4 hours.
Node B sends a repair job to Node A for each table that has a hint window that is 4 hours or less.
Node A runs all repairs
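
To make the flow above concrete, here is a rough sketch of the bookkeeping on Node B's side. It is not the code in the branch: the class and method names are hypothetical, nodes and tables are identified by plain strings, the per-table hint windows are passed in as a map, and the distributed lock and the actual sending of the job are left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the actual patch.
class DowntimeTracker {
    // When each peer was first seen as down, keyed by a node identifier.
    private final Map<String, Instant> downSince = new HashMap<>();

    // Record the first time a peer is seen down; repeated reports keep the original timestamp.
    void markDown(String node, Instant when) {
        downSince.putIfAbsent(node, when);
    }

    // Called when the peer is seen online again. Returns, in sorted order, the tables
    // whose hint window is less than or equal to the observed downtime.
    // The stored "down" record is cleared, so a second call returns an empty list.
    List<String> tablesToRepair(String node, Instant now, Map<String, Duration> hintWindows) {
        List<String> tables = new ArrayList<>();
        Instant since = downSince.remove(node);
        if (since == null) {
            return tables; // node was never recorded as down
        }
        Duration downtime = Duration.between(since, now);
        for (Map.Entry<String, Duration> entry : hintWindows.entrySet()) {
            if (entry.getValue().compareTo(downtime) <= 0) {
                tables.add(entry.getKey());
            }
        }
        Collections.sort(tables);
        return tables;
    }
}
```

With the times from the example (down at 12:00, up at 16:00), a table with a 3 or 4 hour hint window would be selected for repair, while one with a 6 hour window would not.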

---

I'll continue to work on the feature for pausing all repairs and on the prevention mechanism. I've done some work on the prevention mechanism for jobs: it checks the job history for repairs and only reports that it *can* run a repair if some range hasn't been repaired within the hint window (it's still based on the interval though, so the repair shouldn't run more than once per interval in the normal case).
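
As an illustration of the history check only, the rule "can run if any range hasn't been repaired within the hint window" could look roughly like this. The names are hypothetical, the repair history is reduced to a map from a range identifier to its last repair time (null meaning never repaired), and the interval check is omitted.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical sketch; not the class used in the branch.
class RepairGate {
    // Returns true if at least one range has no recorded repair, or its last
    // repair is older than (now - hintWindow). Returns false when every range
    // was repaired within the hint window, or when there are no ranges at all.
    static boolean canRunRepair(Map<String, Instant> lastRepairedByRange,
                                Instant now,
                                Duration hintWindow) {
        Instant cutoff = now.minus(hintWindow);
        for (Instant lastRepaired : lastRepairedByRange.values()) {
            if (lastRepaired == null || lastRepaired.isBefore(cutoff)) {
                return true;
            }
        }
        return false;
    }
}
```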

I should probably also extend the prevention mechanism so that it avoids running multiple repairs on a single node at the same time. After that I'll add the ability to run parallel repair tasks across the cluster.

---

The git branch is [here|https://github.com/emolsson/cassandra/commits/10070].

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a required task, but this can both be hard for new users and it also requires a bit of manual configuration. There are good tools out there that can be used to simplify things, but wouldn't this be a good feature to have inside of Cassandra? To automatically schedule and run repairs, so that when you start up your cluster it basically maintains itself in terms of normal anti-entropy, with the possibility for manual configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)