Posted to dev@cassandra.apache.org by Anuj Wadehra <an...@yahoo.co.in> on 2016/01/15 19:06:41 UTC

Repair when a replica is Down

Hi 
We are on 2.0.14 with RF=3 in a 3-node cluster, and we use repair -pr. Recently, we observed that repair -pr fails for all nodes if a node is down. Then I found the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290, where an intentional decision was taken to abort the repair if a replica is down.
I need to understand the reasoning behind aborting the repair instead of proceeding with the available replicas.
I have the following concerns with the approach:
We say we have a fault-tolerant Cassandra system that can afford a single node failure, because RF=3 and we read/write at QUORUM (a quorum is 2 of the 3 replicas, so one replica can be down). But when a node goes down and we are not sure how much time will be needed to restore it, the health of the entire system is in question: gc_grace_seconds is approaching and we are not able to run repair -pr on any of the nodes.
Then there is a dilemma: whether to remove the faulty node well before the gc grace period expires, so that we get enough time to save the data by repairing the other two nodes?
This may cause massive streaming, which may be unnecessary if we are able to bring the faulty node back before the gc grace period expires.
OR
Wait and hope that the issue is resolved before gc grace time, leaving us some buffer to run repair -pr on all nodes.
OR
Increase the gc grace period temporarily. Then we need capacity planning to accommodate the extra storage required for the extra gc grace that may be needed in node failure scenarios.
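For concreteness, the temporary increase itself is just a one-line schema change per table. A minimal sketch using the DataStax Python driver, where the keyspace/table names and the bump from the 10-day default to 20 days are hypothetical:

    from cassandra.cluster import Cluster

    # Hypothetical contact point; adjust for the cluster.
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    # Widen the tombstone window from the 10-day default (864000 s) to 20 days.
    # Tombstones are now kept twice as long, so capacity planning must cover
    # the extra disk usage until this is reverted.
    session.execute("ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 1728000")

    # Once the failed node is recovered (or replaced) and repairs have finished:
    session.execute("ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 864000")

    cluster.shutdown()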

I also need to understand the recommended approach for maintaining a fault-tolerant system that can handle such node failures without hiccups.

Thanks
Anuj

Re: Repair when a replica is Down

Posted by Anuj Wadehra <an...@yahoo.co.in>.
Hi
I have intentionally posted this message to the dev mailing list instead of the users list because it concerns a conscious design decision taken for a bug, and I feel the dev team is the most appropriate one to respond. Please let me know if there is a better way to get it addressed.
Thanks
Anuj
 
On Fri, 15 Jan, 2016 at 11:36 pm, Anuj Wadehra <an...@yahoo.co.in> wrote:

Hi
We are on 2.0.14 with RF=3 in a 3-node cluster, and we use repair -pr. Recently, we observed that repair -pr fails for all nodes if a node is down. Then I found the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290, where an intentional decision was taken to abort the repair if a replica is down.
I need to understand the reasoning behind aborting the repair instead of proceeding with the available replicas.
I have the following concerns with the approach:
We say we have a fault-tolerant Cassandra system that can afford a single node failure, because RF=3 and we read/write at QUORUM. But when a node goes down and we are not sure how much time will be needed to restore it, the health of the entire system is in question: gc_grace_seconds is approaching and we are not able to run repair -pr on any of the nodes.
Then there is a dilemma: whether to remove the faulty node well before the gc grace period expires, so that we get enough time to save the data by repairing the other two nodes?
This may cause massive streaming, which may be unnecessary if we are able to bring the faulty node back before the gc grace period expires.
OR
Wait and hope that the issue is resolved before gc grace time, leaving us some buffer to run repair -pr on all nodes.
OR
Increase the gc grace period temporarily. Then we need capacity planning to accommodate the extra storage required for the extra gc grace that may be needed in node failure scenarios.

I also need to understand the recommended approach for maintaining a fault-tolerant system that can handle such node failures without hiccups.

Thanks
Anuj

Re: Repair when a replica is Down

Posted by Anuj Wadehra <an...@yahoo.co.in>.
Actually, I have not checked how the repair -pr abort logic is implemented in the code. So, irrespective of repair -pr or full repair scenarios, the problem can be stated as follows:
20-node cluster, RF=5, read/write at QUORUM, gc grace period of 20 days. If a node goes down, the 1/20th of the data for which the failed node is responsible (owner) cannot be repaired, as 1 of its 5 replicas is down. This puts the health of the entire system in question after just a single node failure.

Thanks
Anuj




 
On Tue, 19 Jan, 2016 at 11:12 pm, Anuj Wadehra <an...@yahoo.co.in> wrote:

Hi Tyler,
I think the scenario needs some correction: 20-node cluster, RF=5, read/write at QUORUM, gc grace period of 20 days. If a node goes down, repair -pr would fail on the 4 nodes holding the other replicas, and a full repair would fail on an even greater number of nodes, but not 19. Please confirm.
In any case, system health is impacted, as multiple nodes cannot be repaired after a single node failure.
Thanks
Anuj
 
On Tue, 19 Jan, 2016 at 10:48 pm, Anuj Wadehra <an...@yahoo.co.in> wrote:

There is a JIRA issue: https://issues.apache.org/jira/browse/CASSANDRA-10446 .
But it is open with Minor priority and typed as an Improvement. I think it is a very valid concern for everyone, and especially for users with bigger clusters; it is more an issue with a design decision than an improvement. Can we change its priority so that it gets appropriate attention?

Thanks
Anuj
 
On Tue, 19 Jan, 2016 at 10:35 pm, Tyler Hobbs <ty...@datastax.com> wrote:
On Tue, Jan 19, 2016 at 10:44 AM, Anuj Wadehra <an...@yahoo.co.in> wrote:


Consider a scenario where I have a 20-node cluster, RF=5, read/write at QUORUM, and a gc grace period of 20 days. My cluster is fault tolerant and can afford 2 node failures. Suddenly, one node goes down due to a hardware issue. It has been 10 days since the node went down, none of the 19 nodes are being repaired, and now it is decision time. I am not sure how soon the issue will be fixed, maybe 8 days before gc grace expires, so I should not remove the node early and add it back, as that would cause unnecessary streaming. At the same time, if I do not remove the failed node, the health of my entire system is in question, and it is a panic situation: no data has been repaired in the last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes.
What looked like a fault-tolerant system that can afford 2 node failures required urgent attention and manual decision making when a single node went down. Why can't we just go ahead and repair the remaining replicas if some replicas are down? If the failed node comes up before the gc grace period expires, we would run repair to fix the inconsistencies; otherwise, we would discard its data and bootstrap it. I think that would make for a really robust, fault-tolerant system.

That makes sense.  It seems like having the option to ignore down replicas during repair could be at least somewhat helpful, although it may be tricky to decide how this should interact with incremental repairs.  If there isn't a jira ticket for this already, can you open one with the scenario above?


-- 
Tyler Hobbs
DataStax
    
  

Re: Repair when a replica is Down

Posted by Anuj Wadehra <an...@yahoo.co.in>.
Hi Tyler,
I think the scenario needs some correction: 20-node cluster, RF=5, read/write at QUORUM, gc grace period of 20 days. If a node goes down, repair -pr would fail on the 4 nodes holding the other replicas, and a full repair would fail on an even greater number of nodes, but not 19. Please confirm.
In any case, system health is impacted, as multiple nodes cannot be repaired after a single node failure.
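As a back-of-the-envelope check, a toy ring model gives the same counts; this is only a sketch, assuming SimpleStrategy-style placement where a range's replicas are the RF consecutive nodes starting at its primary owner:

    # Toy model: 20-node ring, RF=5, node 0 down.
    RF, N, DOWN = 5, 20, 0

    # Replica set of the range whose primary owner is node p.
    def replicas(p):
        return {(p + i) % N for i in range(RF)}

    # repair -pr on node x covers only x's primary range, so it is blocked
    # when the down node is among that range's replicas.
    pr_blocked = [x for x in range(N) if x != DOWN and DOWN in replicas(x)]

    # Full repair on node x covers every range x replicates, i.e. the primary
    # ranges of nodes x-RF+1 .. x; it is blocked if any of those replica sets
    # contains the down node.
    full_blocked = [x for x in range(N) if x != DOWN and
                    any(DOWN in replicas((x - i) % N) for i in range(RF))]

    print(len(pr_blocked), len(full_blocked))  # prints: 4 8

Under that model, repair -pr is blocked on 4 live nodes and full repair on 8 of the 19 live nodes, which matches "a greater number, but not 19".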
Thanks
Anuj
 
On Tue, 19 Jan, 2016 at 10:48 pm, Anuj Wadehra <an...@yahoo.co.in> wrote:

There is a JIRA issue: https://issues.apache.org/jira/browse/CASSANDRA-10446 .
But it is open with Minor priority and typed as an Improvement. I think it is a very valid concern for everyone, and especially for users with bigger clusters; it is more an issue with a design decision than an improvement. Can we change its priority so that it gets appropriate attention?

Thanks
Anuj
 
On Tue, 19 Jan, 2016 at 10:35 pm, Tyler Hobbs <ty...@datastax.com> wrote:
On Tue, Jan 19, 2016 at 10:44 AM, Anuj Wadehra <an...@yahoo.co.in> wrote:


Consider a scenario where I have a 20-node cluster, RF=5, read/write at QUORUM, and a gc grace period of 20 days. My cluster is fault tolerant and can afford 2 node failures. Suddenly, one node goes down due to a hardware issue. It has been 10 days since the node went down, none of the 19 nodes are being repaired, and now it is decision time. I am not sure how soon the issue will be fixed, maybe 8 days before gc grace expires, so I should not remove the node early and add it back, as that would cause unnecessary streaming. At the same time, if I do not remove the failed node, the health of my entire system is in question, and it is a panic situation: no data has been repaired in the last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes.
What looked like a fault-tolerant system that can afford 2 node failures required urgent attention and manual decision making when a single node went down. Why can't we just go ahead and repair the remaining replicas if some replicas are down? If the failed node comes up before the gc grace period expires, we would run repair to fix the inconsistencies; otherwise, we would discard its data and bootstrap it. I think that would make for a really robust, fault-tolerant system.

That makes sense.  It seems like having the option to ignore down replicas during repair could be at least somewhat helpful, although it may be tricky to decide how this should interact with incremental repairs.  If there isn't a jira ticket for this already, can you open one with the scenario above?


-- 
Tyler Hobbs
DataStax
    

Re: Repair when a replica is Down

Posted by Anuj Wadehra <an...@yahoo.co.in>.
There is a JIRA issue: https://issues.apache.org/jira/browse/CASSANDRA-10446 .
But it is open with Minor priority and typed as an Improvement. I think it is a very valid concern for everyone, and especially for users with bigger clusters; it is more an issue with a design decision than an improvement. Can we change its priority so that it gets appropriate attention?

Thanks
Anuj
 
On Tue, 19 Jan, 2016 at 10:35 pm, Tyler Hobbs <ty...@datastax.com> wrote:
On Tue, Jan 19, 2016 at 10:44 AM, Anuj Wadehra <an...@yahoo.co.in> wrote:


Consider a scenario where I have a 20-node cluster, RF=5, read/write at QUORUM, and a gc grace period of 20 days. My cluster is fault tolerant and can afford 2 node failures. Suddenly, one node goes down due to a hardware issue. It has been 10 days since the node went down, none of the 19 nodes are being repaired, and now it is decision time. I am not sure how soon the issue will be fixed, maybe 8 days before gc grace expires, so I should not remove the node early and add it back, as that would cause unnecessary streaming. At the same time, if I do not remove the failed node, the health of my entire system is in question, and it is a panic situation: no data has been repaired in the last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes.
What looked like a fault-tolerant system that can afford 2 node failures required urgent attention and manual decision making when a single node went down. Why can't we just go ahead and repair the remaining replicas if some replicas are down? If the failed node comes up before the gc grace period expires, we would run repair to fix the inconsistencies; otherwise, we would discard its data and bootstrap it. I think that would make for a really robust, fault-tolerant system.

That makes sense.  It seems like having the option to ignore down replicas during repair could be at least somewhat helpful, although it may be tricky to decide how this should interact with incremental repairs.  If there isn't a jira ticket for this already, can you open one with the scenario above?


-- 
Tyler Hobbs
DataStax
  

Re: Repair when a replica is Down

Posted by Tyler Hobbs <ty...@datastax.com>.
On Tue, Jan 19, 2016 at 10:44 AM, Anuj Wadehra <an...@yahoo.co.in>
wrote:

>
> Consider a scenario where I have a 20-node cluster, RF=5, read/write at
> QUORUM, and a gc grace period of 20 days. My cluster is fault tolerant and
> can afford 2 node failures. Suddenly, one node goes down due to a hardware
> issue. It has been 10 days since the node went down, none of the 19 nodes
> are being repaired, and now it is decision time. I am not sure how soon the
> issue will be fixed, maybe 8 days before gc grace expires, so I should not
> remove the node early and add it back, as that would cause unnecessary
> streaming. At the same time, if I do not remove the failed node, the health
> of my entire system is in question, and it is a panic situation: no data
> has been repaired in the last 10 days and gc grace is approaching. I need
> sufficient time to repair 19 nodes.
>
> What looked like a fault-tolerant system that can afford 2 node failures
> required urgent attention and manual decision making when a single node
> went down. Why can't we just go ahead and repair the remaining replicas if
> some replicas are down? If the failed node comes up before the gc grace
> period expires, we would run repair to fix the inconsistencies; otherwise,
> we would discard its data and bootstrap it. I think that would make for a
> really robust, fault-tolerant system.
>

That makes sense.  It seems like having the option to ignore down replicas
during repair could be at least somewhat helpful, although it may be tricky
to decide how this should interact with incremental repairs.  If there
isn't a jira ticket for this already, can you open one with the scenario
above?


-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Repair when a replica is Down

Posted by Anuj Wadehra <an...@yahoo.co.in>.
Thanks, Tyler!
I understand that we need to consider a node lost when it has been down for gc grace, and bootstrap it. My question is more about the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290, where an intentional decision was taken to abort the repair if a single replica is down. Precisely, I need to understand the reasoning behind aborting the repair instead of proceeding with the available replicas. As it is related to a specific fix, I thought the developers involved in the decision could best explain the reasoning, so I posted it to the dev list first.
Consider a scenario where I have a 20-node cluster, RF=5, read/write at QUORUM, and a gc grace period of 20 days. My cluster is fault tolerant and can afford 2 node failures. Suddenly, one node goes down due to a hardware issue. It has been 10 days since the node went down, none of the 19 nodes are being repaired, and now it is decision time. I am not sure how soon the issue will be fixed, maybe 8 days before gc grace expires, so I should not remove the node early and add it back, as that would cause unnecessary streaming. At the same time, if I do not remove the failed node, the health of my entire system is in question, and it is a panic situation: no data has been repaired in the last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes.
What looked like a fault-tolerant system that can afford 2 node failures required urgent attention and manual decision making when a single node went down. Why can't we just go ahead and repair the remaining replicas if some replicas are down? If the failed node comes up before the gc grace period expires, we would run repair to fix the inconsistencies; otherwise, we would discard its data and bootstrap it. I think that would make for a really robust, fault-tolerant system.

Thanks
Anuj

 
 
On Tue, 19 Jan, 2016 at 9:44 pm, Tyler Hobbs <ty...@datastax.com> wrote:

On Fri, Jan 15, 2016 at 12:06 PM, Anuj Wadehra <an...@yahoo.co.in>
wrote:

> Increase the gc grace period temporarily. Then we need capacity planning
> to accommodate the extra storage required for the extra gc grace that may
> be needed in node failure scenarios.


I would do this.  Nodes that are down for longer than gc_grace_seconds
should not re-enter the cluster, because they may contain data that has
been deleted whose tombstone has already been purged (repairing doesn't
change this).  Bringing them back up will result in "zombie" data.

Also, I do think that the user mailing list is a better place for the first
round of this conversation.

-- 
Tyler Hobbs
DataStax <http://datastax.com/>
  

Re: Repair when a replica is Down

Posted by Tyler Hobbs <ty...@datastax.com>.
On Fri, Jan 15, 2016 at 12:06 PM, Anuj Wadehra <an...@yahoo.co.in>
wrote:

> Increase the gc grace period temporarily. Then we need capacity planning
> to accommodate the extra storage required for the extra gc grace that may
> be needed in node failure scenarios.


I would do this.  Nodes that are down for longer than gc_grace_seconds
should not re-enter the cluster, because they may contain data that has
been deleted whose tombstone has already been purged (repairing doesn't
change this).  Bringing them back up will result in "zombie" data.
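To make the hazard concrete, here is a minimal timeline sketch of how zombie data arises (all timestamps hypothetical):

    # Zombie-data timeline; all values are hypothetical.
    DAY = 86400
    gc_grace_seconds = 10 * DAY

    t_down   = 0 * DAY   # a replica goes down
    t_delete = 1 * DAY   # a row is deleted; the down replica never sees the tombstone
    t_purge  = t_delete + gc_grace_seconds  # tombstone may be compacted away from here on

    t_rejoin = 12 * DAY  # the replica returns, more than gc_grace after the delete
    assert t_rejoin > t_purge

    # The returning replica still holds the live, pre-delete copy of the row,
    # while the healthy replicas may have purged both the row and its tombstone.
    # Nothing is left to shadow the stale copy, so reads, read repair, or
    # anti-entropy repair can propagate it back: "zombie" data.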

Also, I do think that the user mailing list is a better place for the first
round of this conversation.

-- 
Tyler Hobbs
DataStax <http://datastax.com/>