Posted to commits@cassandra.apache.org by "Edward Capriolo (JIRA)" <ji...@apache.org> on 2010/06/01 18:11:36 UTC

[jira] Created: (CASSANDRA-1149) If node join fails process should recover or terminate

If node join fails process should recover or terminate
------------------------------------------------------

Key: CASSANDRA-1149
URL: https://issues.apache.org/jira/browse/CASSANDRA-1149
Project: Cassandra
Issue Type: Improvement
Affects Versions: 0.6.1
Reporter: Edward Capriolo

Being proactive is great, but at times a node has to be joined while a Cassandra cluster is already overtaxed. A variety of (bad) things can happen in this situation.

Scenario 1: NodeB joins the cluster and attempts to get its TokenRange from NodeA. NodeA actually fails, or high load causes NodeB's gossip to detect NodeA as failed. NodeB will stay in bootstrap mode permanently.

Scenario 2: NodeB joins the cluster and attempts to get its TokenRange from NodeA. Neither node fails noticeably, but a stream stalls. NodeB will stay in bootstrap mode permanently.

Suggested features:

1. NodeB should give up and shut down if streams fail. Currently a user starts a streaming process and returns hours later, because no one is going to sit and watch. If I come back in a day and NodeB is down, I know it failed and I can try again. Currently I look at the CPU and the streams on both nodes, determine whether the source node is compacting, wait a while, and check the streams again. If there is no progress, I restart.

2. The source node does not have the same (relevant) stream list as NodeB does. In this case NodeA probably restarted. NodeB should restart the bootstrap or terminate.

3. No progress on streams. If streams are not progressing and NodeA is not compacting/anti-compacting, NodeB should give up.

4. A possible solution would be to give each transfer a UUID; if A dies, B will restart that session if A hasn't heard of the UUID.
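
The give-up rule from items 1 and 3 could be sketched roughly as below. This is only an illustration of the proposal, not existing Cassandra code: the class StreamStallWatchdog, its methods, and the timeout parameter are all hypothetical names, and the "source is alive" and "source is compacting" flags are assumed to be supplied by the caller (for example from gossip state and from whatever the source node reports).

```java
/**
 * Hypothetical watchdog for an inbound bootstrap stream on the joining node.
 * None of these names come from Cassandra; they only illustrate the rule
 * "give up if the source is gone, or if nothing moves and the source is not compacting".
 */
final class StreamStallWatchdog {
    enum Decision { KEEP_WAITING, GIVE_UP }

    private final long stallTimeoutMillis;
    private long lastBytesReceived = 0L;
    private long lastProgressAtMillis;

    StreamStallWatchdog(long stallTimeoutMillis, long startedAtMillis) {
        this.stallTimeoutMillis = stallTimeoutMillis;
        this.lastProgressAtMillis = startedAtMillis;
    }

    /** Call periodically with the total number of bytes received so far. */
    void observe(long totalBytesReceived, long nowMillis) {
        if (totalBytesReceived > lastBytesReceived) {
            lastBytesReceived = totalBytesReceived;
            lastProgressAtMillis = nowMillis;
        }
    }

    /** Decide whether the joining node should keep waiting or shut down. */
    Decision decide(long nowMillis, boolean sourceAlive, boolean sourceCompacting) {
        if (!sourceAlive) {
            return Decision.GIVE_UP;        // item 1: the stream has failed outright
        }
        if (sourceCompacting) {
            return Decision.KEEP_WAITING;   // item 3: compaction explains the lack of progress
        }
        boolean stalled = nowMillis - lastProgressAtMillis >= stallTimeoutMillis;
        return stalled ? Decision.GIVE_UP : Decision.KEEP_WAITING;
    }
}
```

On GIVE_UP the caller would log the reason and terminate the process, so that an operator returning a day later sees a stopped node instead of one stuck in bootstrap mode.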

It would be great if long-running multi-step processes like a move could restart automatically after failures, without returning to the beginning of the operation.
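
As a rough sketch of items 2 and 4 combined with the resume idea above: NodeB could track each transfer session by UUID together with the files it is still waiting for, and compare that against the set of session UUIDs NodeA reports. Everything here is an assumption made for illustration (the class TransferSessions, its methods, and the use of file names as the unit of progress); how NodeA would report its known sessions is not specified in this issue.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical bookkeeping kept by the joining node (NodeB). */
final class TransferSessions {
    // Session UUID -> files in that session that have not been fully received yet.
    private final Map<UUID, Set<String>> pendingFiles = new LinkedHashMap<>();

    /** Register a new transfer session and return its UUID. */
    UUID open(List<String> files) {
        UUID id = UUID.randomUUID();
        pendingFiles.put(id, new LinkedHashSet<>(files));
        return id;
    }

    /** Record that one file of a session has been received completely. */
    void fileCompleted(UUID session, String file) {
        Set<String> pending = pendingFiles.get(session);
        if (pending == null) {
            return;
        }
        pending.remove(file);
        if (pending.isEmpty()) {
            pendingFiles.remove(session); // the whole session is done
        }
    }

    /**
     * Sessions the source no longer knows about (it probably restarted),
     * mapped to the files that still have to be requested again.
     * Files already completed are not included, so the restart does not
     * go back to the beginning of the operation.
     */
    Map<UUID, Set<String>> toRestart(Set<UUID> knownToSource) {
        Map<UUID, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<UUID, Set<String>> e : pendingFiles.entrySet()) {
            if (!knownToSource.contains(e.getKey())) {
                result.put(e.getKey(), new LinkedHashSet<>(e.getValue()));
            }
        }
        return result;
    }
}
```

If toRestart returns an empty map, both nodes agree on the open sessions and NodeB can simply keep waiting; otherwise NodeB would either re-request the listed files or terminate.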

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (CASSANDRA-1149) If node join fails process should recover or terminate

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Capriolo updated CASSANDRA-1149:
---------------------------------------

    Description: 
Being proactive is great, but at times a node has to be joined while a Cassandra cluster is already overtaxed. A variety of (bad) things can happen in this situation.

Scenario 1: NodeB joins the cluster and attempts to get its TokenRange from NodeA. NodeA fails, or high load causes NodeB's gossip to detect NodeA as failed. NodeB will stay in bootstrap mode permanently.

Scenario 2: NodeB joins the cluster and attempts to get its range from NodeA. Neither node fails, but a stream stalls. NodeB will stay in bootstrap mode permanently.

Suggested features:

1. NodeB should give up and shut down if streams fail.

Currently a user starts a streaming process and returns hours later, because no one is going to sit and watch. If the user comes back in a day and NodeB is down, they know it failed and can try again.

Currently the user has to look at the CPU and the streams on both nodes, determine whether the source node is compacting, wait a while, and check the streams again. If there is no progress, restart.

2. The source node does not have the same (relevant) stream list as NodeB does. NodeA probably restarted. NodeB should restart the bootstrap or terminate.

3. No progress on streams. If streams are not progressing and NodeA is not compacting/anti-compacting, NodeB should shut down.

4. A possible solution would be to give each transfer a UUID; if A dies, B will restart that session if A hasn't heard of the UUID.

It would be great if long-running multi-step processes like a move could restart automatically without returning to the beginning of the operation.

  was:
Being pro-active is great, but at times joining a node needs to be done when a cassandra cluster is overtaxed. A variety of (bad) things happen in this situation.

Scenario 1: NodeB joins cluster attempts to get TokenRange from NodeA. NodeA fails actually or high load causes the gossip of NodeB to detact NodeA as failed. NodeA will stay in bootstrap mode permanently.  

Scenario 2: NodeB joins cluster attempts to get TokenRange from NodeA. Neither node will fail noticable but a stream will stall. NodeA will stay in bootstrap mode permanently. 

Suggested feature wanted:

1. NodeB should give up and shutdown if streams fail. Currently user starts a streaming process and returns hours later, because no one is going to sit and watch. If I come back in a day and NodeB is down I know it failed I can try again. Currently I look at the cpu, streams on both nodes. Determine if the source node is compacting. Wait a while run streams again. No progress, restart.

2. Source node does not have the same (relevant) stream list as you do. In this case NodeA probably restarted. NodeB should restart bootstrap or terminate 

3. No progress on streams . If streams are not progressing and Node A is not compacting/anti-compacting. NodeB should give up

4. A possible solution would be to give each transfer a UUID, and if A dies, then B will restart that session if A hasn't heard of the uuid

It would be great if long running multi-step processes like a move could restart after failures, automatically without returning to the beginning of the operation.

