Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2010/06/22 16:28:56 UTC

[jira] Created: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

removetoken drops node from ring before re-replicating its data is finished
---------------------------------------------------------------------------

                 Key: CASSANDRA-1216
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Jonathan Ellis
             Fix For: 0.6.3


This means that if something goes wrong during the re-replication (e.g. a source node is restarted), there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0001-Modify-removeToken-to-be-similar-to-decommission.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884714#action_12884714 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

A possible solution I see for this is to keep nodes in the justRemovedEndpoints map in Gossiper until we can verify that replication has completed.  I think we could accomplish verification through a callback on the replicate request.  I'm unsure about what data gets persisted, so I don't know whether a restart would wipe out the justRemovedEndpoints map.
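
As a rough illustration of the idea above, here is a minimal sketch of bookkeeping that keeps a removed endpoint around until every new replica has confirmed, via a callback, that re-replication finished. The class PendingRemovals and all of its method names are hypothetical and are not part of Cassandra's Gossiper; endpoints are plain strings for brevity.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the bookkeeping discussed above; not Cassandra's Gossiper.
final class PendingRemovals {
    // Removed endpoint -> replicas that have not yet confirmed re-replication.
    private final Map<String, Set<String>> awaiting = new ConcurrentHashMap<>();

    // Record a removal together with the replicas that must confirm it.
    void startRemoval(String removedEndpoint, Set<String> newReplicas) {
        Set<String> pending = ConcurrentHashMap.newKeySet();
        pending.addAll(newReplicas);
        awaiting.put(removedEndpoint, pending);
    }

    // Callback invoked when one replica reports that its replicate request finished.
    // Returns true once the last replica has confirmed and the entry has been dropped.
    boolean onReplicationFinished(String removedEndpoint, String replica) {
        Set<String> pending = awaiting.get(removedEndpoint);
        if (pending == null) {
            return false; // unknown or already completed removal
        }
        pending.remove(replica);
        if (pending.isEmpty()) {
            awaiting.remove(removedEndpoint);
            return true;
        }
        return false;
    }

    // True while at least one replica has not confirmed.
    boolean isStillPending(String removedEndpoint) {
        return awaiting.containsKey(removedEndpoint);
    }
}
```

The map in this sketch lives only in memory, so a restart would wipe it out; whether the real state survives a restart is exactly the open question raised in the comment.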

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>             Fix For: 0.7
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0001-Modify-removeToken-to-be-similar-to-decommission.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899993#action_12899993 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

Yeah, I wasn't really understanding that streaming/messaging code at all.

The current StreamOut implementation does have a callback concept, however.  I think this should be moved into the StreamContext object so that both StreamOut and StreamIn can perform callbacks on actual stream completion.
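
A minimal sketch of what a completion callback held by a shared context object might look like is given below. StreamContextSketch is a hypothetical class, not the real StreamContext; the assumption is that both the sending and the receiving side call streamFinished() when the stream has actually ended, and that the callback should run only once.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; this is not Cassandra's StreamContext.
final class StreamContextSketch {
    private final Runnable onComplete;
    private final AtomicBoolean fired = new AtomicBoolean(false);

    StreamContextSketch(Runnable onComplete) {
        this.onComplete = onComplete;
    }

    // Called by either the sending or the receiving side once the stream has ended.
    // compareAndSet guarantees the callback runs at most once, even if both sides report.
    void streamFinished() {
        if (fired.compareAndSet(false, true)) {
            onComplete.run();
        }
    }
}
```

Keeping the callback on the context rather than on one direction of the stream means the caller does not need to know which side observes completion first.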

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0002-Fixes-to-old-tests.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898442#action_12898442 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

Re-rebased.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900015#action_12900015 ] 

Gary Dusbabek commented on CASSANDRA-1216:
------------------------------------------

The StreamOut callback works differently than the MessagingService callback.  Your approach sounds workable.  I don't think it matters where you push the callback to, so long as you make sure it gets executed after the stream is finished.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0003-Fixes-to-old-tests.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902511#action_12902511 ] 

Gary Dusbabek commented on CASSANDRA-1216:
------------------------------------------

Some questions about the coordinator...  I see that removeToken() is quasi-blocking now, like unbootstrap() (it was fire-and-forget before).  What are the consequences of the coordinator node going down?  Assuming a dead coordinator, would it be Bad for another node to remove-token on the same token while the transfers initiated by the original failed coordinator were in process?  Or assuming the transfers were finished, would a remove-token on a new coordinator generally do little other than get the state to LEFT?

I think I'm of the opinion that removeToken should either block until the transfer is complete (or has failed) or return instantly, and that we need to make sure that subsequent removeToken calls do not upset existing transfers.  Having it return an error after a timeout (which is possible in the case of LOTS of data) makes me think we should be doing this differently.

Or is the only recourse to repair?

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0001-Modify-removeToken-to-be-similar-to-decommission.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0002-Additional-tests-for-removeToken.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0002-Fixes-to-old-tests.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902523#action_12902523 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

I believe the only consequence of calling removeToken on another node when the coordinator goes down would be that the entire operation would be repeated, so any data that was transferred before would be transferred again.  I think this is the right behavior, since there is no way of knowing what was transferred before the coordinator went down.

It might be useful to add a 'force' option though.  If the coordinator goes down and the token gets stuck in a REMOVING state, you may want to force removal rather than redoing the entire operation.

It should be possible to remove the timeout so that removeToken blocks until the transfer is completely finished.  The code for streaming in the remote data blocks until all streams are complete and the code for sending a confirmation to the coordinator will keep retrying until it is received or the coordinator dies.  

I think this would work if a check was added so that you can only call removeToken a second time if the coordinator is down.  It wouldn't handle two calls that occurred before the state made its way through gossip though.  
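
The check described above could be sketched roughly as follows. RemoveTokenGuardSketch, tryStart, and the liveness predicate are all hypothetical names used for illustration; in particular, the guard only sees state known locally, so it does not address the case of two calls arriving before the state has propagated through gossip.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical guard; not part of Cassandra's StorageService.
final class RemoveTokenGuardSketch {
    // Token being removed -> coordinator that started the removal.
    private final Map<String, String> coordinatorByToken = new HashMap<>();
    // Assumed liveness check, e.g. backed by a failure detector.
    private final Predicate<String> isAlive;

    RemoveTokenGuardSketch(Predicate<String> isAlive) {
        this.isAlive = isAlive;
    }

    // Allows a removal to start only if no live coordinator already owns it.
    synchronized boolean tryStart(String token, String coordinator) {
        String current = coordinatorByToken.get(token);
        if (current != null && isAlive.test(current)) {
            return false; // original coordinator is still up; do not upset its transfers
        }
        coordinatorByToken.put(token, coordinator);
        return true;
    }
}
```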



> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Modify-removeToken-to-be-similar-to-decommission.patch
                0002-Additional-tests-for-removeToken.patch

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Modify-removeToken-to-be-similar-to-decommission.patch
                0002-Additional-tests-for-removeToken.patch

Patches:
 * 0001
 ** Modifies the removeToken operation to follow a pattern of NORMAL->REMOVING->LEFT, rather than the current pattern of a coordinator node setting its own status to a special-cased version of NORMAL.
 ** Fixes a small bug in StreamHeader serialization
 ** Adds the ability to either get the status of a remove operation taking place or force a remove operation to finish immediately
 * 0002
 ** Tests for removing tokens
 ** Moves shared code for creating a ring to the Util class


Removal Process:
 * Normal Case
 *# Coordinator sets status of failed node to REMOVING
 *# Coordinator blocks on confirmation from other nodes
 *# Any newly responsible nodes stream data
 *# Newly responsible nodes send confirmation once all data has streamed
 *# Coordinator updates status of failed node to LEFT
 *# Done
 * Failure Cases
 ** Coordinator failure
 *** If the coordinator fails, the remove operation will need to be retried.
 *** This can be done on any node in the cluster.
 ** Newly responsible node failure
 *** If a newly responsible node fails but comes back up, it should see the REMOVING status in gossip and restart the operation.
 *** If a newly responsible node fails permanently, or a streaming operation fails and the node stays up, the coordinator will block forever while waiting for confirmation.  The best solution is to force the remove operation to complete and then run repair on the failed node.
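
The removal process above can be sketched as a small state machine in which the coordinator waits on a latch that each newly responsible node counts down. All names here (RemovalState, RemovalCoordinatorSketch, begin, confirm, awaitCompletion, force) are hypothetical and do not come from the patch. The timeout parameter is an assumption added only so the sketch can be exercised without hanging; the description above has the coordinator block until confirmation arrives or the operation is forced.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

enum RemovalState { NORMAL, REMOVING, LEFT }

// Hypothetical coordinator-side sketch; not the code from the attached patches.
final class RemovalCoordinatorSketch {
    private volatile RemovalState state = RemovalState.NORMAL;
    private final CountDownLatch confirmations;

    RemovalCoordinatorSketch(int newlyResponsibleNodes) {
        this.confirmations = new CountDownLatch(newlyResponsibleNodes);
    }

    // Step 1: mark the failed node as REMOVING.
    void begin() {
        state = RemovalState.REMOVING;
    }

    // Step 4: called once per newly responsible node after its data has streamed.
    void confirm() {
        confirmations.countDown();
    }

    // Steps 2 and 5: wait for all confirmations, then move to LEFT.
    // Returns false if the wait timed out or was interrupted.
    boolean awaitCompletion(long timeout, TimeUnit unit) {
        if (state == RemovalState.NORMAL) {
            throw new IllegalStateException("removal has not been started");
        }
        try {
            if (confirmations.await(timeout, unit)) {
                state = RemovalState.LEFT;
                return true;
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Failure case: stop waiting and finish immediately; repair is expected afterwards.
    void force() {
        state = RemovalState.LEFT;
        while (confirmations.getCount() > 0) {
            confirmations.countDown(); // release any thread blocked in awaitCompletion
        }
    }

    RemovalState state() {
        return state;
    }
}
```

In this sketch a coordinator failure simply loses the latch, which matches the note above that the operation must then be retried from any node.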

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0003-Additional-unit-tests-for-removeToken.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0003-Additional-unit-tests-for-removeToken.patch

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Modify-removeToken-to-be-similar-to-decommission.patch
                0002-Additional-tests-for-removeToken.patch

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0001-Modify-removeToken-to-be-similar-to-decommission.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901589#action_12901589 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

bq. (minor nit) I wish there were a way to assert that tmd.getLeavingNodes() actually has nodes in it.

This is what tmd.isLeaving() does

bq. testStartRemoving should assert preconditions before calling ss.onChange (it also makes the same assertion twice).

I'm not sure what preconditions you mean. I added an assertion to make sure there are no endpoints already leaving.

bq. SS.removeToken() shouldn't throw a RuntimeException,

Do you think the UnsupportedOperationExceptions should be removed as well? These existed previously.

I modified the callback support for streaming so that the code should wait for all streams to finish before confirming. I also added a reply to the ReplicationFinishedHandler so the IAsyncResult will be updated.  

Thoughts?

The timeout values for waiting on the latches still need to be updated.
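The "wait for all streams to finish before confirming" behaviour can be pictured with a small counting callback. This is only a sketch under assumptions: AllStreamsCallback and streamFinished are hypothetical names for illustration and are not taken from the attached patches or from Cassandra's streaming code.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs an action once every expected stream has finished. */
final class AllStreamsCallback {
    private final AtomicInteger remaining;
    private final Runnable onAllFinished;

    AllStreamsCallback(int streamCount, Runnable onAllFinished) {
        if (streamCount <= 0) {
            throw new IllegalArgumentException("streamCount must be positive");
        }
        this.remaining = new AtomicInteger(streamCount);
        this.onAllFinished = onAllFinished;
    }

    /** Called once per completed stream; only the call that reaches zero triggers the action. */
    void streamFinished() {
        if (remaining.decrementAndGet() == 0) {
            onAllFinished.run();
        }
    }

    public static void main(String[] args) {
        AllStreamsCallback callback = new AllStreamsCallback(
                2, () -> System.out.println("all streams done, confirming replication"));
        callback.streamFinished(); // first stream: nothing happens yet
        callback.streamFinished(); // second stream: the confirmation action runs
    }
}
```

The confirmation action runs exactly once, on the call that brings the counter to zero, so it does not matter which stream completes last.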


> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0001-Add-callbacks-to-streaming.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0003-Additional-unit-tests-for-removeToken.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Modify-removeToken-to-be-similar-to-decommission.patch
                0002-Fixes-to-old-tests.patch
                0003-Additional-unit-tests-for-removeToken.patch

Rebased.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885730#action_12885730 ] 

Jonathan Ellis commented on CASSANDRA-1216:
-------------------------------------------

since all tokens will be propagated to all nodes (even ones brought up after the dead node went down), that's not a problem

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903538#action_12903538 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

After some more thinking I think there are two problems here.

 * The timeout for waiting on a stream to complete - An arbitrary timeout here is not the right way to do this. What we really need is the concept of stream progress: we should be able to tell whether a stream is still progressing and, if it is not, retry it.  CASSANDRA-1438 is related to this problem and could be modified to implement this.

 * The timeout waiting for nodes to confirm replication - Ideally there would be no timeout here. The problem, though, is that if a node that should be grabbing data goes down permanently, removeToken will wait forever.  I think it's reasonable to have some sort of timeout in this case. A log message/error can indicate which machines were being waited on for replication. An administrator should know whether that machine went down or is still streaming, and that will determine whether repair needs to be run.  The alternative, I guess, would be to periodically wake up and check that the nodes we are waiting on are still alive.  That wouldn't be particularly hard to implement.

I don't think returning immediately from the call is the right approach.  That is part of the reason why this ticket was created. In the case that replication fails somewhere, there is no feedback to the user.  At least timing out eventually provides information about which machines we think failed to replicate data.

As for multiple remove calls and the coordinator going down: I think there should be a 'force' option for the case where the coordinator goes down and you believe the rest of the nodes completed the operation.  To prevent multiple calls to removeToken there should just be a check to make sure the coordinator is dead before another call can be performed.

So besides those few changes above, I think we should either implement this partway with a timeout for stream replication or postpone completion here until we add the concept of stream progress.
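To make the "stream progress" idea concrete, here is a minimal sketch. It assumes a hypothetical StreamProgress class that is told whenever bytes arrive; nothing here comes from Cassandra's streaming code or from CASSANDRA-1438. Timestamps are passed in explicitly so the stall check does not depend on the wall clock.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker: a stream counts as stalled if no bytes arrive for a threshold period. */
final class StreamProgress {
    private final long stallThresholdMillis;
    private final AtomicLong bytesReceived = new AtomicLong();
    private volatile long lastProgressMillis;

    StreamProgress(long stallThresholdMillis, long startMillis) {
        this.stallThresholdMillis = stallThresholdMillis;
        this.lastProgressMillis = startMillis;
    }

    /** Record newly received bytes; only a positive count resets the stall timer. */
    void onBytes(long count, long nowMillis) {
        if (count > 0) {
            bytesReceived.addAndGet(count);
            lastProgressMillis = nowMillis;
        }
    }

    /** True when no progress has been seen for at least the threshold. */
    boolean isStalled(long nowMillis) {
        return nowMillis - lastProgressMillis >= stallThresholdMillis;
    }

    long bytesReceived() {
        return bytesReceived.get();
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        StreamProgress progress = new StreamProgress(30_000, start);
        progress.onBytes(4096, start + 1_000);
        // A caller could poll isStalled() periodically and retry the stream when it returns true.
        System.out.println("stalled after 10s idle? " + progress.isStalled(start + 11_000));
        System.out.println("stalled after 40s idle? " + progress.isStalled(start + 41_000));
    }
}
```

With a check like this the retry decision is driven by whether data is still moving rather than by a fixed overall deadline, which is the distinction the first bullet above is drawing.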


> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Modify-removeToken-command-to-make-it-similar-to-dec.patch
                0002-Fixes-to-old-tests.patch

* 0001 - changes to make removeToken behave similarly to decommission
* 0002 - fixes to existing tests since the state for STATE_LEFT changed

I am still working on some good unit tests for these changes but these are the changes so far.

The new process for removeToken is basically the one outlined above. One change is that instead of a STATE_REMOVED state it seemed like tokens that are removed should just go into STATE_LEFT similar to nodes that are decommissioned.

One thing I'm not sure of is the timeout values for waiting for replications to stream and for waiting for replication notifications. Currently they are just set arbitrarily in that patch. Need to determine good values for these.


> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7
>
>         Attachments: 0001-Modify-removeToken-command-to-make-it-similar-to-dec.patch, 0002-Fixes-to-old-tests.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0001-Modify-removeToken-command-to-make-it-similar-to-dec.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Modify-removeToken-to-be-similar-to-decommission.patch
                0002-Fixes-to-old-tests.patch
                0003-Additional-unit-tests-for-removeToken.patch

Some fixes and tests added.

There is one thing that still needs to be fixed.
 * Currently the call to removeToken blocks until either:
 ** all nodes confirm that they have replicated the data for the dead node,
 ** or a timeout is reached.
 * I'm not sure what the timeout for this should be. Additionally, when nodes throughout the ring attempt to replicate data there should be a similar timeout before they give up on a source and retry.
 * Also, clients may time out before this timeout is reached or all the data is replicated. I'm not sure how the user will be able to determine whether the remove finished correctly or repair should be run.
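The blocking behaviour described above (wait until every node confirms, or give up after a timeout) can be sketched with a CountDownLatch. ReplicationWaiter and its methods are hypothetical names used for illustration; they are not the classes in the attached patches, and the timeout value is left to the caller because no good default has been established here.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: waits for replication confirmations and reports who is still missing. */
final class ReplicationWaiter {
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();
    private final CountDownLatch latch;

    ReplicationWaiter(Collection<String> expectedNodes) {
        outstanding.addAll(expectedNodes);
        latch = new CountDownLatch(outstanding.size());
    }

    /** Record a confirmation; duplicate or unexpected confirmations are ignored. */
    void confirm(String node) {
        if (outstanding.remove(node)) {
            latch.countDown();
        }
    }

    /**
     * Blocks until all nodes have confirmed or the timeout elapses.
     * Returns the nodes that have not confirmed yet (empty means success).
     */
    Set<String> await(long timeout, TimeUnit unit) {
        try {
            latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return new TreeSet<>(outstanding);
    }

    public static void main(String[] args) {
        ReplicationWaiter waiter = new ReplicationWaiter(Arrays.asList("nodeC", "nodeD"));
        waiter.confirm("nodeC");
        Set<String> missing = waiter.await(100, TimeUnit.MILLISECONDS);
        // A non-empty result tells the operator which machines to check before running repair.
        System.out.println("still waiting on: " + missing);
    }
}
```

Returning the set of outstanding nodes, rather than a bare boolean, gives the caller something to log so an administrator can see which machines did not confirm.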


> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0002-Additional-tests-for-removeToken.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885653#action_12885653 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

It seems like this should follow a pattern similar to decommissioning a node.

* If nodeA has removeToken called on it, it becomes responsible for nodeB, the node to remove
* nodeA sets the MOVE_STATE of nodeB to STATE_REMOVING
* This is gossiped throughout the ring.
* Nodes see this change and fetch any ranges they are becoming responsible for
** Once the fetch is complete they will need to notify nodeA somehow
* Once nodeA sees all replications have finished, it changes the state of nodeB to STATE_REMOVED
* All nodes then remove nodeB from their ring.
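The steps above can be sketched as a small state machine kept by the coordinating node. This is only an illustration under assumptions: RemovalCoordinator, its State enum and its method names are hypothetical and do not correspond to Cassandra's StorageService or gossip code, and the gossip and streaming steps are reduced to plain method calls.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of nodeA's bookkeeping while it removes nodeB. */
final class RemovalCoordinator {
    enum State { NORMAL, REMOVING, REMOVED }

    private final String removedNode;          // nodeB in the steps above
    private final Set<String> pendingReplicas; // nodes that still have to confirm
    private State state = State.NORMAL;

    RemovalCoordinator(String removedNode, Set<String> newReplicas) {
        this.removedNode = removedNode;
        this.pendingReplicas = new HashSet<>(newReplicas);
    }

    /** nodeA marks nodeB as being removed; in a real cluster this state would be gossiped. */
    synchronized void startRemoving() {
        if (state != State.NORMAL) {
            throw new IllegalStateException("removal already started for " + removedNode);
        }
        state = pendingReplicas.isEmpty() ? State.REMOVED : State.REMOVING;
    }

    /** A node reports that it has fetched the ranges it is becoming responsible for. */
    synchronized void confirmReplication(String node) {
        if (state == State.NORMAL) {
            throw new IllegalStateException("removal of " + removedNode + " has not started");
        }
        pendingReplicas.remove(node);
        if (pendingReplicas.isEmpty()) {
            state = State.REMOVED; // only now may nodeB be dropped from the ring
        }
    }

    synchronized State state() {
        return state;
    }

    public static void main(String[] args) {
        RemovalCoordinator coordinator =
                new RemovalCoordinator("nodeB", new HashSet<>(Arrays.asList("nodeC", "nodeD")));
        coordinator.startRemoving();
        coordinator.confirmReplication("nodeC");
        System.out.println("after one confirmation: " + coordinator.state());
        coordinator.confirmReplication("nodeD");
        System.out.println("after all confirmations: " + coordinator.state());
    }
}
```

The point of the sketch is the ordering: the removed node stays in an intermediate state until every node that takes over its ranges has confirmed, which is what the current removetoken behaviour lacks.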

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-1216:
-------------------------------------

    Reviewer: gdusbabek

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902509#action_12902509 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

Re: timeouts

Yes, I'm just not sure how to approach determining the right values for these.  It depends mostly on the amount of data and network bandwidth.

Re: RemoveTest

Yeah. The message sink in the test immediately responds to the stream request saying there are no files to stream.  This makes the StreamInManager think the data didn't exist remotely.  Doing it that way seems much easier than trying to make the test actually stream something.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Assigned: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey reassigned CASSANDRA-1216:
--------------------------------------

    Assignee: Nick Bailey

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898253#action_12898253 ] 

Gary Dusbabek commented on CASSANDRA-1216:
------------------------------------------

Nick, can you rebase?

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885702#action_12885702 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

A side effect of this approach may be that you would need to call removeToken on a node that had seen the token previously.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0002-Fixes-to-old-tests.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884736#action_12884736 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

So clearly that solution would fail in the case where the node that is attempting to retrieve the data fails.  Perhaps a better solution is simply not removing the node until replication is done.  Perhaps marking it with a new state?

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)


[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Add-callbacks-to-streaming.patch
                0002-Modify-removeToken-to-be-similar-to-decommission.patch
                0003-Fixes-to-old-tests.patch

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Modify-removeToken-to-be-similar-to-decommission.patch

Updated 0001 patch. It was missing a class before. Oops.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0003-Additional-unit-tests-for-removeToken.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0003-Fixes-from-review.patch

Items 1 and 2 are just errors on my part. I changed 3 because I was under the impression that a stream request to an endpoint that doesn't contain any of the requested ranges would create a header with null for pendingFiles.  At first I wrote one of the tests to behave like that and got the NPE.  It looks like that behavior either changed or never existed.

Fixed all that in a quick patch and attached it.
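
For illustration only, the null-versus-empty question discussed above can be sidestepped by normalizing the list when the header is built. This is a minimal sketch under assumptions: HeaderSketch is a hypothetical stand-in, not the real StreamHeader, and the String file names are placeholders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a stream header; NOT Cassandra's StreamHeader.
final class HeaderSketch {
    private final List<String> pendingFiles;

    HeaderSketch(List<String> pendingFiles) {
        // Normalize null to an empty list so readers never need a null check.
        this.pendingFiles = pendingFiles == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(new ArrayList<String>(pendingFiles));
    }

    // Never returns null; an endpoint with no matching ranges yields an empty list.
    List<String> pendingFiles() {
        return pendingFiles;
    }

    boolean hasPendingFiles() {
        return !pendingFiles.isEmpty();
    }
}
```

With this shape, a test for "no matching ranges" would assert on an empty list instead of expecting an NPE.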

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch, 0003-Fixes-from-review.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902503#action_12902503 ] 

Gary Dusbabek commented on CASSANDRA-1216:
------------------------------------------

I see this in RemoveTest:


    [junit] Testsuite: org.apache.cassandra.service.RemoveTest
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.97 sec
    [junit] 
    [junit] ------------- Standard Error -----------------
    [junit] ERROR 11:27:58,277 Did not find matching ranges on /127.0.0.6
    [junit] ERROR 11:27:58,279 Did not find matching ranges on /127.0.0.6
    [junit] ERROR 11:27:58,280 Did not find matching ranges on /127.0.0.5
    [junit] ERROR 11:27:58,280 Did not find matching ranges on /127.0.0.4
    [junit] ERROR 11:27:58,280 Did not find matching ranges on /127.0.0.2
    [junit] ERROR 11:27:59,264 Did not find matching ranges on /127.0.0.6
    [junit] ERROR 11:27:59,272 Did not find matching ranges on /127.0.0.6
    [junit] ERROR 11:27:59,276 Did not find matching ranges on /127.0.0.5
    [junit] ERROR 11:27:59,279 Did not find matching ranges on /127.0.0.4
    [junit] ERROR 11:27:59,283 Did not find matching ranges on /127.0.0.2
    [junit] ------------- ---------------- ---------------

Is that ok?

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913764#action_12913764 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

Ok this should be ready for review now.  The process is:

# Coordinator node modifies its own status to NORMAL - REMOVING to indicate which node is being removed
# Coordinator blocks on removal confirmation from other nodes
# Newly responsible nodes see this status and begin fetching new data
# Newly responsible nodes notify coordinator they have replicated all data
# Coordinator node updates its own status to NORMAL - REMOVED to indicate the removal is complete
# This causes all nodes to remove the node from gossip/tokenmetadata. 
# Done

Tested this with a 3 node cluster in the cloud, as well as testing the new getStatus and forceRemoval operations.
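
The removal sequence listed above can be sketched as a small state machine. This is an illustration under assumptions, not the patch itself: RemovalCoordinatorSketch and its methods are hypothetical names, nodes are plain strings, and the blocking and gossip propagation are left out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the coordinator's view of a removal; not a Cassandra class.
final class RemovalCoordinatorSketch {
    enum Status { NORMAL, REMOVING, REMOVED }

    private Status status = Status.NORMAL;
    private final Set<String> awaiting = new HashSet<String>();

    // Step 1: mark the removal as in progress and record which
    // newly responsible nodes still have to confirm.
    synchronized void startRemoval(Set<String> newlyResponsibleNodes) {
        if (status == Status.REMOVING)
            throw new IllegalStateException("removal already in progress");
        awaiting.clear();
        awaiting.addAll(newlyResponsibleNodes);
        // With nobody to wait for, the removal is complete immediately.
        status = awaiting.isEmpty() ? Status.REMOVED : Status.REMOVING;
    }

    // Step 4: a newly responsible node reports that it has replicated all data.
    synchronized void confirm(String node) {
        if (status != Status.REMOVING)
            return; // stray or late confirmation, ignore
        awaiting.remove(node);
        if (awaiting.isEmpty())
            status = Status.REMOVED; // Step 5: every node has confirmed
    }

    synchronized Status status() {
        return status;
    }

    // Nodes that have not confirmed yet; returns a copy.
    synchronized Set<String> awaitingNodes() {
        return new HashSet<String>(awaiting);
    }
}
```

In this model the status only reaches REMOVED after every newly responsible node has confirmed, which is the property the issue asks for.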

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0001-Modify-removeToken-to-be-similar-to-decommission.patch
                0002-Additional-tests-for-removeToken.patch

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885655#action_12885655 ] 

Jonathan Ellis commented on CASSANDRA-1216:
-------------------------------------------

agreed

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0001-Modify-removeToken-to-be-similar-to-decommission.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0001-Modify-removeToken-to-be-similar-to-decommission.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915227#action_12915227 ] 

Gary Dusbabek commented on CASSANDRA-1216:
------------------------------------------

This looks good.

1.  There were a few unused local variables in SS.restoreReplicaCount().  Were these just leftovers from a rebase?
2.  SS.handleStateRemoving removes a null check that previously existed for epThatLeft (renamed removeEndpoint).  Was the original null-check pointless or was something missed in the change?
3.  You made a change to StreamHeader that made me think you were running into cases where SH.pendingFiles == null.  Is that true?  Tracing the codepaths makes me think this is not possible.

Don't bother with the cleanup in 1.  I'm more curious about 2 and 3.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1216:
--------------------------------------

    Fix Version/s: 0.7
                       (was: 0.6.3)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>             Fix For: 0.7
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment: 0004-Additional-tests-for-removeToken.patch

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913634#action_12913634 ] 

Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

Bah.  Gossip marks the node alive when it receives an updated application state. Reverting it to modifying the coordinator node's state.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915742#action_12915742 ] 

Hudson commented on CASSANDRA-1216:
-----------------------------------

Integrated in Cassandra #549 (See [https://hudson.apache.org/hudson/job/Cassandra/549/])
    changes update for CASSANDRA-1216
modify removetoken so that the coordinator relies on replicating nodes for updates. patch by Nick Bailey, reviewed by Gary Dusbabek. CASSANDRA-1216


> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch, 0003-Fixes-from-review.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902468#action_12902468 ] 

Gary Dusbabek commented on CASSANDRA-1216:
------------------------------------------

> Do you think the UnsupportedOperationExceptions should be removed as well? These existed previously.
My bad; I didn't notice that.  RTE was probably ok.

> I modified the callback support for streaming so that the code should wait for all streams to finish before confirming. I also added a reply to the ReplicationFinishedHandler so the IAsyncResult will be updated
First glance tells me this will work.  I'll run some tests after I'm done reviewing.

> The timeout values for waiting on the latches still need to be updated.
Is this coming in another patch?

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0002-Modify-removeToken-to-be-similar-to-decommission.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899499#action_12899499 ] 

Gary Dusbabek commented on CASSANDRA-1216:
------------------------------------------

RemoveTest needs some cleanup.
* ReplicationSink doesn't need callCount
* NotificationSink doesn't need hitList
* testRemoveToken and testStartRemoving abuse Gossiper.start().  Consider adding a method to Gossiper that initializes the epstate for a given node.  E.g.: initializeNodeUnsafe(InetAddr addr, int generation).
* (minor nit) I wish there were a way to assert that tmd.getLeavingNodes() actually has nodes in it.
* all the methods throw UnknownHostException, but don't need to (IOException covers it)
* testStartRemoving should assert preconditions before calling ss.onChange (it also makes the same assertion twice).

StorageService:
* (minor nit) add a comment describing the distinction between the leaving and removing constants.
* SS.removeToken() shouldn't throw a RuntimeException, as the client won't know what to make of it.  Declare an exception in the interface and throw it in the impl.  I imagine this will be a fairly common case (e.g.: when a node is down).
* SS.setReplicatingNodes and clearReplicatingNodes can be inlined into removeToken. It saves a few lines and obviates a local var.
* SS.replicateTables should probably be merged into SS.restoreReplicaCount.
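
As a rough illustration of the removeToken() point above, a checked exception declared on the interface lets the caller distinguish an expected failure from a crash. Everything here is assumed for the sketch: RemoveTokenException, TokenRemover and SketchTokenRemover are hypothetical names, and "a node is down" is modeled as a plain set of unreachable addresses.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical checked exception; not part of Cassandra's API.
class RemoveTokenException extends Exception {
    RemoveTokenException(String message) {
        super(message);
    }
}

// The exception is declared on the interface so callers must handle it.
interface TokenRemover {
    void removeToken(String token) throws RemoveTokenException;
}

final class SketchTokenRemover implements TokenRemover {
    private final Set<String> unreachableNodes;
    private final Set<String> removedTokens = new HashSet<String>();

    SketchTokenRemover(Set<String> unreachableNodes) {
        this.unreachableNodes = new HashSet<String>(unreachableNodes);
    }

    @Override
    public void removeToken(String token) throws RemoveTokenException {
        // Expected operational failure: report it with a checked exception
        // instead of a RuntimeException the client cannot interpret.
        if (!unreachableNodes.isEmpty())
            throw new RemoveTokenException(
                "cannot remove token " + token + "; unreachable nodes: " + unreachableNodes);
        removedTokens.add(token);
    }

    boolean wasRemoved(String token) {
        return removedTokens.contains(token);
    }
}
```

A caller would wrap removeToken() in a try/catch for RemoveTokenException and show the message to the operator.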

Was the intent that SS.replicateTables block until the files are transferred?  Because it doesn't.  AFAICT it blocks until the first ack comes back from each source node, which is a good indication that streaming has started, but not that it is finished.

I couldn't verify that the callbacks are ever called.  That happens on the READ_RESPONSE stage and afaict, none of the streaming code path ever puts a task there.  That's a painful interface to follow though, so I might be wrong.
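
To make the ack-versus-finished distinction concrete, here is a minimal sketch of blocking until every source has finished streaming rather than until the first ack arrives. It assumes one completion callback per source node; StreamCompletionLatch, onAck and onStreamFinished are hypothetical names and not the streaming API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not a Cassandra class.
final class StreamCompletionLatch {
    private final CountDownLatch finished;

    // sourceCount: number of source nodes we expect to stream from.
    StreamCompletionLatch(int sourceCount) {
        this.finished = new CountDownLatch(sourceCount);
    }

    // An ack only shows that a source has started streaming,
    // so it deliberately does not count down the latch.
    void onAck(String source) {
        // intentionally empty
    }

    // Assumed to be invoked exactly once per source,
    // after that source's last file has been transferred.
    void onStreamFinished(String source) {
        finished.countDown();
    }

    // Returns true if all sources finished within the timeout, false otherwise.
    boolean awaitAll(long timeout, TimeUnit unit) throws InterruptedException {
        return finished.await(timeout, unit);
    }
}
```

The timeout passed to awaitAll() is the value that would still need tuning for real transfer sizes.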

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Fixes-to-old-tests.patch, 0003-Additional-unit-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0002-Additional-tests-for-removeToken.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Modify-removeToken-to-be-similar-to-decommission.patch, 0002-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Bailey updated CASSANDRA-1216:
-----------------------------------

    Attachment:     (was: 0004-Additional-tests-for-removeToken.patch)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)

[jira] Updated: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1216:
--------------------------------------

    Fix Version/s: 0.7.0
                       (was: 0.7 beta 2)

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7.0
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)
