Posted to commits@cassandra.apache.org by "Chris Burroughs (JIRA)" <ji...@apache.org> on 2013/08/21 18:22:52 UTC

[jira] [Created] (CASSANDRA-5915) node flapping prevents replace_node from succeeding consistently

Chris Burroughs created CASSANDRA-5915:
------------------------------------------

             Summary: node flapping prevents replace_node from succeeding consistently
                 Key: CASSANDRA-5915
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5915
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: 1.2.8
            Reporter: Chris Burroughs


A node was down for a week or two due to a hardware disk failure. I tried to use replace_node to bring up a new node on the same physical host with the same IPs. (rbranson suspected that reusing the same IP may be more prone to issues.) This failed with "unable to find sufficient sources for streaming range". See CASSANDRA-5913 for a problem with how gossip handled the failure.

All of the other nodes should have been up the entire time, but when this node came up it saw nodes flap up and down for quite some time. I was eventually able to get replace_token to work by adding a 60 (!) second sleep to StorageService:bootstrap. I don't know whether the right path is "why are things flapping so much?" or "bootstrap should wait until things look stable".
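The second option, having bootstrap wait until the cluster view looks stable instead of sleeping for a fixed 60 seconds, could look roughly like the following. This is a hypothetical sketch, not Cassandra code: `StabilityWait`, `awaitStable`, and the `liveNodes` supplier (a stand-in for the joining node's gossip-derived view of which endpoints are up) are invented names, and the quiet-period, poll, and timeout values are assumptions.

```java
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, NOT part of Cassandra: waits until the set of
// endpoints reported as live stops changing (no flapping) for a quiet period.
public final class StabilityWait {
    private StabilityWait() {}

    /**
     * Returns true once liveNodes has returned the same set for at least
     * quietMillis, or false if timeoutMillis elapses first.
     * liveNodes is an assumed stand-in for the gossip liveness view.
     */
    public static boolean awaitStable(Supplier<Set<String>> liveNodes,
                                      long quietMillis,
                                      long pollMillis,
                                      long timeoutMillis) throws InterruptedException {
        long start = System.nanoTime();
        Set<String> last = Set.copyOf(liveNodes.get());
        long lastChange = start;
        while (true) {
            long now = System.nanoTime();
            // No membership change for the whole quiet window: treat as stable.
            if (TimeUnit.NANOSECONDS.toMillis(now - lastChange) >= quietMillis) {
                return true;
            }
            // Bound the wait so a persistently flapping cluster still fails.
            if (TimeUnit.NANOSECONDS.toMillis(now - start) >= timeoutMillis) {
                return false;
            }
            Thread.sleep(pollMillis);
            Set<String> current = Set.copyOf(liveNodes.get());
            if (!current.equals(last)) {
                // A node flapped up or down: restart the quiet window.
                last = current;
                lastChange = System.nanoTime();
            }
        }
    }
}
```

A caller would invoke something like `awaitStable(view, 30_000, 1_000, 300_000)` before picking streaming sources. The idea is that the wait adapts to how long the flapping actually lasts, while the timeout keeps bootstrap from hanging forever; whether this is preferable to fixing the flapping itself is exactly the open question above.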

A few notes about the cluster:
 * 2 DC cluster (about 20 nodes each), using GossipingPropertyFileSnitch
 * multi-DC setup without a VPN: http://mail-archives.apache.org/mod_mbox/cassandra-user/201306.mbox/%3C51BF5C79.7020905@gmail.com%3E

Startup log from the successful (with sleep) replace_node attached.
