Posted to commits@cassandra.apache.org by "Matthew F. Dennis (JIRA)" <ji...@apache.org> on 2010/10/28 00:28:21 UTC

[jira] Created: (CASSANDRA-1670) cannot decom a node then bring it back to the cluster

cannot decom a node then bring it back to the cluster
-----------------------------------------------------

                 Key: CASSANDRA-1670
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.7 beta 2
         Environment: RAX
            Reporter: Matthew F. Dennis


Two-node cluster (node0, node1).  node0 is listed as the only seed on both nodes.  Listen addresses are explicitly set to an IP on both nodes.  No initial token, no autobootstrap (but see below).  Bring up the ring.  Everything is fine on both nodes.

Decommission node1.  Verify that the decommission completed correctly by reading the logs on both nodes.  Remove all data and logs on node1.  Bring node1 up again.

One of two things happens:

* node0 thinks it is in a ring by itself, node1 thinks both nodes are in the ring.
* both node0 and node1 think they are in rings by themselves.

If you restart node0 after the decommission, it appears to work normally.

Similar issues seem to appear if you kill node1 (either while it is autobootstrapping, before it completes, or after it is in the ring) and then run removetoken.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927025#action_12927025 ] 

Jonathan Ellis commented on CASSANDRA-1670:
-------------------------------------------

How does moving the removal out of the for loop fix the state-attached-to-it problem?

> cannot move a node
> ------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.6
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>            Assignee: Gary Dusbabek
>             Fix For: 0.6.7, 0.7.0
>
>         Attachments: 1670-0.6.txt
>
>

[jira] Commented: (CASSANDRA-1670) cannot decom a node then bring it back to the cluster

Posted by "Mike Bulman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926469#action_12926469 ] 

Mike Bulman commented on CASSANDRA-1670:
----------------------------------------

Because move is decommission+bootstrap, the same behavior occurs when moving node1 as well.

> cannot decom a node then bring it back to the cluster
> -----------------------------------------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>            Priority: Minor
>             Fix For: 0.7.1
>
>

[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927661#action_12927661 ] 

Jonathan Ellis commented on CASSANDRA-1670:
-------------------------------------------

bq. The old code will only remove a node from justRemovedEndpoints_ if it currently exists in endpointStateMap_

Isn't it "remove from jRE (justRemovedEndpoints_) if _any_ [other] node exists in eSM (endpointStateMap_)"?  Which would mean this is only a bug in 2-node clusters?

+1 if so, just trying to understand.
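
A minimal sketch of the reading above, assuming (as the question implies) that the cleanup of justRemovedEndpoints_ sits inside a loop over endpointStateMap_ that skips the local node. The class and method names are hypothetical and only model the control flow; this is not the actual Gossiper code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the loop structure under discussion; not Cassandra code. */
public class NestedPurgeReachability {

    /**
     * Returns true if a purge nested inside the per-endpoint loop would run at least once.
     * The loop skips the local endpoint, so some other endpoint must still have state.
     */
    static boolean nestedPurgeRuns(Set<String> endpointsWithState, String local) {
        for (String endpoint : endpointsWithState) {
            if (endpoint.equals(local)) {
                continue;
            }
            return true; // the nested purge would be reached here
        }
        return false;
    }

    public static void main(String[] args) {
        // Two-node cluster after node1 is decommissioned: only node0 (local) has state.
        Set<String> twoNode = new HashSet<>(Arrays.asList("node0"));
        // Three-node cluster after node1 is decommissioned: node2 still has state.
        Set<String> threeNode = new HashSet<>(Arrays.asList("node0", "node2"));

        System.out.println(nestedPurgeRuns(twoNode, "node0"));   // prints false
        System.out.println(nestedPurgeRuns(threeNode, "node0")); // prints true
    }
}
```

Under that assumption, any third node with state is enough to keep the cleanup running, which is why the problem would only show up in a two-node cluster.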

> cannot move a node
> ------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.6
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>            Assignee: Gary Dusbabek
>             Fix For: 0.6.7, 0.7.0
>
>         Attachments: 1670-0.6.txt, v1-0001-code-that-tidied-Gossiper.justRemovedEndpoints_-was-no.txt
>
>

[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Mike Bulman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927652#action_12927652 ] 

Mike Bulman commented on CASSANDRA-1670:
----------------------------------------

OK, got it working properly.  The patch fixes the issue described.

As a note, the patch doesn't apply to the 0.7 branch because justRemovedEndPoints_ (0.6) is now justRemovedEndpoints (0.7).  Not sure how you guys handle that, but the change is simple enough.


[jira] Updated: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-1670:
-------------------------------------

    Fix Version/s: 0.7.0


[jira] Updated: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-1670:
-------------------------------------

    Attachment: 1670-0.6.txt

The code that removes endpoints from Gossiper.justRemovedEndpoints after RING_DELAY was only getting called if the endpoint had a state attached to it.  Since state is removed for decommissioned nodes, the code was never getting called. 
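
A rough sketch of the change described here, for readers without the attachment at hand. It assumes the cleanup was nested in a per-endpoint loop and that the patch moves it out so it runs on every status check. All names (PurgeSketch, statusCheckOld, statusCheckFixed) and the RING_DELAY_MS value are hypothetical placeholders; this is a simplified model, not the patch itself.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of the status check; not the actual Gossiper code. */
public class PurgeSketch {
    static final long RING_DELAY_MS = 30_000; // placeholder value for illustration only

    final String local;
    final Set<String> endpointsWithState = new HashSet<>(); // stands in for the endpoint state map
    final Map<String, Long> justRemoved = new HashMap<>();  // stands in for justRemovedEndpoints

    PurgeSketch(String local) {
        this.local = local;
        endpointsWithState.add(local);
    }

    /** Decommission bookkeeping as described: state is dropped, endpoint is remembered as removed. */
    void markRemoved(String endpoint, long now) {
        endpointsWithState.remove(endpoint);
        justRemoved.put(endpoint, now);
    }

    /** Old shape: the purge only runs while visiting a non-local endpoint that has state. */
    void statusCheckOld(long now) {
        for (String endpoint : endpointsWithState) {
            if (endpoint.equals(local)) {
                continue;
            }
            purgeExpired(now);
        }
    }

    /** Patched shape: the purge runs once per check, whatever the state map contains. */
    void statusCheckFixed(long now) {
        for (String endpoint : endpointsWithState) {
            if (endpoint.equals(local)) {
                continue;
            }
            // per-endpoint liveness checks would go here
        }
        purgeExpired(now);
    }

    private void purgeExpired(long now) {
        justRemoved.values().removeIf(removedAt -> now - removedAt > RING_DELAY_MS);
    }

    public static void main(String[] args) {
        PurgeSketch node0 = new PurgeSketch("node0");
        node0.endpointsWithState.add("node1");
        node0.markRemoved("node1", 0L);
        long later = RING_DELAY_MS + 1;

        node0.statusCheckOld(later);
        System.out.println(node0.justRemoved.containsKey("node1")); // prints true: never purged

        node0.statusCheckFixed(later);
        System.out.println(node0.justRemoved.containsKey("node1")); // prints false: purged
    }
}
```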

> cannot move a node
> ------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.6
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>            Assignee: Gary Dusbabek
>             Fix For: 0.6.7
>
>         Attachments: 1670-0.6.txt
>
>

[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927034#action_12927034 ] 

Gary Dusbabek commented on CASSANDRA-1670:
------------------------------------------

When a node is decommissioned, it gets added to justRemovedEndpoints_, but removed from endpointStateMap_.  The old code will only remove a node from justRemovedEndpoints_ if it currently exists in endpointStateMap_.  If the node stays in justRemovedEndpoints_ (which it will currently), it can never be recognized as part of the ring because of the check in Gossiper.handleNewJoin().
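
To illustrate why a stale entry blocks the rejoin, here is a small sketch of the kind of guard described for Gossiper.handleNewJoin(). The class, fields, and return value are hypothetical simplifications (endpoints are plain strings, and the real method's signature is not reproduced here); only the early-exit idea is taken from the comment above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the join guard; not the actual Gossiper code. */
public class JoinCheckSketch {
    final Map<String, Long> justRemoved = new HashMap<>(); // stands in for justRemovedEndpoints_
    final Set<String> ring = new HashSet<>();               // endpoints this node considers members

    /** Returns true if the endpoint was admitted to the ring, false if the join was ignored. */
    boolean handleNewJoin(String endpoint) {
        if (justRemoved.containsKey(endpoint)) {
            return false; // still listed as just removed: the join is dropped
        }
        ring.add(endpoint);
        return true;
    }

    public static void main(String[] args) {
        JoinCheckSketch node0 = new JoinCheckSketch();
        node0.justRemoved.put("node1", 0L); // node1 was decommissioned and never purged

        System.out.println(node0.handleNewJoin("node1")); // prints false

        node0.justRemoved.remove("node1"); // what the periodic cleanup should have done
        System.out.println(node0.handleNewJoin("node1")); // prints true
    }
}
```

In this model, restarting node0 would start with an empty removed-endpoint map, which is consistent with the report that a restart of node0 makes the rejoin work.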


[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Mike Bulman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927198#action_12927198 ] 

Mike Bulman commented on CASSANDRA-1670:
----------------------------------------

Running decommission from nodetool, as well as from the ripcord decommission code (a direct call to StorageService), produces:

Exception in thread "main" java.lang.AssertionError
        at org.apache.cassandra.service.StorageService.getLocalToken(StorageService.java:1128)
        at org.apache.cassandra.service.StorageService.startLeaving(StorageService.java:1527)
        at org.apache.cassandra.service.StorageService.decommission(StorageService.java:1546)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:111)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:45)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:226)
        at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
        at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:251)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:857)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:795)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1450)
        at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:90)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1285)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1383)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:807)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
        at sun.rmi.transport.Transport$1.run(Transport.java:177)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:553)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:808)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:667)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)


[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927840#action_12927840 ] 

Gary Dusbabek commented on CASSANDRA-1670:
------------------------------------------

bq. Isn't it "remove from jRE if any [other] node exists in eSM?" Which means this is only a bug in 2-node clusters?
Yes, good observation.  I'll only bother committing this to 0.7/trunk then.


[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Mike Bulman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927463#action_12927463 ] 

Mike Bulman commented on CASSANDRA-1670:
----------------------------------------

That's all I get.  The only other thing I can add is that the node being decommissioned logs "DECOMMISSIONING" when the log level is set to DEBUG, and that the exception comes back almost immediately.  FWIW, I'm running this on r1029870 of the 0.7 branch.


[jira] Assigned: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek reassigned CASSANDRA-1670:
----------------------------------------

    Assignee: Gary Dusbabek

> cannot move a node
> ------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>            Assignee: Gary Dusbabek
>             Fix For: 0.7.0
>
>

[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927471#action_12927471 ] 

Gary Dusbabek commented on CASSANDRA-1670:
------------------------------------------

There should be more. What I'm looking for is some indication that the node is not in the middle of a bootstrap operation, which would trigger this exception.


[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Mike Bulman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927484#action_12927484 ] 

Mike Bulman commented on CASSANDRA-1670:
----------------------------------------

INFO 15:48:22,176 Joining: getting load information                                                                                                                                                                        
 INFO 15:48:22,177 Sleeping 90000 ms to wait for load information...
DEBUG 15:48:23,053 GC for ParNew: 12 ms, 15822112 reclaimed leaving 15102720 used; max is 1268449280                                                                                                                        
DEBUG 15:48:23,166 attempting to connect to /184.106.231.n0
 INFO 15:48:23,553 Node /184.106.231.n0 is now part of the cluster                                                                                                                                                         
DEBUG 15:48:23,554 Resetting pool for /184.106.231.n0
DEBUG 15:48:23,559 Node /184.106.231.n0 state normal, token 104110673354167092736227093944218730763                                                                                                                        
DEBUG 15:48:23,559 New node /184.106.231.n0 at token 104110673354167092736227093944218730763
DEBUG 15:48:23,559 clearing cached endpoints                                                                                                                                                                                
DEBUG 15:48:24,167 attempting to connect to /184.106.231.n0
DEBUG 15:48:24,169 Disseminating load info ...                                                                                                                                                                              
 INFO 15:48:24,554 InetAddress /184.106.231.n0 is now UP
 INFO 15:48:24,554 Started hinted handoff for endpoint /184.106.231.n0                                                                                                                                                  
 INFO 15:48:24,557 Finished hinted handoff of 0 rows to endpoint /184.106.231.n0
DEBUG 15:49:24,169 Disseminating load info ...
DEBUG 15:49:39,106 GC for ParNew: 16 ms, 16111512 reclaimed leaving 85537648 used; max is 1268449280
DEBUG 15:49:52,177 ... got load info
 INFO 15:49:52,177 Joining: getting bootstrap token
DEBUG 15:49:52,183 attempting to connect to /184.106.231.n0
DEBUG 15:49:52,191 Processing response on a callback from 270@/184.106.231.n0
 INFO 15:49:52,192 New token will be 19040081623932476870383442086276677899 to assume load from /184.106.231.n0
DEBUG 15:49:52,193 clearing cached endpoints
DEBUG 15:49:52,194 Will try to load mx4j now, if it's in the classpath
 INFO 15:49:52,194 Will not load MX4J, mx4j-tools.jar is not in the classpath
 INFO 15:49:52,220 Binding thrift service to /0.0.0.0:9160
 INFO 15:49:52,222 Using TFramedTransport with a max frame size of 15728640 bytes.
 INFO 15:49:52,226 Listening for thrift clients...
DEBUG 15:50:24,170 Disseminating load info ...
DEBUG 15:50:52,247 DECOMMISSIONING
DEBUG 15:51:24,170 Disseminating load info ...
DEBUG 15:52:24,171 Disseminating load info ...


From n0:

root@ripcord:/usr/src/cassandra/branches/cassandra-0.7# bin/nodetool -h 184.106.231.n0 ring
Address         Status State   Load            Token
                                       104110673354167092736227093944218730763
184.106.228.n1 Up     Normal  5.3 KB          19040081623932476870383442086276677899
184.106.231.n0 Up     Normal  10.27 KB        104110673354167092736227093944218730763

root@ripcord:/usr/src/cassandra/branches/cassandra-0.7# bin/nodetool -h 184.106.228.n1 decommission
     <TRACE FROM MY PREVIOUS COMMENT>

root@ripcord:/usr/src/cassandra/branches/cassandra-0.7# bin/nodetool -h 184.106.231.n0 ring
Address         Status State   Load            Token
                                       104110673354167092736227093944218730763
184.106.228.n1 Up     Normal  5.3 KB          19040081623932476870383442086276677899
184.106.231.n0 Up     Normal  10.27 KB        104110673354167092736227093944218730763
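
The second ring listing is the symptom: n1 is still reported as Up/Normal after the decommission was issued. A minimal sketch of how that check could be scripted, assuming the column layout shown in the output above (it may differ in other versions); parse_ring and still_in_ring are hypothetical helper names, not part of Cassandra or nodetool, and rows whose load is not a number plus a unit are simply skipped.

```python
import re

# Matches one data row of the ring output as it appears in this thread:
# address, status, state, load (number + unit), token.
# This layout is an assumption taken from the listing above.
ROW = re.compile(r"^(\S+)\s+(Up|Down)\s+(\S+)\s+([\d.]+ \S+)\s+(\d+)\s*$")


def parse_ring(text):
    """Return {address: (status, state, token)} for each data row."""
    ring = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            addr, status, state, _load, token = m.groups()
            ring[addr] = (status, state, token)
    return ring


def still_in_ring(text, address):
    """True if `address` is still listed with state Normal."""
    entry = parse_ring(text).get(address)
    return entry is not None and entry[1] == "Normal"


# Ring output captured after the decommission attempt (copied from above).
SAMPLE = """\
Address         Status State   Load            Token
                                       104110673354167092736227093944218730763
184.106.228.n1 Up     Normal  5.3 KB          19040081623932476870383442086276677899
184.106.231.n0 Up     Normal  10.27 KB        104110673354167092736227093944218730763
"""

if __name__ == "__main__":
    # n1 should have left the ring; here it is still reported as Normal.
    print(still_in_ring(SAMPLE, "184.106.228.n1"))
```

Feeding it the ring output captured before and after the decommission makes the failure explicit: a node that left cleanly would no longer be listed as Normal.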


> cannot move a node
> ------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.6
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>            Assignee: Gary Dusbabek
>             Fix For: 0.6.7, 0.7.0
>
>         Attachments: 1670-0.6.txt, v1-0001-code-that-tidied-Gossiper.justRemovedEndpoints_-was-no.txt
>
>
> two node cluster (node0, node1).  node0 is listed as the only seed on both nodes.  Listen addresses explicitly set to an IP on both nodes. No initial token, no autobootstrap (but see below).  Bring up the ring.  Everything is fine on both nodes.
> decom node1.  verify decom completed correctly by reading the logs on both nodes.  rm all data/logs on node1.  bring node1 up again.
> One of two things happen:
> * node0 thinks it is in a ring by itself, node1 thinks both nodes are in the ring.
> * both node0 and node1 think they are in rings by themselves
> If you restart node0 after decom, it appears to work normally.
> Similar issues seem to present if you kill node1 (either when autobootstrapping before it completes or after it is in the ring) and removetoken.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927422#action_12927422 ] 

Gary Dusbabek commented on CASSANDRA-1670:
------------------------------------------

Mike, can you provide a more complete log?  That trace is unrelated to the patch and likely indicates a different problem.


[jira] Updated: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-1670:
-------------------------------------

    Attachment: v1-0001-code-that-tidied-Gossiper.justRemovedEndpoints_-was-no.txt


[jira] Updated: (CASSANDRA-1670) cannot decom a node then bring it back to the cluster

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1670:
--------------------------------------

             Priority: Minor  (was: Major)
    Affects Version/s:     (was: 0.7 beta 2)
        Fix Version/s: 0.7.1

> cannot decom a node then bring it back to the cluster
> -----------------------------------------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>            Priority: Minor
>             Fix For: 0.7.1
>
>
> two node cluster (node0, node1).  node0 is listed as the only seed on both nodes.  Listen addresses explicitly set to an IP on both nodes. No initial token, no autobootstrap (but see below).  Bring up the ring.  Everything is fine on both nodes.
> decom node1.  verify decom completed correctly by reading the logs on both nodes.  rm all data/logs on node1.  bring node1 up again.
> One of two things happen:
> * node0 thinks it is in a ring by itself, node1 thinks both nodes are in the ring.
> * both node0 and node1 think they are in rings by themselves
> If you restart node0 after decom, it appears to work normally.
> Similar issues seem to present if you kill node1 (either when autobootstrapping before it completes or after it is in the ring) and removetoken.


[jira] Updated: (CASSANDRA-1670) cannot move a node

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1670:
--------------------------------------

      Component/s: Core
         Priority: Major  (was: Minor)
    Fix Version/s:     (was: 0.7.1)
                   0.7.0
          Summary: cannot move a node  (was: cannot decom a node then bring it back to the cluster)

> cannot move a node
> ------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>             Fix For: 0.7.0
>
>
> two node cluster (node0, node1).  node0 is listed as the only seed on both nodes.  Listen addresses explicitly set to an IP on both nodes. No initial token, no autobootstrap (but see below).  Bring up the ring.  Everything is fine on both nodes.
> decom node1.  verify decom completed correctly by reading the logs on both nodes.  rm all data/logs on node1.  bring node1 up again.
> One of two things happen:
> * node0 thinks it is in a ring by itself, node1 thinks both nodes are in the ring.
> * both node0 and node1 think they are in rings by themselves
> If you restart node0 after decom, it appears to work normally.
> Similar issues seem to present if you kill node1 (either when autobootstrapping before it completes or after it is in the ring) and removetoken.


[jira] Updated: (CASSANDRA-1670) cannot move a node

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-1670:
-------------------------------------

    Affects Version/s: 0.6.6
        Fix Version/s:     (was: 0.7.0)
                       0.6.7

Looks like this bug goes all the way back to 0.6.

> cannot move a node
> ------------------
>
>                 Key: CASSANDRA-1670
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1670
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.6
>         Environment: RAX
>            Reporter: Matthew F. Dennis
>            Assignee: Gary Dusbabek
>             Fix For: 0.6.7
>
>
> two node cluster (node0, node1).  node0 is listed as the only seed on both nodes.  Listen addresses explicitly set to an IP on both nodes. No initial token, no autobootstrap (but see below).  Bring up the ring.  Everything is fine on both nodes.
> decom node1.  verify decom completed correctly by reading the logs on both nodes.  rm all data/logs on node1.  bring node1 up again.
> One of two things happen:
> * node0 thinks it is in a ring by itself, node1 thinks both nodes are in the ring.
> * both node0 and node1 think they are in rings by themselves
> If you restart node0 after decom, it appears to work normally.
> Similar issues seem to present if you kill node1 (either when autobootstrapping before it completes or after it is in the ring) and removetoken.
