Posted to user@storm.apache.org by "Mitchell Rathbun (BLOOMBERG/ 731 LEX)" <mr...@bloomberg.net> on 2019/09/23 16:46:19 UTC

Leader Election issues on cluster restart

We are currently running a Storm cluster on one machine, so there is a single Nimbus/Supervisor instance in the cluster. We have recently had issues where Nimbus was started but was unable to become leader, even though no other Nimbus instances were running at the time. The cluster appeared to shut down cleanly:

2019-09-21 22:12:47,518 INFO  nimbus [Thread-7] Shutting down master
2019-09-21 22:12:47,520 INFO  CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,527 INFO  ZooKeeper [Thread-7] Session: 0x30000223e30079a closed
2019-09-21 22:12:47,527 INFO  ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,528 INFO  CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,533 INFO  ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,533 INFO  ZooKeeper [Thread-7] Session: 0x30000223e30079b closed
2019-09-21 22:12:47,534 INFO  CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,539 INFO  ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,539 INFO  ZooKeeper [Thread-7] Session: 0x30000223e300798 closed
2019-09-21 22:12:47,539 INFO  nimbus [Thread-7] Shut down master


The cluster was then brought back up about 20 minutes later. As soon as it came up, we immediately started seeing:

2019-09-21 22:32:47,082 INFO  JmxPreparableReporter [main] Preparing...
2019-09-21 22:32:47,098 INFO  common [main] Started statistics report plugin...
2019-09-21 22:32:47,140 INFO  nimbus [main] Starting nimbus server for storm version '1.2.1'
2019-09-21 22:32:47,219 INFO  PlainSaslTransportPlugin [main] SASL PLAIN transport factory will be used
2019-09-21 22:32:47,858 INFO  nimbus [timer] not a leader, skipping assignments
2019-09-21 22:32:47,858 INFO  nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:32:47,860 INFO  nimbus [timer] not a leader, skipping credential renewal.
2019-09-21 22:32:49,134 INFO  AbstractSaslServerCallbackHandler [pool-14-thread-1] Successfully authenticated client: authenticationID = op authorizationID = op
2019-09-21 22:32:49,171 INFO  AbstractSaslServerCallbackHandler [pool-14-thread-2] Successfully authenticated client: authenticationID = op authorizationID = op
2019-09-21 22:32:57,858 INFO  nimbus [timer] not a leader, skipping assignments
2019-09-21 22:32:57,859 INFO  nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:33:07,860 INFO  nimbus [timer] not a leader, skipping assignments
2019-09-21 22:33:07,860 INFO  nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:33:17,862 INFO  nimbus [timer] not a leader, skipping assignments

followed shortly by:

2019-09-21 22:33:52,409 WARN  nimbus [pool-14-thread-7] Topology submission exception. (topology name='WingmanTopology4159') #error {
:cause not a leader, current leader is NimbusInfo{host='trslnydtraap01', port=30553, isLeader=true}
:via
 [{:type java.lang.RuntimeException
   :message not a leader, current leader is NimbusInfo{host='trslnydtraap01', port=30553, isLeader=true}
   :at [org.apache.storm.daemon.nimbus$is_leader doInvoke nimbus.clj 150]}]
:trace


What could cause this election issue? Since no other leader processes were running or known in the cluster, I am assuming that some cluster state was not cleaned up correctly, either in ZooKeeper or on disk. In general, how does Storm record which node is the leader of a cluster?
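
In case it helps, here is a minimal sketch of how I have been thinking of inspecting the leader lock from Java with Curator. It assumes the default storm.zookeeper.root of "/storm" and a "leader-lock" child znode, which is what I understand Storm's LeaderLatch-based elector uses in 1.x; the connect string and path would need to be adjusted to match the actual cluster configuration:

import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLockInspector {
    public static void main(String[] args) throws Exception {
        // Connect string for the ZooKeeper ensemble backing the Storm cluster.
        String connectString = args.length > 0 ? args[0] : "localhost:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString, new ExponentialBackoffRetry(1000, 3));
        try {
            client.start();
            client.blockUntilConnected();

            // Assumed path: storm.zookeeper.root ("/storm" by default) plus the
            // leader-lock node used for Nimbus leader election.
            String lockPath = "/storm/leader-lock";

            // Each child is an ephemeral sequential node created by a LeaderLatch
            // participant; the lowest sequence number holds leadership. A latch
            // node whose owning ZooKeeper session has not yet expired would block
            // a freshly restarted Nimbus from winning the election.
            List<String> latchNodes = client.getChildren().forPath(lockPath);
            for (String node : latchNodes) {
                byte[] data = client.getData().forPath(lockPath + "/" + node);
                System.out.println(node + " -> " + new String(data, StandardCharsets.UTF_8));
            }
        } finally {
            client.close();
        }
    }
}

If a latch node from the previous Nimbus session were still present after the restart, that would seem to line up with the stale NimbusInfo reported in the exception above, but I would appreciate confirmation of whether that is actually how leadership is tracked.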