You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@curator.apache.org by "Bowen Zhou (JIRA)" <ji...@apache.org> on 2016/08/30 18:32:20 UTC

[jira] [Created] (CURATOR-346) LeaderSelector getParticipants reports incorrect number of nodes after connection is suspended

Bowen Zhou created CURATOR-346:
----------------------------------

             Summary: LeaderSelector getParticipants reports incorrect number of nodes after connection is suspended
                 Key: CURATOR-346
                 URL: https://issues.apache.org/jira/browse/CURATOR-346
             Project: Apache Curator
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Bowen Zhou
            Assignee: Jordan Zimmerman


This is potentially a bug in Curator 2.7.1 where LeaderSelector's
getParticipants() reports one less participants than the actual number. On the
node missing from the result of getParticipants(), I found something strange in
the logs below: ClientCnxn reset the zookeeper connection, then the SUSPENDED
state was delivered and the leadership was interrupted and the participant
recreated a lock node and called Object.wait() to sleep and wait for its turn
to become the leader. However, the RECONNECTED state was not delivered until
ConnectionState.checkTimeouts() reset the zookeeper connection again, which
also had the side effect of deleting the ephemeral lock znode created earlier
by the leader selection participant. Now, we are missing the participant
permanently since it is sleeping in wait() and watching only the previous lock
node.

Callstack
{noformat}
“Curator-LeaderSelector-0" #319 daemon prio=5 os_prio=0 tid=0x00007f892408f800 nid=0x491f in Object.wait() [0x00007f88f4eb2000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.curator.framework.recipes.locks.LockInternals.internalLockLoop(LockInternals.java:307)
        - locked <0x000000059ee53af0> (a org.apache.curator.framework.recipes.locks.LockInternals)
        at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:217)
        at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:232)
        at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:386)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:443)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:64)
        at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:245)
        at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:239)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

Log
{noformat}
2016/08/29 16:46:49.478 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1255.prod.acompany.com:12916)] [searcher] [] Client session timed out, have not heard from server in 26680ms for sessionid 0x255531b1ff4732e, closing socket connection and attempting reconnect
2016/08/29 16:46:49.796 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1215.prod.acompany.com:12916)] [searcher] [] Opening socket connection to server lva1-app1215.prod.acompany.com/10.132.48.90:12916. Will not attempt to authenticate using SASL (unknown error)
2016/08/29 16:46:49.797 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1215.prod.acompany.com:12916)] [searcher] [] Socket connection established to lva1-app1215.prod.acompany.com/10.132.48.90:12916, initiating session
2016/08/29 16:46:49.800 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1215.prod.acompany.com:12916)] [searcher] [] Session establishment complete on server lva1-app1215.prod.acompany.com/10.132.48.90:12916, sessionid = 0x255531b1ff4732e, negotiated timeout = 40000
2016/08/29 16:46:49.801 INFO [ConnectionStateManager] [SomeService STARTING-EventThread] [searcher] [] State change: SUSPENDED
2016/08/29 16:46:49.803 INFO [MinReplicasSemaphore] [Curator-LeaderSelector-0] [searcher] [] I'm no longer the leader .. currently there are 1 permits
2016/08/29 16:47:04.809 ERROR [ConnectionState] [SomeService STARTING-EventThread] [searcher] [] Connection timed out for connection string (lva1-app1215.prod.acompany.com:12916,lva1-app1255.prod.acompany.com:12916,lva1-app1295.prod.acompany.com:12916,lva1-app1335.prod.acompany.com:12916,lva1-app1375.prod.acompany.com:12916) and timeout (15000) / elapsed (15002)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:288) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.readValue(SharedValue.java:244) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.access$100(SharedValue.java:44) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue$1.process(SharedValue.java:61) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67) ~[curator-framework-2.7.1.jar:?]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
2016/08/29 16:47:20.812 ERROR [ConnectionState] [SomeService STARTING-EventThread] [searcher] [] Connection timed out for connection string (lva1-app1215.prod.acompany.com:12916,lva1-app1255.prod.acompany.com:12916,lva1-app1295.prod.acompany.com:12916,lva1-app1335.prod.acompany.com:12916,lva1-app1375.prod.acompany.com:12916) and timeout (15000) / elapsed (31006)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:288) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.readValue(SharedValue.java:244) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.access$100(SharedValue.java:44) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue$1.process(SharedValue.java:61) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67) ~[curator-framework-2.7.1.jar:?]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
2016/08/29 16:47:36.815 ERROR [ConnectionState] [SomeService STARTING-EventThread] [searcher] [] Connection timed out for connection string (lva1-app1215.prod.acompany.com:12916,lva1-app1255.prod.acompany.com:12916,lva1-app1295.prod.acompany.com:12916,lva1-app1335.prod.acompany.com:12916,lva1-app1375.prod.acompany.com:12916) and timeout (15000) / elapsed (47008)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:288) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.readValue(SharedValue.java:244) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.access$100(SharedValue.java:44) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue$1.process(SharedValue.java:61) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67) ~[curator-framework-2.7.1.jar:?]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
2016/08/29 16:47:53.817 WARN [ConnectionState] [SomeService STARTING-EventThread] [searcher] [] Connection attempt unsuccessful after 64011 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
2016/08/29 16:47:53.819 INFO [ZooKeeper] [SomeService STARTING-EventThread] [searcher] [] Session: 0x255531b1ff4732e closed
2016/08/29 16:47:53.819 INFO [ZooKeeper] [SomeService STARTING-EventThread] [searcher] [] Initiating client connection, connectString=lva1-app1215.prod.acompany.com:12916,lva1-app1255.prod.acompany.com:12916,lva1-app1295.prod.acompany.com:12916,lva1-app1335.prod.acompany.com:12916,lva1-app1375.prod.acompany.com:12916 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@2eee547b
2016/08/29 16:47:53.824 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1295.prod.acompany.com:12916)] [searcher] [] Opening socket connection to server lva1-app1295.prod.acompany.com/10.132.48.218:12916. Will not attempt to authenticate using SASL (unknown error)
2016/08/29 16:47:53.824 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1295.prod.acompany.com:12916)] [searcher] [] Socket connection established to lva1-app1295.prod.acompany.com/10.132.48.218:12916, initiating session
2016/08/29 16:47:53.827 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1295.prod.acompany.com:12916)] [searcher] [] Session establishment complete on server lva1-app1295.prod.acompany.com/10.132.48.218:12916, sessionid = 0x355531abd0978d1, negotiated timeout = 40000
2016/08/29 16:47:53.827 INFO [ConnectionStateManager] [SomeService STARTING-EventThread] [searcher] [] State change: RECONNECTED
2016/08/29 16:47:53.827 INFO [ClientCnxn] [SomeService STARTING-EventThread] [searcher] [] EventThread shut down
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)