Posted to dev@zookeeper.apache.org by "Dave Latham (JIRA)" <ji...@apache.org> on 2013/07/16 20:00:55 UTC

[jira] [Created] (ZOOKEEPER-1731) Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock

Dave Latham created ZOOKEEPER-1731:
--------------------------------------

             Summary: Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
                 Key: ZOOKEEPER-1731
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1731
             Project: ZooKeeper
          Issue Type: Bug
            Reporter: Dave Latham
            Priority: Critical
             Fix For: 3.4.6


We had a cluster of 3 peers (running 3.4.3) fail after we took down 1 peer briefly for maintenance.  A second peer became unresponsive and the leader lost quorum.  Thread dumps on the second peer showed two threads consistently stuck in these states:

{noformat}
"QuorumPeer[myid=0]/0.0.0.0:2181" prio=10 tid=0x00002aaab8d20800 nid=0x598a runnable [0x000000004335d000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.put(HashMap.java:405)
        at org.apache.zookeeper.server.ServerCnxnFactory.registerConnection(ServerCnxnFactory.java:131)
        at org.apache.zookeeper.server.ZooKeeperServer.finishSessionInit(ZooKeeperServer.java:572)
        at org.apache.zookeeper.server.quorum.Learner.revalidate(Learner.java:444)
        at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:133)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:86)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)


"NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181" daemon prio=10 tid=0x00002aaab84b0800 nid=0x5986 runnable [0x0000000040878000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.removeEntryForKey(HashMap.java:614)
        at java.util.HashMap.remove(HashMap.java:581)
        at org.apache.zookeeper.server.ServerCnxnFactory.unregisterConnection(ServerCnxnFactory.java:120)
        at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:971)
        - locked <0x000000078d8a51f0> (a java.util.HashSet)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.closeSessionWithoutWakeup(NIOServerCnxnFactory.java:307)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.closeSession(NIOServerCnxnFactory.java:294)
        - locked <0x000000078d82c750> (a org.apache.zookeeper.server.NIOServerCnxnFactory)
        at org.apache.zookeeper.server.ZooKeeperServer.processConnectRequest(ZooKeeperServer.java:834)
        at org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:410)
        at org.apache.zookeeper.server.NIOServerCnxn.readPayload(NIOServerCnxn.java:200)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:236)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Thread.java:662)
{noformat}

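The dumps show both threads concurrently modifying ServerCnxnFactory.connectionBeans, which is a plain java.util.HashMap that the two code paths do not guard with a common lock. Both threads are RUNNABLE rather than blocked, so this looks less like a lock-ordering deadlock and more like the threads spinning inside a HashMap whose internal bucket chain was corrupted by the unsynchronized writes.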

This cluster was serving thousands of clients, which probably makes the condition more likely: it appears to occur when one client connects and another disconnects at about the same time.
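
One possible direction for a fix, sketched below under assumptions: back the map with a java.util.concurrent.ConcurrentHashMap so that a register and an unregister arriving from different threads cannot corrupt it. ConnectionRegistry, register, unregister and size are hypothetical names used for illustration only; this is not the ZooKeeper source and not necessarily the change that will be committed for this issue.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical stand-in for the connection-to-bean bookkeeping.
 * Not ZooKeeper code; the names here are assumptions for illustration.
 */
final class ConnectionRegistry {
    // ConcurrentHashMap supports put/remove from multiple threads
    // without external locking, unlike java.util.HashMap.
    private final ConcurrentMap<String, String> connectionBeans =
            new ConcurrentHashMap<>();

    void register(String cnxnId, String beanName) {
        connectionBeans.put(cnxnId, beanName);
    }

    void unregister(String cnxnId) {
        connectionBeans.remove(cnxnId);
    }

    int size() {
        return connectionBeans.size();
    }

    public static void main(String[] args) throws InterruptedException {
        final ConnectionRegistry registry = new ConnectionRegistry();
        final int n = 100_000;

        // Pre-register the connections that the "closer" thread will remove.
        for (int i = 0; i < n; i++) {
            registry.register("old-" + i, "bean");
        }

        // One thread plays the role of clients connecting ...
        Thread connector = new Thread(() -> {
            for (int i = 0; i < n; i++) {
                registry.register("new-" + i, "bean");
            }
        });
        // ... while another plays the role of clients disconnecting.
        Thread closer = new Thread(() -> {
            for (int i = 0; i < n; i++) {
                registry.unregister("old-" + i);
            }
        });

        connector.start();
        closer.start();
        connector.join();
        closer.join();

        // All old entries removed, all new entries present.
        System.out.println(registry.size()); // prints 100000
    }
}
```

Wrapping the existing HashMap accesses in a shared synchronized block would be another option; the concurrent map is shown here only because it avoids adding a lock to the connect and close paths.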
