You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Michael Han (JIRA)" <ji...@apache.org> on 2017/03/13 16:22:03 UTC
[jira] [Updated] (ZOOKEEPER-1005) Zookeeper servers fail to elect a
leader succesfully.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Han updated ZOOKEEPER-1005:
-----------------------------------
Fix Version/s: (was: 3.5.3)
3.5.4
> Zookeeper servers fail to elect a leader succesfully.
> -----------------------------------------------------
>
> Key: ZOOKEEPER-1005
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1005
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.2.2
> Environment: zookeeper-3.2.2; debian
> Reporter: Alexandre Hardy
> Fix For: 3.5.4, 3.6.0
>
>
> We were running 3 zookeeper servers, and simulated a failure on one of the servers.
> The one zookeeper node follows the other, but has trouble connecting. It looks like the following exception is the cause:
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- [org.apache.zookeeper.server.ZooKeeperServer] Created server
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.Follower] Following zookeeper3/192.168.131.11:2888
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING java.net.ConnectException: -- Connection refused
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.PlainSocketImpl.socketConnect(Native Method)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> {noformat}
> The last exception while connecting was:
> {noformat}
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] -- [org.apache.zookeeper.server.quorum.Follower] Unexpected exception
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: -- Connection refused
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.PlainSocketImpl.socketConnect(Native Method)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.Follower] Exception when following the leader
> {noformat}
> The leader started leading a bit later
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0, 94489312534, 25, 2, LOOKING, LOOKING, 0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumPeer] LEADING
> {noformat}
> But at that time the follower had already terminated and started a new election, so the leader failed:
> {noformat}
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.ZooKeeperServer] Created server
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot /var/lib/zookeeper/version-2/snapshot.1600007d16
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting: 1600007d16
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification.
> 2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification.
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification.
> 2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification.
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.Leader] Shutdown called
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
> {noformat}
> From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:
> {quote}
> The new entry, initLimit is timeouts ZooKeeper uses to limit the length of time the ZooKeeper servers in quorum have to connect to a leader
> {quote}
> Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a zookeeper server to contact the leader.
> However, in the source code src/java/main/org/apache/zookeeper/server/quorum/Follower.java:
> {noformat}
> 152 for (int tries = 0; tries < 5; tries++) {
> 153 try {
> 154 //sock = new Socket();
> 155 //sock.setSoTimeout(self.tickTime * self.initLimit);
> 156 sock.connect(addr, self.tickTime * self.syncLimit);
> 157 sock.setTcpNoDelay(nodelay);
> 158 break;
> 159 } catch (IOException e) {
> 160 if (tries == 4) {
> 161 LOG.error("Unexpected exception",e);
> 162 throw e;
> 163 } else {
> 164 LOG.warn("Unexpected exception, tries="+tries,e);
> 165 sock = new Socket();
> 166 sock.setSoTimeout(self.tickTime * self.initLimit);
> 167 }
> 168 }
> 169 Thread.sleep(1000);
> 170 }
> {noformat}
> It appears as if we only have 4 seconds to contact the leader. The timeouts are applied to the socket, but do not take into account that the zookeeper leader may not have started its zookeeper service yet.
> Is this the expected behaviour? Or is the expected behaviour that followers should always be able to connect to the leader?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)