You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Jiafu Jiang (JIRA)" <ji...@apache.org> on 2018/03/01 01:12:00 UTC
[jira] [Commented] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381347#comment-16381347 ] 

Jiafu Jiang commented on ZOOKEEPER-2930:
----------------------------------------

I hope this PB can be fix in version 3.4.X, since 3.4.X is the stable version.

> Leader cannot be elected due to network timeout of some members.
> ----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2930
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.4.10, 3.5.3, 3.4.11, 3.5.4, 3.4.12
>         Environment: Java 8
> ZooKeeper 3.4.11(from github)
> Centos6.5
>            Reporter: Jiafu Jiang
>            Priority: Critical
>         Attachments: zoo.cfg, zookeeper1.log, zookeeper2.log
>
>
> I deploy a cluster of ZooKeeper with three nodes:
> ofs_zk1:20.10.11.101, 30.10.11.101
> ofs_zk2:20.10.11.102, 30.10.11.102
> ofs_zk3:20.10.11.103, 30.10.11.103
> I shutdown the network interfaces of ofs_zk2 using "ifdown eth0 eth1" command.
> It is supposed that the new Leader should be elected in some seconds, but the fact is, ofs_zk1 and ofs_zk3 just keep electing again and again, but none of them can become the new Leader.
> I change the log level to DEBUG (the default is INFO), and restart zookeeper servers on ofs_zk1 and ofs_zk2 again, but it can not fix the problem.
> I read the log and the ZooKeeper source code, and I think I find the reason.
> When the potential leader(says ofs_zk3) begins the election(FastLeaderElection.lookForLeader()), it will send notifications to all the servers. 
> When it fails to receive any notification during a timeout, it will resend the notifications, and double the timeout. This process will repeat until any notification is received or the timeout reaches a max value.
> The FastLeaderElection.sendNotifications() just put the notification message into a queue and return. The WorkerSender is responsable to send the notifications.
> The WorkerSender just process the notifications one by one by passing the notifications to QuorumCnxManager. Here comes the problem, the QuorumCnxManager.toSend() blocks for a long time when the notification is send to ofs_zk2(whose network is down) and some notifications (which belongs to ofs_zk1) will thus be blocked for a long time. The repeated notifications by FastLeaderElection.sendNotifications() just make things worse.
> Here is the related source code:
> {code:java}
>     public void toSend(Long sid, ByteBuffer b) {
>         /*
>          * If sending message to myself, then simply enqueue it (loopback).
>          */
>         if (this.mySid == sid) {
>              b.position(0);
>              addToRecvQueue(new Message(b.duplicate(), sid));
>             /*
>              * Otherwise send to the corresponding thread to send.
>              */
>         } else {
>              /*
>               * Start a new connection if doesn't have one already.
>               */
>              ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
>              ArrayBlockingQueue<ByteBuffer> bqExisting = queueSendMap.putIfAbsent(sid, bq);
>              if (bqExisting != null) {
>                  addToSendQueue(bqExisting, b);
>              } else {
>                  addToSendQueue(bq, b);
>              }
>              
>              // This may block!!!
>              connectOne(sid);
>                 
>         }
>     }
> {code}
> Therefore, when ofs_zk3 believes that it is the leader, it begins to wait the epoch ack, but in fact the ofs_zk1 does not receive the notification(which says the leader is ofs_zk3) because the ofs_zk3 has not sent the notification(which may still exist in the sendqueue of WorkerSender). At last, the potential leader ofs_zk3 fails to receive the epoch ack in timeout, so it quits the leader and begins a new election. 
> The log files of ofs_zk1 and ofs_zk3 are attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)