Posted to issues@zookeeper.apache.org by "Mate Szalay-Beko (Jira)" <ji...@apache.org> on 2020/03/13 11:18:00 UTC

[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058642#comment-17058642 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3756:
---------------------------------------------

Hello!

I worked on this part of the code recently and am happy to take a look at your case.

The log file you sent looks more or less OK. In ZooKeeper, the servers communicate with each other for the election protocol over the election port (3888 in your config). When a server starts, it tries to connect to the election port of every other server and asks each of them for its ID. But only those channels are kept which were initiated by the server with the higher ID. This is why you see the following message, which is completely normal:

{code:java}
2020-03-11 20:23:35,733 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2)
{code}
The SendWorker thread towards server 3 also gets interrupted for the same reason. That is OK as well.
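
To make the rule above concrete, here is a rough sketch of the tie-break between two servers. It is only an illustration of the behaviour described in this comment, not the actual QuorumCnxManager code; the class and method names (ElectionTieBreak, connectionSurvives) are made up for this example.

```java
// Hypothetical illustration of the election-connection tie-break.
// This is NOT ZooKeeper's implementation; names are invented for the sketch.
final class ElectionTieBreak {

    /**
     * Returns true if an election connection opened by initiatorId towards
     * acceptorId is kept. Per the rule described above, only connections
     * initiated by the server with the higher ID survive.
     */
    static boolean connectionSurvives(long initiatorId, long acceptorId) {
        return initiatorId > acceptorId;
    }

    /** Builds a short human-readable description of the outcome. */
    static String describe(long initiatorId, long acceptorId) {
        String outcome = connectionSurvives(initiatorId, acceptorId)
                ? "kept"
                : "dropped (the higher ID is expected to connect back)";
        return "connection " + initiatorId + " -> " + acceptorId + ": " + outcome;
    }

    public static void main(String[] args) {
        long myId = 2; // the restarted server from the log
        long[] peers = {1, 3, 4, 5};
        for (long peer : peers) {
            System.out.println(describe(myId, peer));
        }
    }
}
```

With myId = 2, the sketch drops the outgoing connections towards servers 3, 4 and 5, which corresponds to the three "Have smaller server identifier" lines in your log.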


The only strange thing I noticed in your logs is in these lines:

{code:java}
2020-03-11 20:23:35,734 [myid:2] - INFO  [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36140

2020-03-11 20:23:35,740 [myid:2] - INFO  [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36142
{code}

It looks like you got two connection requests from the IP 100.126.116.201. This IP is not in your config.

Are you sure that all ZK nodes are using the config you sent? Can you check where this 100.126.116.201 comes from?
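
If it helps with checking the other nodes, below is a small sketch that extracts the hosts from the server.N lines of a config and tests whether the remote address from a "Received connection request" log line is one of them. It is a standalone helper written for this comment, not something shipped with ZooKeeper; the names (PeerAddressCheck, configuredHosts, isConfigured) are hypothetical, and it assumes plain IPv4 addresses or hostnames in the host:port:port form shown in your config.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of ZooKeeper. Assumes IPv4/hostname entries.
final class PeerAddressCheck {

    // The server list from the issue description.
    static final String SAMPLE_CONFIG =
            "server.1=100.71.255.254:2888:3888:participant;2181\n"
          + "server.2=100.71.255.253:2888:3888:participant;2181\n"
          + "server.3=100.71.255.252:2888:3888:participant;2181\n"
          + "server.4=100.71.255.251:2888:3888:participant;2181\n"
          + "server.5=100.71.255.250:2888:3888:participant;2181\n";

    /** Collects the host part of every "server.N=host:..." line. */
    static Set<String> configuredHosts(String config) {
        Set<String> hosts = new LinkedHashSet<>();
        for (String line : config.split("\\R")) {
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');
            if (!trimmed.startsWith("server.") || eq < 0) {
                continue; // not a server definition
            }
            String value = trimmed.substring(eq + 1);
            int colon = value.indexOf(':');
            hosts.add(colon < 0 ? value : value.substring(0, colon));
        }
        return hosts;
    }

    /** remote is "ip:port" as printed in the "Received connection request" line. */
    static boolean isConfigured(String config, String remote) {
        int colon = remote.lastIndexOf(':');
        String host = colon < 0 ? remote : remote.substring(0, colon);
        return configuredHosts(config).contains(host);
    }

    public static void main(String[] args) {
        String remote = "100.126.116.201:36140"; // taken from the log above
        System.out.println(remote + " configured: " + isConfigured(SAMPLE_CONFIG, remote));
    }
}
```

The same check can be run against the config file of each node to see whether any of them lists 100.126.116.201.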

There is a known bug which could explain your situation, but that would happen only if you use 0.0.0.0 in your configs.

Are you perhaps running in a dockerized environment?

Can you share the ZooKeeper configs and server logs from all 5 nodes?

> Members failing to rejoin quorum
> --------------------------------
>
>                 Key: ZOOKEEPER-3756
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: leaderElection
>    Affects Versions: 3.5.6, 3.5.7
>            Reporter: Dai Shi
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in their configuration: 
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO  [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO  [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] - New election. My id =  2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO  [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting for message on queue
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker{code}
> The only way I can seem to get them to rejoin the quorum is to restart the leader.
> However, if I remove server 4 and 5 from the configuration of server 1 or 2 (so only servers 1, 2, and 3 remain in the configuration file), then they can rejoin the quorum fine. Is this expected and am I doing something wrong? Any help or explanation would be greatly appreciated. Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)