Posted to dev@zookeeper.apache.org by "Karolos Antoniadis (Jira)" <ji...@apache.org> on 2019/09/03 17:09:00 UTC
[jira] [Created] (ZOOKEEPER-3534) Non-stop communication between participants and observers.

Karolos Antoniadis created ZOOKEEPER-3534:
---------------------------------------------

             Summary: Non-stop communication between participants and observers.
                 Key: ZOOKEEPER-3534
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3534
             Project: ZooKeeper
          Issue Type: Bug
            Reporter: Karolos Antoniadis
         Attachments: create_np_case_3.sh

Hello ZooKeeper developers,

there are cases during *leader election* where observers and participants communicate non-stop.
This communication occurs as follows: 
- an observer sends a notification to a participant
- the participant responds
- an observer sends another notification and so on and so forth ...

It is possible for an observer-participant pair to exchange hundreds of notification messages within one second. As a consequence, the system is burdened with unnecessary load, and the logs fill up with useless information, as can be seen below:

 
{noformat}
2019-09-03 16:37:22,630 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
2019-09-03 16:37:22,632 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
2019-09-03 16:37:22,633 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
2019-09-03 16:37:22,635 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
2019-09-03 16:37:22,635 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000{noformat}
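
To gauge how intense such a flood is, the notification lines can be tallied per second and per sender. The following is a minimal sketch and not part of ZooKeeper: the class name {{NotificationCounter}} and the regular expression are assumptions derived solely from the log format shown above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the ZooKeeper code base.
class NotificationCounter {
    // Captures the timestamp truncated to the second and the sender id (n.sid).
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),\\d+ .*Notification: .*n\\.sid:(\\d+),");

    /** Maps "<second> sid=<sender>" to the number of notification lines seen. */
    public static Map<String, Integer> countPerSecondAndSender(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1) + " sid=" + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of a server log file.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countPerSecondAndSender(lines)
            .forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

A count in the hundreds for a single second and sender would match the behavior described above.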
 

 
h4. Why does the non-stop communication bug occur?

This bug stems from the fact that when a participant receives a notification from an observer, the participant responds right away, as can be seen [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L325] - it is even written in the comments. Now, when the observer receives back the message from the participant there are 2 cases that could lead to non-stop communication:
1) The observer has a greater {{logicalclock}} than the participant and both the observer and the participant are in a {{LOOKING}} state. In such a case, the observer responds right away to the participant as can be seen [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L392]. 
2) The observer is {{OBSERVING}} while the participant is still {{LOOKING}}. In this case, the non-stop communication ensues due to the code [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L413].
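
The two cases can be condensed into a toy model of the reply rules. This is a sketch under simplifying assumptions, not the {{FastLeaderElection}} code: the class {{PingPongModel}} and all of its methods are hypothetical, states and clocks are held fixed during the exchange, and a message cap stands in for "non-stop".

```java
// Hypothetical toy model; it does not reproduce the real ZooKeeper logic.
class PingPongModel {
    public enum State { LOOKING, FOLLOWING, OBSERVING }

    /** Participants answer every notification that comes from an observer. */
    public static boolean participantReplies() {
        return true;
    }

    /** Simplified rule for when an observer answers a participant's message. */
    public static boolean observerReplies(State observer, long observerClock,
                                          State participant, long participantClock) {
        if (participant != State.LOOKING) {
            return false;
        }
        if (observer == State.LOOKING) {
            // Case 1: both LOOKING and the observer has the greater logicalclock.
            return observerClock > participantClock;
        }
        // Case 2: the observer is OBSERVING while the participant is LOOKING.
        return observer == State.OBSERVING;
    }

    /** Counts the messages exchanged, stopping at cap if the exchange never ends. */
    public static int exchange(State observer, long observerClock,
                               State participant, long participantClock, int cap) {
        int messages = 1; // initial notification: observer -> participant
        while (messages < cap) {
            if (!participantReplies()) {
                break;
            }
            messages++; // participant -> observer
            if (messages >= cap
                    || !observerReplies(observer, observerClock, participant, participantClock)) {
                break;
            }
            messages++; // observer -> participant
        }
        return messages;
    }
}
```

In this model, {{exchange(State.LOOKING, 2, State.LOOKING, 2, 1000)}} stops after two messages, whereas giving the observer a greater clock, or putting it in the {{OBSERVING}} state, runs until the cap is reached.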
h4. How can we reproduce this non-stop communication bug?

Reproducing this bug is not trivial, although we saw it occur in the wild. To reproduce it, we provide a Docker-based script that can also be used to easily debug ZK code. The script starts a ZK cluster with 3 participants (P1, P2, P3) and 2 observers (O1, O2). The script, together with instructions on how to use it, can be found [here|https://github.com/insumity/zookeeper_debug_tool].

 

Using the script, there are at least 2 ways to reproduce the bug:
1) We can artificially delay the leader election by introducing the following code in {{FastLeaderElection}} ([here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L1006]).

 
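{code:java}
// Verify if there is any change in the proposed leader
int time = finalizeWait;
// Debugging aid: servers with ids 1 to 3 (the participants in this setup)
// wait 2000 ms instead of finalizeWait, which prolongs leader election.
if (self.getId() >= 1 && self.getId() <= 3) {
    time = 2000;
}
{code}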
 

and changing the immediately succeeding line:
{code:java}
while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
{code}
to 

 
{code:java}
while ((n = recvqueue.poll(time, TimeUnit.MILLISECONDS)) != null) { 
{code}
Now, if we run a ZK cluster and force a leader election by killing the leader, we see the non-stop communication occurring. The reason is that, as a result of this delay, the observer restarts (incrementing its {{logicalclock}}), tries to connect to the previous leader, and fails because the previous leader has crashed. The observer then restarts again, incrementing {{logicalclock}} once more, which starts the non-stop communication.


2) Another way to reproduce the bug is to create a network partition that separates P1 from P2, P3, and O2 but keeps participant P1 connected to observer O1. In such a case, the non-stop communication ensues since O1 is {{OBSERVING}} while P1 remains in a {{LOOKING}} state. To reproduce this with the above script:
 *  wait until the ZK cluster is running
 *  on your local machine, run {{./create_np_case_3.sh}} (the file attached to this issue)
 *  force a leader election by restarting the leader (most likely the leader is server 3)


It is true that scenario 2 is slightly unrealistic. However, the first scenario, where leader election takes too long to complete, is quite realistic. Whenever we saw this non-stop communication bug, it was because leader election took too long to complete. For instance, it could occur if there is some type of split vote during leader election and the elected leader times out while {{waiting for epoch from quorum}} [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L1350].

 
h4. How can we fix this issue?

One idea would be that before an observer starts observing a leader, it verifies that the leader is up and running using a check similar to {{checkLeader}} as is done [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L1037].
This would prevent non-stop communication between observers and participants during long leader elections: observers would not try to connect to an already failed leader, and hence would not increase their {{logicalclock}}. However, this fix on its own does not address the second way of reproducing the bug described above.
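
As a rough illustration of the idea, and not a patch against ZooKeeper, the check could look like the sketch below. The class {{LeaderLivenessCheck}}, its methods, and the simplified {{ServerState}} enum are hypothetical stand-ins. The sketch also assumes that the observer keeps track of the last state reported by each server it has heard from in the current round.

```java
import java.util.Map;

// Hypothetical sketch; the types are simplified stand-ins for ZooKeeper's own.
class LeaderLivenessCheck {
    public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    /**
     * Returns true only if the proposed leader itself has been heard from
     * and reported the LEADING state.
     */
    public static boolean leaderLooksAlive(Map<Long, ServerState> reportedStates,
                                           long proposedLeader) {
        return reportedStates.get(proposedLeader) == ServerState.LEADING;
    }

    /**
     * An observer moves to OBSERVING only once the leader looks alive;
     * otherwise it keeps LOOKING without starting a new round.
     */
    public static ServerState nextObserverState(Map<Long, ServerState> reportedStates,
                                                long proposedLeader) {
        return leaderLooksAlive(reportedStates, proposedLeader)
            ? ServerState.OBSERVING
            : ServerState.LOOKING;
    }
}
```

With such a guard, an observer that has only heard followers name a crashed server as leader would stay in {{LOOKING}} instead of attempting a connection that is bound to fail.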

Best Regards,
Karolos

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)