Posted to dev@zookeeper.apache.org by "Karolos Antoniadis (Jira)" <ji...@apache.org> on 2019/09/03 17:09:00 UTC
[jira] [Created] (ZOOKEEPER-3534) Non-stop communication between participants and observers.
Karolos Antoniadis created ZOOKEEPER-3534:
---------------------------------------------
Summary: Non-stop communication between participants and observers.
Key: ZOOKEEPER-3534
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3534
Project: ZooKeeper
Issue Type: Bug
Reporter: Karolos Antoniadis
Attachments: create_np_case_3.sh
Hello ZooKeeper developers,
there are cases during *leader election*, where there is non-stop communication between observers and participants.
This communication occurs as follows:
- an observer sends a notification to a participant
- the participant responds
- the observer sends another notification, and so on
It is possible for an observer-participant pair to exchange hundreds of notification messages within one second. As a consequence, the system is burdened with unnecessary load, and the logs fill up with useless information, as can be seen below:
{noformat}
2019-09-03 16:37:22,630 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
2019-09-03 16:37:22,632 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
2019-09-03 16:37:22,633 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
2019-09-03 16:37:22,635 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
2019-09-03 16:37:22,635 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000{noformat}
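To gauge how heavy the flood is, one could count the notifications received per sender in a log excerpt. The following is a minimal sketch, assuming the log line format shown above. The class {{NotificationCounter}} and its method {{countBySender}} are hypothetical helpers written for illustration and are not part of ZooKeeper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NotificationCounter {
    // Assumes the "Notification: ... n.sid:<id>" layout seen in the excerpt above.
    private static final Pattern SID = Pattern.compile("Notification: .*?n\\.sid:(\\d+)");

    /** Counts notification log lines per sender id (n.sid). */
    static Map<Long, Integer> countBySender(List<String> lines) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = SID.matcher(line);
            if (m.find()) {
                counts.merge(Long.parseLong(m.group(1)), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened copies of the log lines above.
        List<String> sample = Arrays.asList(
            "2019-09-03 16:37:22,630 [myid:4] - INFO - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING",
            "2019-09-03 16:37:22,632 [myid:4] - INFO - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING",
            "2019-09-03 16:37:22,633 [myid:4] - INFO - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING");
        System.out.println(countBySender(sample));
    }
}
```

A sender whose count keeps growing by hundreds per second while no leader gets elected is a likely sign of the behavior described in this issue.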
h4. Why does the non-stop communication bug occur?
This bug stems from the fact that when a participant receives a notification from an observer, the participant responds right away, as can be seen [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L325] - it is even written in the comments. Now, when the observer receives back the message from the participant there are 2 cases that could lead to non-stop communication:
1) The observer has a greater {{logicalclock}} than the participant and both the observer and the participant are in a {{LOOKING}} state. In such a case, the observer responds right away to the participant as can be seen [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L392].
2) The observer is {{OBSERVING}} while the participant is still {{LOOKING}}. In this case, the non-stop communication ensues due to the code [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L413].
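The two cases can be illustrated with a toy model of the reply rules. This is only a sketch of the behavior described above, not ZooKeeper code: {{PingPongSketch}}, {{Peer}}, {{replies}} and {{exchange}} are hypothetical names, and the rules are simplified to the three conditions mentioned in this issue (a participant always answers an observer, a {{LOOKING}} server answers a {{LOOKING}} sender with a smaller {{logicalclock}}, and a non-{{LOOKING}} server answers a {{LOOKING}} sender).

```java
public class PingPongSketch {
    enum State { LOOKING, OBSERVING }

    /** Hypothetical, simplified stand-in for a server; not a ZooKeeper class. */
    static final class Peer {
        final boolean voter;      // true for a participant, false for an observer
        final State state;
        final long logicalClock;

        Peer(boolean voter, State state, long logicalClock) {
            this.voter = voter;
            this.state = state;
            this.logicalClock = logicalClock;
        }
    }

    /** True if the receiver would immediately send a notification back to the sender. */
    static boolean replies(Peer receiver, Peer sender) {
        if (!sender.voter) {
            return true; // a participant responds right away to an observer
        }
        if (receiver.state == State.LOOKING) {
            // Case 1: both are LOOKING and the sender's clock lags behind.
            return sender.state == State.LOOKING && sender.logicalClock < receiver.logicalClock;
        }
        // Case 2: the receiver is not LOOKING but the sender is.
        return sender.state == State.LOOKING;
    }

    /** Counts messages exchanged after the observer's first notification, capped at limit. */
    static int exchange(Peer observer, Peer participant, int limit) {
        int messages = 1; // the observer's initial notification
        Peer sender = observer;
        Peer receiver = participant;
        while (messages < limit && replies(receiver, sender)) {
            messages++;
            Peer previousSender = sender;
            sender = receiver;
            receiver = previousSender;
        }
        return messages;
    }

    public static void main(String[] args) {
        Peer participant = new Peer(true, State.LOOKING, 2);
        // Case 1: observer is LOOKING with a greater logical clock.
        System.out.println(exchange(new Peer(false, State.LOOKING, 3), participant, 1000));
        // Case 2: observer is OBSERVING while the participant is LOOKING.
        System.out.println(exchange(new Peer(false, State.OBSERVING, 2), participant, 1000));
        // Equal clocks, both LOOKING: the exchange stops after the participant's reply.
        System.out.println(exchange(new Peer(false, State.LOOKING, 2), participant, 1000));
    }
}
```

In this model, the first two calls only stop because of the artificial cap, whereas the third stops after two messages.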
h4. How can we reproduce this non-stop communication bug?
It is not trivial to reproduce this bug, although we saw it occurring in the wild. To reproduce this bug, we provide a script that utilizes docker and that can be used to easily debug ZK code. The script starts a ZK cluster with 3 participants (P1, P2, P3) and 2 observers (O1, O2). The script together with instructions on how to use it can be found [here|https://github.com/insumity/zookeeper_debug_tool].
Using the script, there are at least 2 ways to reproduce the bug:
1) We can artificially delay the leader election by introducing the following code in {{FastLeaderElection}} ([here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L1006]).
{code:java}
// Verify if there is any change in the proposed leader
int time = finalizeWait;
if (self.getId() >= 1 && self.getId() <= 3) {
    time = 2000;
}
{code}
and changing the immediately succeeding line:
{code:java}
while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
{code}
to
{code:java}
while ((n = recvqueue.poll(time, TimeUnit.MILLISECONDS)) != null) {
{code}
Now, if we run a ZK cluster and force a leader election by killing the leader, we see the non-stop communication occurring. The reason is that, because of this delay, the observer restarts (incrementing its {{logicalclock}}), tries to connect to the previous leader, and fails because the previous leader has crashed. The observer then restarts once more, incrementing {{logicalclock}} again, which starts the non-stop communication.
2) Another way to reproduce the bug is to create a network partition that separates P1 from P2, P3, and O2 but keeps participant P1 connected to observer O1. In such a case, the non-stop communication ensues since O1 is {{OBSERVING}} while P1 remains in a {{LOOKING}} state. To reproduce this with the above script:
* wait until the ZK cluster starts running
* on your local machine, run {{./create_np_case_3.sh}} (the file attached to this issue)
* force a leader election by restarting the leader (most likely the leader is server 3)
It is true that scenario 2 is slightly unrealistic. However, the first scenario, where leader election takes too long to complete, is quite realistic. Whenever we saw this non-stop communication bug, it was because leader election took too long to complete. For instance, it could occur if there is some type of split vote during leader election and the elected leader times out while {{waiting for epoch from quorum}} ([here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L1350]).
h4. How can we fix this issue?
One idea would be that before an observer starts observing a leader, it verifies that the leader is up and running using a check similar to {{checkLeader}} as is done [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L1037].
This would prevent non-stop communication between observers and participants during long leader elections, since observers would not try to connect to an already failed leader and hence would not increase their {{logicalclock}}. However, this fix on its own does not address the 2nd way of reproducing the bug described above.
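As a rough illustration of the idea, an observer could refuse to start observing until the proposed leader itself has announced that it is leading. The sketch below is one possible interpretation under stated assumptions, not the actual {{checkLeader}} implementation: {{LeaderLivenessSketch}}, {{leaderLooksAlive}} and the {{reportedStates}} map (server id to the last state that server announced about itself) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class LeaderLivenessSketch {
    enum State { LOOKING, FOLLOWING, LEADING, OBSERVING }

    /**
     * Returns true only if the proposed leader itself has reported LEADING.
     * A leader we never heard from (for example, a crashed one) has no entry
     * in the map and is therefore treated as not alive.
     */
    static boolean leaderLooksAlive(Map<Long, State> reportedStates, long leaderId) {
        return reportedStates.get(leaderId) == State.LEADING;
    }

    public static void main(String[] args) {
        Map<Long, State> reported = new HashMap<>();
        reported.put(1L, State.FOLLOWING);
        reported.put(2L, State.FOLLOWING);

        // Server 3 is the proposed leader but has crashed and sent nothing.
        System.out.println(leaderLooksAlive(reported, 3L)); // prints "false"

        // Once server 3 announces LEADING, the observer may proceed.
        reported.put(3L, State.LEADING);
        System.out.println(leaderLooksAlive(reported, 3L)); // prints "true"
    }
}
```

With such a guard, the observer would keep waiting instead of connecting to a failed leader and restarting, so its {{logicalclock}} would not be incremented needlessly.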
Best Regards,
Karolos