Posted to dev@zookeeper.apache.org by "anaud (Jira)" <ji...@apache.org> on 2020/10/15 15:25:00 UTC
[jira] [Created] (ZOOKEEPER-3972) Convergence fail when a follower
tries to resync with a leader having incomplete commitlog
anaud created ZOOKEEPER-3972:
--------------------------------
Summary: Convergence fail when a follower tries to resync with a leader having incomplete commitlog
Key: ZOOKEEPER-3972
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3972
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.5.8
Reporter: anaud
Attachments: zookeeper-testResyncWithLeaderHavingIncompleteCommitlog.patch
A leader can end up with an incomplete commitlog if it previously resynced with the old leader via snapshot (SNAP) replication.
A follower may then try to resync with this leader after missing some earlier transactions that the leader also does not have in its commitlog.
In that case the resync uses txnlog + commitlog. However, this leads to a convergence failure, because the leader never sends the missing transactions that are not in its commitlog.
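To make the sync choice concrete, here is a minimal sketch of how a leader might pick a resync mode from the follower's last zxid. It is a toy model, not ZooKeeper's actual code: the function name, the mode labels, and the list-based logs are assumptions made for illustration, and it leaves out TRUNC and any size limits.

```python
from typing import List


def choose_sync_mode(follower_last_zxid: int,
                     commitlog: List[int],
                     txnlog: List[int]) -> str:
    """Pick a sync mode for a toy leader (hypothetical helper).

    commitlog and txnlog are sorted lists of the zxids the leader holds
    in memory and on disk. The follower is assumed not to be ahead of
    the leader, so the TRUNC case is not modelled.
    """
    if commitlog and commitlog[0] <= follower_last_zxid <= commitlog[-1]:
        return "DIFF"              # the commitlog alone covers the gap
    if txnlog and txnlog[0] <= follower_last_zxid:
        return "TXNLOG+COMMITLOG"  # replay the disk log, then the commitlog
    return "SNAP"                  # fall back to a full snapshot


# A leader with an empty commitlog and only 0x11 in its txnlog,
# facing a follower whose last zxid is 0x11.
print(choose_sync_mode(0x11, [], [0x11]))
```

Note that nothing in this decision checks whether the leader's own state contains transactions that appear in neither log, which is the situation this report describes.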
Below are the abstract steps to reproduce the bug; the attached patch contains a test case that reproduces it.
Initially, nodes A, B, and C are all in sync.
1. Node A crashes; setData 0x11 on B and C
2. Node B and C crash
3. Node A and B restart
4. Node A crashes; setData 0x21 on B
5. Node B crashes
6. Node B and C restart
7. Node C crashes; setData 0x32 on B
8. Node A and C restart
9. Node B restarts
At step 6, C is a follower and receives a snapshot from B, so C has transaction 0x21 only in its snapshot, not in its commitlog.
At step 8, C becomes the leader. Because 0x21 is not in its commitlog, A never receives it.
In the end, 0x21 exists on B and C, but not on A.
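The effect of steps 6 and 8 can be sketched with a toy model. This is not ZooKeeper code: the `Node` class, `snap_sync`, and `log_sync` are hypothetical, each node is reduced to a set of applied zxids plus a list of zxids available for log-based sync, and the starting states (including A already having 0x11) are assumptions derived from the steps above. Step 7 (0x32) is left out because it does not affect the outcome for 0x21.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    state: Set[int] = field(default_factory=set)  # zxids reflected in the data
    log: List[int] = field(default_factory=list)  # zxids usable for log-based sync


def snap_sync(leader: Node, follower: Node) -> None:
    """Follower copies the leader's state; its log gains no entries."""
    follower.state = set(leader.state)


def log_sync(leader: Node, follower: Node) -> None:
    """Leader streams only logged zxids newer than the follower's last zxid."""
    last = max(follower.state, default=0)
    for zxid in leader.log:
        if zxid > last:
            follower.state.add(zxid)
            follower.log.append(zxid)


# Assumed state just before step 6.
a = Node("A", {0x11}, [0x11])
b = Node("B", {0x11, 0x21}, [0x11, 0x21])
c = Node("C", {0x11}, [0x11])

snap_sync(b, c)  # step 6: C gets a snapshot from B
log_sync(c, a)   # step 8: C leads and resyncs A from its logs

missing_on_a = sorted(c.state - a.state)
print([hex(z) for z in missing_on_a])
```

In this model C's state contains 0x21 after the snapshot, but its log still ends at 0x11, so the log-based resync has nothing to send to A.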