Posted to issues@zookeeper.apache.org by "Michael Han (Jira)" <ji...@apache.org> on 2020/10/19 03:35:00 UTC

[jira] [Commented] (ZOOKEEPER-3972) Convergence fail when a follower tries to resync with a leader having incomplete commitlog

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216430#comment-17216430 ] 

Michael Han commented on ZOOKEEPER-3972:
----------------------------------------

Thanks for reporting this, [~anaud].

I took a brief look, and I believe the issue is real. However, when I ran the test case, it passed both with and without the proposed fix. I suspect the test passes because it falls into the "resort to sending a snapshot" code path we added in ZOOKEEPER-2418: when we detect a gap between the transaction log and the commit log, we fall back to sending a snapshot. To fully test the issue reported here, we would need to refine the attached test case so that the new leader does not have any on-disk transaction log; then the code path added in ZOOKEEPER-2418 will not be hit (and will not mask the real issue we are trying to fix here).
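
For illustration only, the gap check mentioned above can be modelled roughly as follows. This is a simplified sketch, not the actual LearnerHandler code: the class, enum, and method names are hypothetical, zxids are plain longs in ascending order, and it assumes the check is only reached when the on-disk transaction log yields at least one entry.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of the "gap between txnlog and commitlog" check.
public class TxnLogGapSketch {
    enum Outcome { CHECK_NOT_REACHED, FALL_BACK_TO_SNAPSHOT, CONTINUE_WITH_COMMITLOG }

    // txnLogZxids: zxids readable from the on-disk txnlog, ascending (assumption).
    // peerLastZxid: last zxid the follower reports, assumed < minCommittedLog.
    // minCommittedLog: oldest zxid held in the in-memory commit log.
    static Outcome check(List<Long> txnLogZxids, long peerLastZxid, long minCommittedLog) {
        if (txnLogZxids.isEmpty()) {
            // No on-disk txnlog: in this model the gap check is never exercised.
            return Outcome.CHECK_NOT_REACHED;
        }
        long lastQueued = peerLastZxid;
        for (long zxid : txnLogZxids) {
            if (zxid <= peerLastZxid) {
                continue; // the follower already has this transaction
            }
            if (zxid > minCommittedLog) {
                break; // the commit log takes over from here
            }
            lastQueued = zxid;
        }
        // If the txnlog ends before the commit log starts, there is a gap.
        return lastQueued < minCommittedLog
                ? Outcome.FALL_BACK_TO_SNAPSHOT
                : Outcome.CONTINUE_WITH_COMMITLOG;
    }

    public static void main(String[] args) {
        System.out.println(check(Arrays.asList(1L, 2L, 3L), 1L, 6L));             // gap: 4 and 5 missing
        System.out.println(check(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L), 1L, 6L)); // no gap
        System.out.println(check(Collections.<Long>emptyList(), 1L, 6L));         // check not reached
    }
}
```

The third call corresponds to the situation a refined test case would need to create: with an empty on-disk transaction log, the fallback cannot mask the reported bug.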

Another side note: please check out https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute for how to submit patches. The old way of attaching a patch file to JIRA has been deprecated; please use a GitHub pull request.

> Convergence fail when a follower tries to resync with a leader having incomplete commitlog
> ------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3972
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3972
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.8
>            Reporter: anaud
>            Priority: Major
>         Attachments: ZOOKEEPER-3972.patch, zookeeper-testResyncWithLeaderHavingIncompleteCommitlog.patch
>
>
> It is possible for a leader to have an incomplete commitlog because it resync'ed with the old leader via SNAPSHOT replication.
> A follower may then try to resync with that leader, even though there may be some transactions that the follower missed earlier and that the leader does not have in its commitlog.
> If they decide to use txnlog + commitlog to resync, this leads to convergence failure because the leader does not send the missing transactions that are not in its commitlog.
> Here are the abstract steps to reproduce the bug; I attached a patch with a test case that reproduces it.
> Initially, nodes A, B, and C are all sync'ed.
>  1. Node A crashes; setData 0x11 on B and C
>  2. Node B and C crash
>  3. Node A and B restart
>  4. Node A crashes; setData 0x21 on B
>  5. Node B crashes
>  6. Node B and C restart
>  7. Node C crashes; setData 0x32 on B
>  8. Node A and C restart
>  9. Node B restarts
> At step 6, C is a follower getting a snapshot from B, and C does not have the transaction 0x21 in its commitlog (only in the snapshot).
> At step 8, C is the leader but does not have 0x21 in its commitlog, so A never receives it.
> In the end, 0x21 only exists on B and C, but not on A.
> I think the fix should be made in LearnerHandler's syncFollower method as follows:
>  1. Check the last transaction it has in its txnlog + commitlog
>  2. If it is more recent than what it has in its txnlog + commitlog, then it should use Snapshot
>  3. Otherwise, continue with txnlog + commitlog replication
> I attached a patch containing the proposed fix.
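
The convergence failure described in the quoted report can be illustrated with a toy model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the zxids 1, 2, 3 are made-up stand-ins rather than the values from the reproduction steps, and a log-based resync is reduced to "send every logged transaction newer than the follower's last zxid".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical toy model: a leader whose logs miss a transaction it only
// received inside a snapshot cannot forward that transaction via a log-based resync.
public class IncompleteCommitlogSketch {
    // Transactions a log-based resync would send: every logged zxid newer
    // than the follower's last zxid.
    static List<Long> diffFromLogs(List<Long> leaderLogZxids, long followerLastZxid) {
        List<Long> toSend = new ArrayList<>();
        for (long zxid : leaderLogZxids) {
            if (zxid > followerLastZxid) {
                toSend.add(zxid);
            }
        }
        return toSend;
    }

    public static void main(String[] args) {
        // The leader has applied 1, 2, 3, but 2 arrived only via a snapshot,
        // so its txnlog + commitlog contain just 1 and 3 (made-up values).
        List<Long> leaderApplied = Arrays.asList(1L, 2L, 3L);
        List<Long> leaderLogs = Arrays.asList(1L, 3L);

        // The follower has applied only 1.
        TreeSet<Long> follower = new TreeSet<>(Arrays.asList(1L));
        follower.addAll(diffFromLogs(leaderLogs, follower.last()));

        System.out.println("follower state: " + follower);
        System.out.println("converged: " + follower.containsAll(leaderApplied));
    }
}
```

In this model the follower ends up without transaction 2, which mirrors how a transaction held only in the leader's snapshot is never sent.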



--
This message was sent by Atlassian Jira
(v8.3.4#803005)