Posted to dev@zookeeper.apache.org by "Flavio Paiva Junqueira (JIRA)" <ji...@apache.org> on 2010/01/13 10:36:55 UTC

[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799669#action_12799669 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-569:
--------------------------------------------------

One way to implement the test is to write a mock server that forces the particular message interleaving that triggers the bug. I don't claim it is the best way, but it seemed to be a good idea for FLELostMessageTest.
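
To illustrate the mock-server idea in general terms, here is a minimal Python sketch. It is not the code of FLELostMessageTest and uses no ZooKeeper APIs; `MockPeer` and `collect_votes` are hypothetical names made up for this example. Each mock peer replays a fixed script of replies, where `None` stands for a lost message, so the test decides exactly which messages arrive in each round.

```python
class MockPeer:
    """Hypothetical stand-in for a real server: replays a fixed script of replies."""

    def __init__(self, script):
        self._script = list(script)

    def reply(self):
        # None means "this message was lost"; an exhausted script is also silence.
        return self._script.pop(0) if self._script else None


def collect_votes(peers):
    """Poll each peer once and keep only the votes from peers that answered."""
    heard = {}
    for name, peer in peers.items():
        vote = peer.reply()
        if vote is not None:
            heard[name] = vote
    return heard


# B answers in both rounds; C answers in the first round and is silent afterwards.
peers = {"B": MockPeer(["C", "C"]), "C": MockPeer(["C", None])}
first_round = collect_votes(peers)
second_round = collect_votes(peers)
```

With this script, the second poll hears only from B, which is the kind of forced interleaving a test for this bug would need.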

> Failure of elected leader can lead to never-ending leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-569
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-569
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Henry Robinson
>            Assignee: Henry Robinson
>
> Basic LeaderElection can enter a state in which it never terminates.
> As an example, consider a three-node cluster of A, B and C.
> 1. In the first round, A votes for A, B votes for B and C votes for C.
> 2. There is no first-round winner, and since C > B > A, all nodes resolve to vote for C in the second round.
> 3. A and B vote for C, but C fails.
> 4. C is not elected because neither A nor B hears from it, and so votes for it are discarded.
> 5. A and B never reset their votes, despite not hearing from C, so they continue to vote for it ad infinitum.
> Step 5 is the bug. If A and B reset their votes to themselves in the case where the heard-from vote set is empty, leader election will continue.
> I do not know if this affects running ZK clusters, as it is possible that the out-of-band failure detection protocols may cause leader election to be restarted anyhow, but I've certainly seen this in tests. 
> I have a trivial patch which fixes it, but it needs a test (and tests for race conditions are hard to write!)
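
The five steps quoted above can be reproduced with a toy simulation. This is a simplified model written for illustration, not ZooKeeper's LeaderElection code and not the patch mentioned in the issue; the function `elect`, its parameters, and the round structure are all assumptions. The `reset_on_empty` flag switches between the behaviour described in step 5 and the proposed reset-to-self behaviour.

```python
from collections import Counter


def elect(node_ids, crashed_after, reset_on_empty, max_rounds=10):
    """Toy model. node_ids maps name -> numeric id; crashed_after maps
    name -> the round after which that node stops responding."""
    votes = {n: n for n in node_ids}      # round 1: every node votes for itself
    alive = set(node_ids)
    quorum = len(node_ids) // 2 + 1
    for rnd in range(1, max_rounds + 1):
        # Only live nodes are heard from.
        heard = {n: votes[n] for n in alive}
        # Votes for a node that was not heard from are discarded.
        valid = {v: c for v, c in heard.items() if c in heard}
        tally = Counter(valid.values())
        if tally:
            winner, count = tally.most_common(1)[0]
            if count >= quorum:
                return winner, rnd
        for n in alive:
            if valid:
                # No winner: move to the highest-id candidate still in play.
                votes[n] = max(valid.values(), key=node_ids.get)
            elif reset_on_empty:
                # Proposed fix: nothing valid was heard, so vote for yourself.
                votes[n] = n
            # Otherwise (the bug) the stale vote is kept unchanged.
        alive -= {n for n, r in crashed_after.items() if r == rnd}
    return None, max_rounds


ids = {"A": 1, "B": 2, "C": 3}
buggy = elect(ids, {"C": 1}, reset_on_empty=False)
fixed = elect(ids, {"C": 1}, reset_on_empty=True)
```

In this model the buggy variant never elects anyone within the round limit, while the variant with the reset elects B once A and B have voted for themselves and then converged on the higher id.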
