Posted to dev@zookeeper.apache.org by "Patrick Hunt (JIRA)" <ji...@apache.org> on 2012/07/20 02:52:35 UTC

[jira] [Created] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Patrick Hunt created ZOOKEEPER-1514:
---------------------------------------

             Summary: FastLeaderElection - leader ignores the round information when joining a quorum
                 Key: ZOOKEEPER-1514
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.3.4
            Reporter: Patrick Hunt
            Priority: Critical
             Fix For: 3.4.4, 3.5.0, 3.3.7


In the following case we have a 3-server ensemble.

Initially all is well; zk3 is the leader.

However, zk3 fails, restarts, and rejoins the quorum as the new leader (it was the old leader and is still the leader after re-election).

The two existing followers, zk1 and zk2, rejoin the new quorum, again as followers of zk3.

zk1 then fails, its data directory is deleted (so it has no state whatsoever), and it is restarted. However, zk1 can never rejoin the quorum (even after an hour). During this time zk2 and zk3 are serving properly.

Later, all three servers are restarted and properly form a functional quorum.


Here are some interesting log snippets. Nothing else of interest was seen in the logs during this time:

zk3: this is where it becomes the leader again after initially failing (as the leader). Notice that its "round" (832) is ahead of that of zk1 and zk2 (831):

{noformat}
2012-07-18 17:19:35,423 - INFO  [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@663] - New election. My id =  3, Proposed zxid = 77309411648
2012-07-18 17:19:35,423 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 77309411648 (n.zxid), 832 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
2012-07-18 17:19:35,424 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state)
2012-07-18 17:19:35,424 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING (my state)
2012-07-18 17:19:35,424 - INFO  [QuorumPeer:/0.0.0.0:2181:QuorumPeer@655] - LEADING
{noformat}
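
The zxid values in these notifications are easier to compare once split into their two halves. A ZooKeeper zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are a counter. The sketch below decodes the two zxids from the log; the class and method names are illustrative only and are not part of ZooKeeper.

```java
final class ZxidSketch {
    /** Epoch: the high 32 bits of a zxid. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Counter: the low 32 bits of a zxid. */
    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    static String describe(long zxid) {
        return zxid + " -> epoch " + epochOf(zxid) + ", counter " + counterOf(zxid);
    }

    public static void main(String[] args) {
        // zxid proposed by zk3 when it restarts: epoch 18, counter 320
        System.out.println(describe(77309411648L));
        // zxid reported by zk1 and zk2 while FOLLOWING: epoch 17, counter 448
        System.out.println(describe(73014444480L));
    }
}
```

Read this way, zk3's proposal is one epoch ahead of what the two followers report at that moment.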

zk1, which won't come back. Notice that zk3 is reporting the round as 831, while zk2 thinks that the round is 832:

{noformat}
2012-07-18 17:31:12,015 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 77309411648 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state)
2012-07-18 17:31:12,016 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my state)
2012-07-18 17:31:12,017 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 77309411648 (n.zxid), 832 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state)
2012-07-18 17:31:15,219 - INFO  [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@697] - Notification time out: 6400
{noformat}
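
One way to see why zk1 might stay in LOOKING is the simplified model below. It assumes that a LOOKING peer joins an established ensemble only when a majority of the reports it receives from FOLLOWING/LEADING peers agree on leader, zxid and round. That rule is an assumption made for illustration, and every name in the sketch is hypothetical; it is not the FastLeaderElection code.

```java
import java.util.ArrayList;
import java.util.List;

final class RoundMismatchSketch {
    /** What a peer that is already FOLLOWING or LEADING reports (one log line). */
    static final class Notification {
        final long sid, leader, zxid, round;

        Notification(long sid, long leader, long zxid, long round) {
            this.sid = sid;
            this.leader = leader;
            this.zxid = zxid;
            this.round = round;
        }
    }

    /** Counts reports (including n itself) that match n on leader, zxid and round. */
    static int supporters(List<Notification> seen, Notification n) {
        int count = 0;
        for (Notification other : seen) {
            if (other.leader == n.leader && other.zxid == n.zxid && other.round == n.round) {
                count++;
            }
        }
        return count;
    }

    /** True if some report is backed by a strict majority of the ensemble. */
    static boolean canJoin(List<Notification> seen, int ensembleSize) {
        for (Notification n : seen) {
            if (supporters(seen, n) > ensembleSize / 2) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Values copied from the zk1 log: zk3 reports round 831, zk2 reports round 832.
        List<Notification> fromLog = new ArrayList<>();
        fromLog.add(new Notification(3, 3, 73014444480L, 831));
        fromLog.add(new Notification(2, 3, 77309411648L, 832));
        System.out.println("reports from the log agree: " + canJoin(fromLog, 3));

        // Hypothetical case in which the leader reported the same zxid and round as zk2.
        List<Notification> consistent = new ArrayList<>();
        consistent.add(new Notification(3, 3, 77309411648L, 832));
        consistent.add(new Notification(2, 3, 77309411648L, 832));
        System.out.println("consistent reports agree: " + canJoin(consistent, 3));
    }
}
```

Under this model the two reports in the log never form a majority, because each is backed by only one server, while the hypothetical consistent pair does.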


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Junqueira updated ZOOKEEPER-1514:
----------------------------------------

    Attachment: ZOOKEEPER-1514.patch
    
> FastLeaderElection - leader ignores the round information when joining a quorum
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1514
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.4
>            Reporter: Patrick Hunt
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 3.4.4, 3.5.0, 3.3.7
>
>         Attachments: ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch
>

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425397#comment-13425397 ] 

Henry Robinson commented on ZOOKEEPER-1514:
-------------------------------------------

Hi Flavio - 

I don't really mind the check, it's just completely unnecessary (since listener == null => NPE => failed test). Let's keep it in if you think it is important. 

What is a problem, and I agree not worth fixing here, is that this is yet another example of class members not being hidden behind getters / setters that maintain correct invariants. Anyone can set listener to null, because it's a non-final public member, so every read of that variable in code that mustn't crash has to defensively check that it's not null, when we should be relying on the class to do this for us. 

Anyhow, this looks ok to me - +1, happy to commit. 
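
The encapsulation point above can be sketched as follows. This is a minimal illustration, not the ZooKeeper code under review: the class names and the `Listener` interface are hypothetical stand-ins, and the only assumption carried over from the comment is that the listener was a non-final public member.

```java
import java.util.Objects;

final class ListenerHolderSketch {
    /** Hypothetical stand-in for the real listener type. */
    interface Listener {
        void start();
    }

    /** Fragile: any caller may assign null, so every reader has to re-check. */
    static final class Exposed {
        public Listener listener;
    }

    /** Encapsulated: the non-null invariant is enforced once, at construction. */
    static final class Encapsulated {
        private final Listener listener;

        Encapsulated(Listener listener) {
            this.listener = Objects.requireNonNull(listener, "listener");
        }

        Listener getListener() {
            return listener;
        }
    }

    public static void main(String[] args) {
        Encapsulated holder = new Encapsulated(() -> System.out.println("listener started"));
        // No defensive null check is needed here: the constructor already rejected null.
        holder.getListener().start();
    }
}
```

With the second form, a null listener fails fast at the point where it is supplied, rather than at some later read.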
                
> FastLeaderElection - leader ignores the round information when joining a quorum
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1514
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.4
>            Reporter: Patrick Hunt
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 3.4.4, 3.5.0, 3.3.7
>
>         Attachments: ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch
>

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427988#comment-13427988 ] 

Hudson commented on ZOOKEEPER-1514:
-----------------------------------

Integrated in ZooKeeper-trunk #1635 (See [https://builds.apache.org/job/ZooKeeper-trunk/1635/])
    ZOOKEEPER-1514. FastLeaderElection - leader ignores the round information when joining a quorum (flavio via henryr) (Revision 1368737)

     Result = FAILURE
henry : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1368737
Files : 
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FLEBackwardElectionRoundTest.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FLELostMessageTest.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FLETestUtils.java

                
> FastLeaderElection - leader ignores the round information when joining a quorum
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1514
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.4
>            Reporter: Patrick Hunt
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 3.4.4, 3.5.0
>
>         Attachments: ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch
>

[jira] [Updated] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Junqueira updated ZOOKEEPER-1514:
----------------------------------------

    Attachment: ZOOKEEPER-1514.patch
    
> FastLeaderElection - leader ignores the round information when joining a quorum
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1514
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.4
>            Reporter: Patrick Hunt
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 3.4.4, 3.5.0, 3.3.7
>
>         Attachments: ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch
>

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423776#comment-13423776 ] 

Flavio Junqueira commented on ZOOKEEPER-1514:
---------------------------------------------

I used blame to track where we have introduced the check in tests for the first time. I believe it started with ZOOKEEPER-480, and we were essentially trying to implement a mock server. If you check QuorumPeer.createElectionAlgorithm(), this is essentially what we are doing there. I'd rather leave as is, and if you feel we need to revisit the way we are starting the listener, then perhaps we need to create another jira for it, since it touches other parts of the code base.

I'll upload a patch with the other modifications you suggested, so that if you agree with my assessment, then we can move forward with that one. 
                
> FastLeaderElection - leader ignores the round information when joining a quorum
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1514
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.4
>            Reporter: Patrick Hunt
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 3.4.4, 3.5.0, 3.3.7
>
>         Attachments: ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch
>

[jira] [Updated] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Junqueira updated ZOOKEEPER-1514:
----------------------------------------

    Attachment: ZOOKEEPER-1514.patch

Patch to fix this issue including test.
                
> FastLeaderElection - leader ignores the round information when joining a quorum
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1514
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.4
>            Reporter: Patrick Hunt
>            Priority: Critical
>             Fix For: 3.4.4, 3.5.0, 3.3.7
>
>         Attachments: ZOOKEEPER-1514.patch
>

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426138#comment-13426138 ] 

Flavio Junqueira commented on ZOOKEEPER-1514:
---------------------------------------------

bq. I don't really mind the check, it's just completely unnecessary (since listener == null => NPE => failed test). Let's keep it in if you think it is important. 

Sounds fine, I'll remove the check from both tests and upload a new patch. 

bq. What is a problem, and I agree not worth fixing here, is that this is yet another example of class members not being hidden behind getters / setters that maintain correct invariants. Anyone can set listener to null, because it's a non-final public member, so every read of that variable in code that mustn't crash has to defensively check that it's not null, when we should be relying on the class to do this for us. 

I totally agree. I think we did not expect the listener object to be called in multiple places and were a bit lazy. :-)


                
> FastLeaderElection - leader ignores the round information when joining a quorum
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1514
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.4
>            Reporter: Patrick Hunt
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 3.4.4, 3.5.0, 3.3.7
>
>         Attachments: ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch
>
>
> In the following case we have a 3 server ensemble.
> Initially all is well, zk3 is the leader.
> However zk3 fails, restarts, and rejoins the quorum as the new leader (was the old leader, still the leader after re-election)
> The existing two followers, zk1 and zk2 rejoin the new quorum again as followers of zk3.
> zk1 then fails, the datadirectory is deleted (so it has no state whatsoever) and restarted. However zk1 can never rejoin the quorum (even after an hour). During this time zk2 and zk3 are serving properly.
> Later all three servers are later restarted and properly form a functional quourm.
> Here are some interesting log snippets. Nothing else of interest was seen in the logs during this time:
> zk3. This is where it becomes the leader after failing initially (as the leader). Notice the "round" is ahead of zk1 and zk2:
> {noformat}
> 2012-07-18 17:19:35,423 - INFO  [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@663] - New election. My id =  3, Proposed zxid = 77309411648
> 2012-07-18 17:19:35,423 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 77309411648 (n.zxid), 832 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
> 2012-07-18 17:19:35,424 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state)
> 2012-07-18 17:19:35,424 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING (my state)
> 2012-07-18 17:19:35,424 - INFO  [QuorumPeer:/0.0.0.0:2181:QuorumPeer@655] - LEADING
> {noformat}
> zk1 which won't come back. Notice that zk3 is reporting the round as 831, while zk2 thinks that the round is 832:
> {noformat}
> 2012-07-18 17:31:12,015 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 77309411648 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state)
> 2012-07-18 17:31:12,016 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my state)
> 2012-07-18 17:31:12,017 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 77309411648 (n.zxid), 832 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state)
> 2012-07-18 17:31:15,219 - INFO  [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@697] - Notification time out: 6400
> {noformat}
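
The two replies that zk1 receives in the last snippet can be modelled in a few lines. This is a simplified sketch, not the actual FastLeaderElection code: the class and method names are hypothetical, and it assumes that a joining server only counts two votes as support for the same leader when leader id, zxid and round all match, which is what the logs suggest.

```java
import java.util.ArrayList;
import java.util.List;

public class RoundMismatchSketch {
    /** Hypothetical, simplified vote: the fields printed in the Notification log lines. */
    public static final class Vote {
        final long leader;
        final long zxid;
        final long round;

        public Vote(long leader, long zxid, long round) {
            this.leader = leader;
            this.zxid = zxid;
            this.round = round;
        }

        boolean sameAs(Vote other) {
            return leader == other.leader && zxid == other.zxid && round == other.round;
        }
    }

    /** True if more than half of the ensemble sent a vote identical to the candidate. */
    public static boolean hasQuorum(List<Vote> received, Vote candidate, int ensembleSize) {
        int count = 0;
        for (Vote v : received) {
            if (v.sameAs(candidate)) {
                count++;
            }
        }
        return count > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // What zk1 hears, taken from the log lines above.
        List<Vote> seenByZk1 = new ArrayList<>();
        seenByZk1.add(new Vote(3, 73014444480L, 831)); // from zk3 (LEADING)
        seenByZk1.add(new Vote(3, 77309411648L, 832)); // from zk2 (FOLLOWING)

        // Each vote has a single supporter, so neither reaches 2 of 3.
        System.out.println(hasQuorum(seenByZk1, seenByZk1.get(0), 3)); // false
        System.out.println(hasQuorum(seenByZk1, seenByZk1.get(1), 3)); // false

        // If the leader reported the same zxid and round as its follower,
        // the two votes would agree and zk1 could join.
        List<Vote> consistent = new ArrayList<>();
        consistent.add(new Vote(3, 77309411648L, 832));
        consistent.add(new Vote(3, 77309411648L, 832));
        System.out.println(hasQuorum(consistent, consistent.get(0), 3)); // true
    }
}
```

Under that assumption, zk1 keeps timing out because zk2 and zk3 name the same leader but never report matching vote details.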

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426172#comment-13426172 ] 

Hadoop QA commented on ZOOKEEPER-1514:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12538623/ZOOKEEPER-1514.patch
  against trunk revision 1366784.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1148//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1148//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1148//console

This message is automatically generated.
                

        

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420379#comment-13420379 ] 

Henry Robinson commented on ZOOKEEPER-1514:
-------------------------------------------

Hey Flavio - 

Thanks for fixing this so quickly! Patch looks really nice, a few nits:

* I don't think you need to duplicate {{createMsg}} in {{FLEBackwardElectionRound}}, since it's now in {{FLETestUtils}}
* Could you add a comment to {{FLEBackwardElectionRound.testBackwardElectionRound}} describing the bug it's testing for, and I guess referencing this JIRA?
* If {{listener}} is {{null}} for {{QuorumCnxManager.Listener listener = cnxManagers[0].listener;}} and similar, shouldn't the test fail straight away? Under what circumstances would this be true?
* There's a small typo - 'instace' -> 'instance'

                

        

[jira] [Updated] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Junqueira updated ZOOKEEPER-1514:
----------------------------------------

    Attachment: ZOOKEEPER-1514.patch
    

        

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419776#comment-13419776 ] 

Hadoop QA commented on ZOOKEEPER-1514:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12537441/ZOOKEEPER-1514.patch
  against trunk revision 1362660.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1142//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1142//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1142//console

This message is automatically generated.
                

        

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424286#comment-13424286 ] 

Henry Robinson commented on ZOOKEEPER-1514:
-------------------------------------------

Flavio - this looks fine. The point I am trying to make about this bit of code:

{code}
  if(listener != null){
    listener.start();
  } else {
    LOG.error("Null listener when initializing cnx manager");
    Assert.fail("Failed to create cnx manager");
  }
{code}

is that there's no need for the null check, since if {{listener}} is null, there'll be an NPE thrown which will fail the test anyhow. Plus, looking at {{QuorumCnxManager.java:153}}, I can't see any way in which {{listener}} can be null, because it's unambiguously assigned to a {{new Listener()}}. Is there a case that I'm missing?

I know this doesn't really affect the functionality of the patch, but if these checks aren't necessary, it will be confusing to the reader in the future. 
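
To illustrate the argument, here is a small self-contained sketch comparing the two variants. The names ({{NullCheckSketch}}, {{startChecked}}, {{startUnchecked}}) are hypothetical, the nested {{Listener}} is a stand-in for the real class, and a thrown {{AssertionError}} stands in for {{Assert.fail}}. Both variants end in an exception when the listener is null, so a test runner would report a failed test either way.

```java
public class NullCheckSketch {
    /** Hypothetical stand-in for the real listener class. */
    public static class Listener {
        boolean started = false;

        void start() {
            started = true;
        }
    }

    /** Variant with the explicit null check, as in the patch. */
    public static void startChecked(Listener listener) {
        if (listener != null) {
            listener.start();
        } else {
            // Stand-in for Assert.fail("Failed to create cnx manager").
            throw new AssertionError("Failed to create cnx manager");
        }
    }

    /** Variant without the check: a null listener throws NullPointerException. */
    public static void startUnchecked(Listener listener) {
        listener.start();
    }
}
```

The difference is only in the kind of exception and the message attached to it, not in whether the test passes.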
                

        

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423443#comment-13423443 ] 

Henry Robinson commented on ZOOKEEPER-1514:
-------------------------------------------

I'm not sure that removing the null checks would mean findbugs warnings (easy to try!) - and if the listener is null, the test will throw an NPE and fail anyhow which seems like the right thing to do. So I would suggest just removing the null checks. What do you think?
                

        

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424388#comment-13424388 ] 

Flavio Junqueira commented on ZOOKEEPER-1514:
---------------------------------------------

Hi Henry, I believe I understand the point you're raising, so perhaps I'm not making myself clear. Let me try to add more detail.

This null check:

{noformat}
  if(listener != null){
    listener.start();
  } else {
    LOG.error("Null listener when initializing cnx manager");
    Assert.fail("Failed to create cnx manager");
  }
{noformat}

appears in a number of places in the code, essentially every time we use the listener. The first time it appeared was in QuorumPeer.createElectionAlgorithm() due to findbugs warnings as I mentioned before (ZOOKEEPER-407). 

When we created a mock server for FLELostMessageTest, we simply copied the part that starts a listener. Currently, the check appears in at least a couple of places, and if I remove it from this patch, we should also remove it from the other places. But removing it from the other parts of the code is not part of this issue, so if you feel strongly about this change, I suggest we leave the patch with this check in and discuss removing the null check in another jira, so that we make uniform changes across the code without mixing the issues.



                
> FastLeaderElection - leader ignores the round information when joining a quorum
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1514
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1514
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.4
>            Reporter: Patrick Hunt
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 3.4.4, 3.5.0, 3.3.7
>
>         Attachments: ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch, ZOOKEEPER-1514.patch
>
>
> In the following case we have a 3 server ensemble.
> Initially all is well, zk3 is the leader.
> However zk3 fails, restarts, and rejoins the quorum as the new leader (was the old leader, still the leader after re-election)
> The existing two followers, zk1 and zk2 rejoin the new quorum again as followers of zk3.
> zk1 then fails, the datadirectory is deleted (so it has no state whatsoever) and restarted. However zk1 can never rejoin the quorum (even after an hour). During this time zk2 and zk3 are serving properly.
> Later all three servers are later restarted and properly form a functional quourm.
> Here are some interesting log snippets. Nothing else of interest was seen in the logs during this time:
> zk3. This is where it becomes the leader after failing initially (as the leader). Notice the "round" is ahead of zk1 and zk2:
> {noformat}
> 2012-07-18 17:19:35,423 - INFO  [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@663] - New election. My id =  3, Proposed zxid = 77309411648
> 2012-07-18 17:19:35,423 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 77309411648 (n.zxid), 832 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
> 2012-07-18 17:19:35,424 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state)
> 2012-07-18 17:19:35,424 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING (my state)
> 2012-07-18 17:19:35,424 - INFO  [QuorumPeer:/0.0.0.0:2181:QuorumPeer@655] - LEADING
> {noformat}
> zk1 which won't come back. Notice that zk3 is reporting the round as 831, while zk2 thinks that the round is 832:
> {noformat}
> 2012-07-18 17:31:12,015 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 77309411648 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state)
> 2012-07-18 17:31:12,016 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 73014444480 (n.zxid), 831 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my state)
> 2012-07-18 17:31:12,017 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 77309411648 (n.zxid), 832 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state)
> 2012-07-18 17:31:15,219 - INFO  [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@697] - Notification time out: 6400
> {noformat}
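
To make the round mismatch in the logs above concrete, here is a minimal sketch. It is not the actual FastLeaderElection code. It assumes, as a simplification, that a LOOKING server joins an existing ensemble only when a majority of the notifications it receives from non-LOOKING servers carry exactly the same leader, zxid and round. The names RoundMismatchSketch, Vote, hasQuorum, canJoin and fromLogs are hypothetical; the numbers are the ones zk1 logged for zk3 and zk2.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class RoundMismatchSketch {

    /** Hypothetical, simplified stand-in for the vote carried by a notification. */
    static final class Vote {
        final long leader;
        final long zxid;
        final long round;

        Vote(long leader, long zxid, long round) {
            this.leader = leader;
            this.zxid = zxid;
            this.round = round;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Vote)) {
                return false;
            }
            Vote other = (Vote) o;
            return leader == other.leader && zxid == other.zxid && round == other.round;
        }

        @Override
        public int hashCode() {
            return Objects.hash(leader, zxid, round);
        }
    }

    /** True if strictly more than half of the ensemble reported exactly this vote. */
    static boolean hasQuorum(Map<Long, Vote> received, Vote candidate, int ensembleSize) {
        int count = 0;
        for (Vote v : received.values()) {
            if (v.equals(candidate)) {
                count++;
            }
        }
        return count > ensembleSize / 2;
    }

    /** Assumed join rule: some reported vote must be backed by a majority. */
    static boolean canJoin(Map<Long, Vote> received, int ensembleSize) {
        for (Vote v : received.values()) {
            if (hasQuorum(received, v, ensembleSize)) {
                return true;
            }
        }
        return false;
    }

    /** Notifications zk1 logged from the servers that were not LOOKING, keyed by sid. */
    static Map<Long, Vote> fromLogs() {
        Map<Long, Vote> received = new HashMap<>();
        received.put(3L, new Vote(3, 73014444480L, 831)); // zk3, LEADING
        received.put(2L, new Vote(3, 77309411648L, 832)); // zk2, FOLLOWING
        return received;
    }

    public static void main(String[] args) {
        Map<Long, Vote> received = fromLogs();
        System.out.println("mismatched reports, can join: " + canJoin(received, 3));

        // Hypothetical: the leader reports the same zxid and round as its follower.
        received.put(3L, new Vote(3, 77309411648L, 832));
        System.out.println("consistent reports, can join: " + canJoin(received, 3));
    }
}
```

Under this assumed rule, the two reports never add up to a majority for any single vote in a 3 server ensemble, which matches zk1 timing out and retrying. Once zk3 and zk2 report the same values, the majority check passes.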

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423789#comment-13423789 ] 

Hadoop QA commented on ZOOKEEPER-1514:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12538144/ZOOKEEPER-1514.patch
  against trunk revision 1362660.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1145//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1145//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1145//console

This message is automatically generated.
                

        

[jira] [Commented] (ZOOKEEPER-1514) FastLeaderElection - leader ignores the round information when joining a quorum

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419465#comment-13419465 ] 

Hadoop QA commented on ZOOKEEPER-1514:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12537385/ZOOKEEPER-1514.patch
  against trunk revision 1362660.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    -1 release audit.  The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings).

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1140//testReport/
Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1140//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1140//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1140//console

This message is automatically generated.
                