Posted to dev@zookeeper.apache.org by "Vishal K (JIRA)" <ji...@apache.org> on 2011/01/14 13:29:47 UTC

[jira] Created: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

new peer goes in LEADING state even if ensemble is online
---------------------------------------------------------

                 Key: ZOOKEEPER-975
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.3.2
            Reporter: Vishal K
             Fix For: 3.3.3



Scenario:
1. 2 of the 3 ZK nodes are online.
2. The third node is attempting to join.
3. The third node unnecessarily goes into the "LEADING" state.
4. The third node then goes back to LOOKING (no majority of followers) and finally goes into the FOLLOWING state.


While going through the logs I noticed that a peer C that is trying to
join an already formed cluster goes into the LEADING state. This happens
because the QuorumCnxManager of A and B sends the entire history of
notification messages to C, so C receives the notification messages that
A and B exchanged while they were forming the cluster.

Due to the following piece of code in FastLeaderElection.lookForLeader(),
C quits lookForLeader() assuming that it is supposed to lead.

    //If have received from all nodes, then terminate
    if ((self.getVotingView().size() == recvset.size()) &&
            (self.getQuorumVerifier().getWeight(proposedLeader) != 0)) {
        self.setPeerState((proposedLeader == self.getId()) ?
                ServerState.LEADING : learningState());
        leaveInstance();
        return new Vote(proposedLeader, proposedZxid);

    } else if (termPredicate(recvset,


This can cause:
1. C to unnecessarily go into the LEADING state, wait for tickTime * initLimit, and then restart FLE.

2. C to wait for 200 ms (finalizeWait) and then make a decision based on
whatever notifications it has received so far. C could potentially
decide to follow an old leader, fail to connect to that leader, and
then restart FLE. See the code below.

    if (termPredicate(recvset,
            new Vote(proposedLeader, proposedZxid,
                    logicalclock))) {

        // Verify if there is any change in the proposed leader
        while ((n = recvqueue.poll(finalizeWait,
                TimeUnit.MILLISECONDS)) != null) {
            if (totalOrderPredicate(n.leader, n.zxid,
                    proposedLeader, proposedZxid)) {
                recvqueue.put(n);
                break;
            }
        }



In general, this does not affect the correctness of FLE, since C will
eventually go back to the FOLLOWING state (A and B won't vote for
C). However, it delays C from joining the cluster, which can in turn
affect the recovery time of an application.


Proposal: A and B should send only the latest (most recent) notification
instead of the entire history. Does this sound reasonable?
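
To make the proposal concrete, here is a minimal, hypothetical sketch (all
class, field, and method names below are invented for illustration; this is
not the actual QuorumCnxManager code): remember only the most recent
notification per peer and, when a peer (re)connects, enqueue just that one
message instead of replaying the whole send history.

    import java.util.Map;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of the proposal above, not real ZooKeeper code.
    public class LatestNotificationSketch {

        static final class Notification {
            final long leader;
            final long zxid;

            Notification(long leader, long zxid) {
                this.leader = leader;
                this.zxid = zxid;
            }
        }

        // Most recent notification sent to each peer, keyed by server id.
        private final Map<Long, Notification> lastNotificationSent =
                new ConcurrentHashMap<>();
        // One small outgoing queue per peer; capacity 1 keeps only the freshest.
        private final Map<Long, BlockingQueue<Notification>> sendQueues =
                new ConcurrentHashMap<>();

        void send(long sid, Notification n) {
            lastNotificationSent.put(sid, n); // remember only the latest
            enqueue(sid, n);
        }

        // Called when peer `sid` (re)connects: resend the single latest
        // notification rather than the entire history.
        void onPeerConnected(long sid) {
            Notification last = lastNotificationSent.get(sid);
            if (last != null) {
                enqueue(sid, last);
            }
        }

        private void enqueue(long sid, Notification n) {
            BlockingQueue<Notification> q = sendQueues.computeIfAbsent(
                    sid, k -> new ArrayBlockingQueue<>(1));
            // Keep only the freshest notification: drop a stale queued one.
            if (!q.offer(n)) {
                q.poll();
                q.offer(n);
            }
        }
    }

With a capacity-1 queue per peer, a disconnected peer never accumulates
stale history; on reconnect it is handed at most one, freshest notification.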







[jira] [Updated] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K updated ZOOKEEPER-975:
-------------------------------

    Attachment: ZOOKEEPER-975.patch4

Attaching patch with test.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015975#comment-13015975 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

Do you think we need a test for this? I was looking through the code to see how we can write a test. What we can do is insert notifications into recvqueue for a peer, then call lookForLeader(), and monitor the state/proposedZxid/proposedLeader/etc. This will let us feed whatever notifications we want to FLE. The other peers should just ignore the notifications (or send the notifications that we want them to send).

However, for this we will have to make changes to FastLeaderElection so that one can overload its Messenger, modify recvqueue, and set proposedLeader, proposedZxid, etc. from the test. I think this will be a good change in general, since it lets us feed notifications to a peer and test corner cases, but it is a bit time-consuming. I am not sure how much it will help for this particular bug, though. What do you think?
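
As a rough illustration of that test idea, assuming FastLeaderElection were
refactored as described (all types and names below are simplified stand-ins,
not the real FastLeaderElection API):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical test harness for the idea above.
    public class FLENotificationInjectionSketch {

        static final class Notification {
            final long leader; // proposed leader id
            final long zxid;   // proposed leader's zxid
            final long sid;    // sender's server id

            Notification(long leader, long zxid, long sid) {
                this.leader = leader;
                this.zxid = zxid;
                this.sid = sid;
            }
        }

        // Stand-in for the queue that FLE's Messenger normally fills from
        // the network; the test fills it directly instead.
        final BlockingQueue<Notification> recvqueue = new LinkedBlockingQueue<>();

        public static void main(String[] args) throws InterruptedException {
            FLENotificationInjectionSketch peer = new FLENotificationInjectionSketch();
            // Feed exactly the notifications we want FLE to see: A (sid=1) and
            // B (sid=2) both already vote for leader 1 with zxid 100.
            peer.recvqueue.put(new Notification(1, 100, 1));
            peer.recvqueue.put(new Notification(1, 100, 2));
            // A real test would now run lookForLeader() on the peer under test
            // and assert that it ends up FOLLOWING leader 1, never LEADING.
        }
    }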

-Vishal


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019429#comment-13019429 ] 

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12476255/ZOOKEEPER-975.patch5
  against trunk revision 1091841.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/226//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/226//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/226//console

This message is automatically generated.


[jira] Updated: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K updated ZOOKEEPER-975:
-------------------------------

    Attachment: ZOOKEEPER-975.patch

Hi Flavio,

Here's a simple fix that I think does the trick. Let me know what you think.



[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987494#action_12987494 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal, The main problem I see is that the patch you propose removes lastMessageSent (ZOOKEEPER-481), which was introduced to deal with a problem observed in ZOOKEEPER-475 and also discussed in ZOOKEEPER-480.




[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019118#comment-13019118 ] 

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12476190/ZOOKEEPER-975.patch4
  against trunk revision 1091314.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The patch appears to cause tar ant target to fail.

    -1 findbugs.  The patch appears to cause Findbugs (version 1.3.9) to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/225//testReport/
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/225//console

This message is automatically generated.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024915#comment-13024915 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

The findbugs failure here is due to ZOOKEEPER-1052, as pointed out by Flavio.


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990064#comment-12990064 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

Can you describe this problem to me? I will check whether the problem still exists if this patch is applied after the patch for ZOOKEEPER-932. Thanks.


[jira] [Updated] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K updated ZOOKEEPER-975:
-------------------------------

    Attachment: ZOOKEEPER-975.patch6

Fixed the findbugs warning.


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993551#comment-12993551 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal, I'm under the impression that returning the current state only works for LEADING or FOLLOWING. While LOOKING, we would need to return the latest vote, yes? Overall, it is sound to return the current state instead of keeping the latest message.
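
A minimal sketch of that idea, with invented names rather than the real
FastLeaderElection API: a peer that is already LEADING or FOLLOWING answers
a LOOKING joiner with its current vote, while a peer that is itself LOOKING
replies with its latest proposal.

    // Hypothetical sketch; not real ZooKeeper code.
    public class CurrentVoteReplySketch {

        enum State { LOOKING, FOLLOWING, LEADING }

        static final class Vote {
            final long leader;
            final long zxid;

            Vote(long leader, long zxid) {
                this.leader = leader;
                this.zxid = zxid;
            }
        }

        State state = State.FOLLOWING;        // this peer's current state
        Vote currentVote = new Vote(1, 100);  // the leader it follows (or leads)
        Vote proposedVote = new Vote(3, 90);  // stale election-time proposal

        Vote replyToLookingPeer() {
            if (state == State.LEADING || state == State.FOLLOWING) {
                // Ensemble already formed: advertise the established leader.
                return currentVote;
            }
            // Still electing: the latest proposal is the freshest information.
            return proposedVote;
        }
    }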

About maintenance, we talked some time back about maintaining only the TCP version of FLE (FLE+QCM). There was never any real pressure to eliminate the others, and in fact some users were previously still using LE. I'm all for maintaining only FLE, but we need to hear the opinion of others.

More thoughts?


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024920#comment-13024920 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Does anyone know why we are getting this javadoc warning? ZOOKEEPER-1052 also got -1 on javadoc, and I'm not sure what the problem is. Any hint would be welcome.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014532#comment-13014532 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal, Apart from not having a test, it is +1 for me. Looks good.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010989#comment-13010989 ] 

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12474564/ZOOKEEPER-975.patch2
  against trunk revision 1082362.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/199//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/199//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/199//console

This message is automatically generated.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024913#comment-13024913 ] 

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12477322/ZOOKEEPER-975.patch6
  against trunk revision 1091841.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 1 warning message.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/232//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/232//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/232//console

This message is automatically generated.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010689#comment-13010689 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal, I'll have a look at the patch once you generate a new one that applies.

Regarding your last comment, I wonder why you think it would be better to update the last zxid according to the state of the server. One problem I see is that only the leader can really keep its own last zxid up to date; the other servers don't know how far the leader has gone.

Also, the idea of keeping the last vote intact is to preserve the (leader, zxid) pair used to decide upon the current leader, and currently we don't use the zxid field to determine leadership if the notification says LEADING or FOLLOWING. Is there anything I'm missing?
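
To make that distinction concrete, here is a minimal sketch (the helper names are made up for illustration, not ZooKeeper APIs): votes from LOOKING peers are ordered by the (zxid, leader id) pair, while LEADING/FOLLOWING notifications are counted purely by the leader id they declare, with the zxid ignored.

    import java.util.Map;

    // Illustrative sketch only; supersedes()/quorumDeclares() are
    // hypothetical helpers, not the actual FastLeaderElection code.
    class LeaderCheck {
        // Notifications from LOOKING peers: the (zxid, leader id) pair
        // defines a total order, in the spirit of totalOrderPredicate().
        static boolean supersedes(long newLeader, long newZxid,
                                  long curLeader, long curZxid) {
            return newZxid > curZxid
                || (newZxid == curZxid && newLeader > curLeader);
        }

        // Notifications from LEADING/FOLLOWING peers: the zxid field is
        // ignored; we only ask whether a quorum declares the same leader id.
        static boolean quorumDeclares(Map<Long, Long> sidToLeader,
                                      long candidate, int ensembleSize) {
            int count = 0;
            for (long id : sidToLeader.values()) {
                if (id == candidate) count++;
            }
            return count > ensembleSize / 2;
        }
    }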


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012598#comment-13012598 ] 

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12474906/ZOOKEEPER-975.patch3
  against trunk revision 1082362.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/209//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/209//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/209//console

This message is automatically generated.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011244#comment-13011244 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal, here are two high-level comments:

# The patch does not have a test, but I'm not entirely convinced we should try to implement one, since it might be complex. We should think about it, though;
# I don't understand why you're removing the outofelection data structure. I believe notifications from peers that are FOLLOWING/LEADING should be treated separately from the ones from peers that are LOOKING. If a peer obtains notifications from a quorum saying that the peers are LEADING/FOLLOWING, then it should follow the leader they point to. If a peer obtains notifications from a quorum of LOOKING peers, then it should follow the protocol to select a leader. Consequently, the notifications should be treated separately (see the sketch below).
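
A minimal sketch of that separation (simplified stand-in types; the field names mirror FLE's recvset/outofelection, but this is not the actual patch): incoming notifications are routed into one of two maps depending on the sender's state, and each map is then evaluated with its own rule.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only; Vote and State are simplified stand-ins.
    class NotificationRouter {
        static class Vote {
            final long leader; final long zxid;
            Vote(long leader, long zxid) { this.leader = leader; this.zxid = zxid; }
        }
        enum State { LOOKING, FOLLOWING, LEADING }

        final Map<Long, Vote> recvset = new HashMap<Long, Vote>();       // LOOKING senders
        final Map<Long, Vote> outofelection = new HashMap<Long, Vote>(); // LEADING/FOLLOWING senders

        void onNotification(long sid, State state, Vote vote) {
            if (state == State.LOOKING) {
                recvset.put(sid, vote);        // compared via the (zxid, leader) order
            } else {
                outofelection.put(sid, vote);  // counted by declared leader id only
            }
        }
    }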


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027316#comment-13027316 ] 

Hudson commented on ZOOKEEPER-975:
----------------------------------

Integrated in ZooKeeper-trunk #1168 (See [https://builds.apache.org/hudson/job/ZooKeeper-trunk/1168/])
    ZOOKEEPER-975. new peer goes in LEADING state even if ensemble is online. (vishal via fpj)



[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984722#action_12984722 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

What is the motivation for sending the entire history of notifications to the joining peer? Shouldn't the most recent notification (or just the current state) be enough? I understand this is a performance issue, but I think it is a sizeable hole:

- There could have been multiple leader elections while the node was down, and the node could end up hopping across leaders until it reaches the correct one.

- Suppose we have a 3-node cluster, and a simple client connects to A and creates znode_A to indicate that A (and the client) is online. Leader A then disconnects from B and C, causing C to take over leadership. When A tries to rejoin the cluster, it can be unnecessarily delayed by this bug. If an application takes some action when znode_A is unavailable, this bug can trigger that action unnecessarily. We are facing this problem in our application.

I think it would be a small change to QCM. What do you think?



[jira] [Assigned] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K reassigned ZOOKEEPER-975:
----------------------------------

    Assignee: Vishal K


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010522#comment-13010522 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Note that there may be another problem in FLE unrelated to this bug. Originally, instead of sending lastMessage from QCM, I wanted to send a notification carrying the current state. But I noticed that a peer does not send its real current state in FastLeaderElection.java even when it is not LOOKING.
Here's the relevant code:

                                /*
                                 * If this server is not looking, but the one that sent the ack
                                 * is looking, then send back what it believes to be the leader.
                                 */
                                Vote current = self.getCurrentVote();
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    if(LOG.isDebugEnabled()){
                                        LOG.debug("Sending new notification. My id =  " +
                                                self.getId() + ", Recipient = " +
                                                response.sid + " zxid =" +
                                                current.zxid + " leader=" +
                                                current.id);
                                    }
                                    ToSend notmsg = new ToSend(
                                            ToSend.mType.notification,
                                            current.id,
                                            current.zxid, // <== zxid of the current vote, which may be stale
                                            logicalclock,
                                            self.getPeerState(),
                                            response.sid);
                                    sendqueue.offer(notmsg);
                                }

It sends the zxid of the current vote, not the current zxid seen by the server, and the vote's zxid can be stale. So I dropped that change from QCM and instead changed the size of sendQueue to 1. I think fixing the above code is a separate issue and should be done later, if needed.
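
For illustration, here is a minimal sketch of the "queue of size 1" idea (a hypothetical helper, not the actual QuorumCnxManager patch): a bounded queue where offering a new notification evicts the unsent older one, so a reconnecting peer is only ever handed the latest message instead of the whole history.

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;

    // Sketch only: LastMessageQueue/insertLatest are made-up names.
    class LastMessageQueue {
        private final ArrayBlockingQueue<ByteBuffer> queue =
                new ArrayBlockingQueue<ByteBuffer>(1); // at most one pending message

        void insertLatest(ByteBuffer msg) {
            while (!queue.offer(msg)) { // full: evict the stale notification
                queue.poll();
            }
        }

        ByteBuffer next() throws InterruptedException {
            return queue.take(); // consumer only ever sees the most recent message
        }
    }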


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010520#comment-13010520 ] 

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12474451/ZOOKEEPER-975.patch
  against trunk revision 1082362.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/195//console

This message is automatically generated.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011639#comment-13011639 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

The problem I see with mixing notifications from LOOKING peers and FOLLOWING/LEADING peers is the following. Peers that are either LEADING or FOLLOWING won't change their leader state based on new notifications. If a peer receives notifications from a quorum and one of those notifications says FOLLOWING/LEADING, then the peer cannot count on the sender of that notification to move to a new leader. The election will fail and the peer (perhaps along with others) will have to start over.

Also, when a peer gets disconnected and comes back while a quorum is following a leader, it needs to be able to determine who the leader is even if its leader election round is higher than the others'. In this case, it has to follow a leader based on FOLLOWING/LEADING notifications, without comparing the votes in the notifications.

For these reasons, we have kept FOLLOWING/LEADING notifications in a separate data structure. It is true, though, that we could keep all of them in the same structure and simply filter on the state field when processing (a sketch of the round-independent check follows).
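
A hedged sketch of the point about rounds (hypothetical names; the real logic lives in lookForLeader()): when counting FOLLOWING/LEADING notifications, the peer's own election round is deliberately not consulted, because a rejoining peer may have bumped its round past the stable ensemble's.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only; not the actual lookForLeader() code.
    class OutOfElectionCheck {
        // Returns the leader id that a quorum of LEADING/FOLLOWING peers
        // declares, or null if no such quorum exists. The caller's own
        // election round is never compared: even a peer with a higher
        // round must follow the ensemble's established leader.
        static Long establishedLeader(Map<Long, Long> sidToDeclaredLeader,
                                      int ensembleSize) {
            Map<Long, Integer> tally = new HashMap<Long, Integer>();
            for (Long leader : sidToDeclaredLeader.values()) {
                int c = tally.containsKey(leader) ? tally.get(leader) + 1 : 1;
                tally.put(leader, c);
                if (c > ensembleSize / 2) {
                    return leader; // a quorum points at this leader
                }
            }
            return null;
        }
    }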


[jira] [Updated] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K updated ZOOKEEPER-975:
-------------------------------

    Attachment: ZOOKEEPER-975.patch3

- added outofelection
- small additional change to drop notifications that do not come from peers in the voting view.

Testing done:
- junit
- systest
- rebooted peers several times

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Assignee: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch, ZOOKEEPER-975.patch, ZOOKEEPER-975.patch2, ZOOKEEPER-975.patch3


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012196#comment-13012196 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

I think a peer should check whether a received notification originated from a cluster member. This check is currently done for LOOKING notifications, but only after proposedLeader, zxid, and epoch have already been modified. No such verification is performed for LEADING/FOLLOWING notifications.

I am planning to add this check at the very beginning:

                else if(self.getVotingView().containsKey(n.sid)) {
                    switch (n.state) {
                    case LOOKING:
                     [...]
                }

Do you see any issues with this (especially while using OBSERVER_ID)?
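
For concreteness, here is a minimal sketch of the guard I have in mind, placed before any election state is modified; the switch body is elided, and how exactly to handle non-voting senders (e.g. observers) is still open:

{noformat}
// Sketch only: check cluster membership before any election state
// (proposedLeader, proposedZxid, logicalclock) is touched.
if (!self.getVotingView().containsKey(n.sid)) {
    // Sender is not a voting member (e.g. an observer): ignore the
    // notification, or answer it directly as WorkerReceiver does.
} else {
    switch (n.state) {
    case LOOKING:
        // [...] existing LOOKING handling
        break;
    case FOLLOWING:
    case LEADING:
        // [...] existing handling, now also covered by the check
        break;
    }
}
{noformat}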





> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Assignee: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch, ZOOKEEPER-975.patch, ZOOKEEPER-975.patch2


[jira] [Updated] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K updated ZOOKEEPER-975:
-------------------------------

    Attachment: ZOOKEEPER-975.patch5

Resubmitting the patch. Verified that the patch works on trunk and that all unit tests pass.

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Assignee: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch, ZOOKEEPER-975.patch, ZOOKEEPER-975.patch2, ZOOKEEPER-975.patch3, ZOOKEEPER-975.patch4, ZOOKEEPER-975.patch5


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991288#comment-12991288 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

OK, I will re-add lastMessage. How about checking whether the send queue is empty before sending lastMessage? If it is empty, we send lastMessage; otherwise, we send the message from the queue. This will avoid sending stale messages to the peer.

So the final fix will be to:
1. Change CAPACITY to 1.
2. Send lastMessage to A only if the send queue for A is empty.

This will ensure that we send the most recent notification while still handling lost messages as before (see the sketch below).

Does this sound ok?
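
A rough sketch of what I mean, with illustrative names (sendQueue, lastMessageSent, send()) standing in for the exact trunk identifiers:

{noformat}
// 1. Per-peer send queue of capacity 1: a newer notification displaces
//    the older one, so a reconnecting peer never sees a backlog of
//    stale votes.
ArrayBlockingQueue<ByteBuffer> sendQueue =
        new ArrayBlockingQueue<ByteBuffer>(1);

// 2. On (re)connection to server sid, resend the last message only if
//    nothing newer is already queued for it.
ByteBuffer last = lastMessageSent.get(sid);
if (sendQueue.isEmpty() && last != null) {
    send(last);   // still covers the lost-message case
}
{noformat}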

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017172#comment-13017172 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal, what if we have a test with 3 servers, with ids 1, 2, and 3? We start 1 and 2, and let 2 be elected. Once 2 is elected, we start 3. With the current trunk code, 3 should declare itself LEADING, since it will receive the initial notifications of the other two processes. With your patch, this situation shouldn't happen.

What do you think?
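
Something along these lines, where startServer(), waitForState(), and assertNeverBecomesLeader() are placeholders for whatever the test harness provides:

{noformat}
// Hypothetical outline of the test described above.
startServer(1);
startServer(2);
waitForState(2, ServerState.LEADING);      // let 2 be elected

startServer(3);
assertNeverBecomesLeader(3);               // with the patch, 3 must not
waitForState(3, ServerState.FOLLOWING);    // pass through LEADING
{noformat}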

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Assignee: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch, ZOOKEEPER-975.patch, ZOOKEEPER-975.patch2, ZOOKEEPER-975.patch3


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991486#comment-12991486 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal, Sounds like a good way to patch it.

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004273#comment-13004273 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

I have a patch for this, but it sits on top of the fix for ZOOKEEPER-932. We have 932 applied to our ZK code since we need it. Until ZOOKEEPER-932 is reviewed and committed, I will have to keep back-porting patches (and doing double testing). I will port my changes to trunk if someone requires a fix for this bug. Since this is not a blocker, I am going to hold off on the patch until 932 is reviewed. That will reduce my testing and porting overhead. Does that sound OK?

The patch I have is good only for FLE.

{quote}
About maintenance, we talked some time back about maintaining only the TCP version of FLE (FLE+QCM). There was never any real pressure to eliminate the others, and in fact some users were previously still using LE. I'm all for maintaining only FLE, but we need to hear the opinion of others. More thoughts?
{quote}

The documentation says: "The implementations of leader election 1 and 2 are currently not supported, and we have the intention of deprecating them in the near future. Implementations 0 and 3 are currently supported, and we plan to keep supporting them in the near future. To avoid having to support multiple versions of leader election unnecessarily, we may eventually consider deprecating algorithm 0 as well, but we will plan according to the needs of the community."

Is there a significant advantage of using LE 0 vs LE 3?

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993428#comment-12993428 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

Do you agree that sending the current state is a better approach?

I was looking at the code to add a method createNotificationMessage() that would return a notification message based on the current state of the server. QCM could call this method and send the resulting notification at the place where we currently send lastMessage. However, adding it to FastLeaderElection alone won't be enough, since there are other leader election algorithms as well. The Election interface is very minimal. Should we add this method to that interface? Which leader election algorithms do we support, and is it OK to make this change only for the supported algorithms?
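
Roughly, the method I have in mind would look like this; a sketch only, mirroring the fields WorkerReceiver already uses when it answers observers (the method itself does not exist yet):

{noformat}
// Sketch: build a notification from the server's *current* state
// instead of replaying old queued messages.
public ToSend createNotificationMessage(long destSid) {
    Vote current = self.getCurrentVote();
    return new ToSend(ToSend.mType.notification,
            current.id,           // currently known/proposed leader
            current.zxid,         // zxid associated with that vote
            logicalclock,         // current election epoch
            self.getPeerState(),  // LOOKING / FOLLOWING / LEADING
            destSid);             // recipient server id
}
{noformat}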

Thanks
-Vishal

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011330#comment-13011330 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

{quote}   1. It does not have a test, but I'm not entirely convinced we should try to implement one, since it might be complex. We should think about it, though;
{quote}

Yes, I didn't include a test. We should have one, but I don't think I can get to it in the next couple of days or so. I was thinking of feeding recvqueue with dummy Notifications for the scenarios we want to test, then calling FastLeaderElection.lookForLeader() and verifying the outcome. The tricky part is to come up with the right sequence of Notifications to cover all the corner cases.
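
A skeleton of what such a test could look like; createElectionUnderTest() and makeNotification() are assumed test hooks we would have to add, as is direct access to recvqueue:

{noformat}
// Hypothetical skeleton: feed lookForLeader() a crafted history.
FastLeaderElection fle = createElectionUnderTest();  // assumed helper
// Replay the stale LOOKING votes that A (sid 1) and B (sid 2)
// exchanged while forming the cluster, both proposing sid 2.
fle.recvqueue.put(makeNotification(1, 2, zxid, epoch, ServerState.LOOKING));
fle.recvqueue.put(makeNotification(2, 2, zxid, epoch, ServerState.LOOKING));
Vote v = fle.lookForLeader();
assertEquals(2, v.id);   // C must decide to follow 2, not lead
{noformat}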

{quote}
   2. I don't understand why you're removing the outofelection data structure. I believe the notifications from peers that are FOLLOWING/LEADING should be treated separately from the ones of peers that are LOOKING. If a peer obtains notifications from a quorum saying that the peers are LEADING/FOLLOWING, then it should follow the leader they point to. If a peer obtains notifications from a quorum of LOOKING peers, then it should follow the protocol to select a leader. Consequently, the notifications should be treated separately.
{quote}

I wasn't sure about that change either. I don't mind reintroducing outofelection, but I couldn't think of a scenario where inserting Notifications into recvset instead of outofelection would be a problem. Considering that termPredicate verifies that a majority vote for the same <epoch, sid, zxid>, can you describe a scenario where this change could cause a problem?

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Assignee: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch, ZOOKEEPER-975.patch, ZOOKEEPER-975.patch2


[jira] Updated: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-975:
------------------------------------

    Fix Version/s:     (was: 3.3.3)
                   3.4.0

moved to 3.4

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>             Fix For: 3.4.0



[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025877#comment-13025877 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

According to the console output, the problem is not with the patch:

{noformat}
[javadoc] javadoc: warning - Error fetching URL: http://java.sun.com/javase/6/docs/api/package-list
{noformat}

so it is +1 for me. If no one has anything against it, I'll commit it later.

Thanks, Vishal.

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Assignee: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch, ZOOKEEPER-975.patch, ZOOKEEPER-975.patch2, ZOOKEEPER-975.patch3, ZOOKEEPER-975.patch4, ZOOKEEPER-975.patch5, ZOOKEEPER-975.patch6


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012666#comment-13012666 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal, in the run() method of WorkerReceiver we already perform a check:

{noformat}
                        /*
                         * If it is from an observer, respond right away.
                         * Note that the following predicate assumes that
                         * if a server is not a follower, then it must be
                         * an observer. If we ever have any other type of
                         * learner in the future, we'll have to change the
                         * way we check for observers.
                         */
                        if(!self.getVotingView().containsKey(response.sid)){
                            Vote current = self.getCurrentVote();
                            ToSend notmsg = new ToSend(ToSend.mType.notification,
                                    current.id,
                                    current.zxid,
                                    logicalclock,
                                    self.getPeerState(),
                                    response.sid);

                            sendqueue.offer(notmsg);
{noformat}

I'll be reviewing your patch shortly.


> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Assignee: Vishal K
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-975.patch, ZOOKEEPER-975.patch, ZOOKEEPER-975.patch2, ZOOKEEPER-975.patch3


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011331#comment-13011331 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

correction: "I wasn't sure about that change either"


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983331#action_12983331 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Thanks for bringing this up, Vishal. This is not a new observation, although I can't remember whether we discussed it in a jira or not. In general, I'm lukewarm about this change. Avoiding the server's detour into LEADING before it settles correctly into LOOKING is certainly harmless in itself, but I'm not entirely comfortable with manipulating the notification queues. Having two servers concurrently believe they are leading is a situation our protocols already support, so such a modification would be an optimization that skips the unnecessary LEADING step rather than a correctness fix.

Regarding application recovery time, we don't have a load-balancing scheme at this point (which could be quite useful), so bringing a new follower up does not guarantee that clients will move their sessions to it. Note that this situation occurs only when an ensemble is already running and a server joins or recovers.


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987517#action_12987517 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio, 

Do you think that this will be a problem even after we have the patch  for ZOOKEEPER-932?

This is what ZOOKEEPER-475 describes:
----------
* Replica 1 sends a message to itself and to Replica 2 stating that its current vote is for replica 1;
* Replica 2 sends a message to itself and to Replica 1 stating that its current vote is for replica 2;
* Replica 1 updates its vote and sends a message to itself stating that its current vote is for replica 2;
* Since replica 1 has two votes for 2 in an ensemble of 3 replicas, replica 1 decides to follow 2.

The problem is that replica 2 does not receive a message from 1 stating that it 
changed its vote to 2, which prevents 2 from becoming a leader. Now looking 
more carefully at why that happened, you can see that when 1 tries to send a 
message to 2, QuorumCnxManager in 1 is both shutting down a connection to 2 at 
the same time that it is trying to open a new one. The incorrect 
synchronization prevents the creation of a new connection, and 1 and 2 end up 
not connected.   
----------

We no longer have incorrect synchronization. We can still have QCM in 1 shutting down the connection to 2 while it is trying to send a notification to 2. However, the only time 1 will shut down a connection to 2 is when it receives a new connection request from 2 (or when something is wrong with the connection), and a new connection request arrives exactly when 2 is trying to send a notification to 1. As a result, 1 will end up sending a notification to 2 saying that it is following 2. Do you agree?
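
As a rough illustration of the connection handling described above (a hedged sketch only, not the actual QuorumCnxManager code; the class and method names are invented), the arbitration could look like this: an existing connection to a peer is torn down only when that peer initiates a fresh connection, and the reconnecting peer then sends its current notification on the new connection, so a notification lost on the old connection is effectively superseded.

    import java.net.Socket;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class ConnectionArbitrationSketch {
        private final Map<Long, Socket> connections = new ConcurrentHashMap<>();

        // Called when peer `sid` opens a new connection to us.
        synchronized void receiveConnection(long sid, Socket incoming) throws Exception {
            Socket old = connections.remove(sid);
            if (old != null) {
                old.close(); // drop the stale connection; the peer is reconnecting
            }
            connections.put(sid, incoming);
            // The reconnecting peer sends its current vote on the new
            // connection, replacing whatever was lost on the old one.
        }
    }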


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991939#comment-12991939 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

Re-adding lastMessage results in the delay reported in this bug. When node 2 attempts to join a cluster of nodes 1 and 0, nodes 1 and 0 each send a "LOOKING" notification followed by a "FOLLOWING"/"LEADING" notification. After receiving the first pair of LOOKING notifications, 2 goes into the LEADING state.

I think I have a better idea. Instead of sending lastMessage, how about we send a notification reflecting the current state of the peer (just as we do in FastLeaderElection.java)? I think this will resolve both problems.
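
A minimal sketch of this proposal, under the assumption that a peer tracks its current vote and state (the types and names below are illustrative, not the real ZooKeeper API): on a fresh connection, build one notification from the peer's current state instead of replaying queued history.

    // Illustrative types only; the real logic lives in
    // FastLeaderElection/QuorumCnxManager.
    enum ServerState { LOOKING, FOLLOWING, LEADING }

    class Notification {
        final long leader;
        final long zxid;
        final ServerState state;
        Notification(long leader, long zxid, ServerState state) {
            this.leader = leader;
            this.zxid = zxid;
            this.state = state;
        }
    }

    class CurrentStateSketch {
        private volatile long currentVoteLeader;
        private volatile long currentVoteZxid;
        private volatile ServerState state = ServerState.LOOKING;

        // Invoked when a new peer connects: instead of resending stale
        // history, synthesize one message reflecting where this server
        // stands right now.
        Notification notificationForNewPeer() {
            return new Notification(currentVoteLeader, currentVoteZxid, state);
        }
    }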


[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990955#comment-12990955 ] 

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal. This is a general concurrency issue that arises when using TCP connections, and I'm not sure we can avoid it, even though our protocol is in principle resilient to such message losses. Back to the TCP issue I was mentioning: we remove the message from the queue of messages to send and write it to the socket. However, there is no guarantee that the message will get through, since the destination might drop the connection while the message is in transit. The sender has no way to know whether it really arrived, unless we add an acknowledgment scheme on top of TCP, which sounds like overkill. The way we found to avoid this case in one of the jiras I mentioned above was to resend the last message the peer dequeued for a given destination. That is exactly the mechanism you're proposing to remove.
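
To make the mechanism concrete, here is a hedged sketch (invented names, not the actual QuorumCnxManager code) of remembering and replaying the last dequeued message, since TCP gives no delivery guarantee once a message leaves the send queue:

    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;

    class ResendLastMessageSketch {
        private final Map<Long, ArrayBlockingQueue<ByteBuffer>> sendQueues = new ConcurrentHashMap<>();
        private final Map<Long, ByteBuffer> lastMessageSent = new ConcurrentHashMap<>();

        // Sender loop: record each message before writing it to the socket;
        // the write may be lost if the peer drops the connection in transit.
        // Assumes a queue has already been created for this sid.
        ByteBuffer dequeueForSend(long sid) throws InterruptedException {
            ByteBuffer msg = sendQueues.get(sid).take();
            lastMessageSent.put(sid, msg);
            return msg;
        }

        // On (re)connection with nothing queued, replay the last message that
        // may or may not have reached the peer; a duplicate notification is
        // harmless to FLE, while a lost one can stall it.
        ByteBuffer messageToReplayOnReconnect(long sid) {
            ArrayBlockingQueue<ByteBuffer> q = sendQueues.get(sid);
            if (q != null && !q.isEmpty()) {
                return null; // fresh messages pending; no replay needed
            }
            return lastMessageSent.get(sid);
        }
    }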


[jira] [Updated] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K updated ZOOKEEPER-975:
-------------------------------

    Attachment: ZOOKEEPER-975.patch2

Attaching patch on trunk for review/comments.


[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012161#comment-13012161 ] 

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Thanks for the clarification. I will reintroduce a separate data structure for LEADING/FOLLOWING notifications.


[jira] [Updated] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K updated ZOOKEEPER-975:
-------------------------------

    Attachment: ZOOKEEPER-975.patch

I am attaching the patch for review only. This patch is not for final
submission.

I have not written a test yet; I will add a test and resubmit the patch. This patch was well tested along with the patch for ZOOKEEPER-932, but I need to do more testing after separating these changes from 932. All junit tests pass.

Main changes done:

FastLeaderElection.java
- Removed erroneous exit conditions

QuorumCnxManager.java
- Restricted the send queue size to 1 so that we don't send too many stale messages
- Added functions to insert into an ArrayBlockingQueue without blocking when the queue is full (see the sketch after this list)
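
The following is a minimal sketch of that non-blocking insert (the class and method names are invented for illustration; the actual patch may differ): with the queue capped at one entry, a newer notification displaces the stale one instead of blocking the caller.

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;

    class BoundedSendQueueSketch {
        // Capacity 1: only the most recent notification per peer is worth sending.
        private final ArrayBlockingQueue<ByteBuffer> sendQueue = new ArrayBlockingQueue<>(1);

        // Insert without ever blocking: if the queue is full, drop the oldest
        // (stale) message and retry with the new one. With a single producer
        // per peer, this loop terminates after at most one retry.
        void insertNonBlocking(ByteBuffer msg) {
            while (!sendQueue.offer(msg)) {
                sendQueue.poll(); // discard the stale notification
            }
        }
    }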
