You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by lvfangmin <gi...@git.apache.org> on 2018/11/17 06:35:19 UTC

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

GitHub user lvfangmin opened a pull request:

    https://github.com/apache/zookeeper/pull/703

    [ZOOKEEPER-1818] Correctly handle potential inconsistent zxid/electionEpoch and peerEpoch during leader election

    This is similar to the 3.4 implementation in Jira ZOOKEEPER-1817, main differences with that:
    
    1. removed bcVote, in Vote.equals it will skip compare peerEpoch when one of the version is 0x0.
    2. added detailed scenarios which we tried to solve with peerEpoch update and the change in Vote.equals.
    3. removed ooePredicate with inlined one, since master code is tracking voteSet..
    4. improved the tests to make it cleaner.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-1818

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/703.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #703
    
----
commit ccccac2aa6147e5433567a4be433089c35a30a3d
Author: Fangmin Lyu <fa...@...>
Date:   2018-11-15T17:46:51Z

    Correctly handle potential inconsitent zxid/electionEpoch and peerEpoch during leader election

----


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r238742520
  
    --- Diff: zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/FLEOutOfElectionTest.java ---
    @@ -0,0 +1,136 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.zookeeper.server.quorum;
    +
    +import java.io.File;
    +import java.net.InetSocketAddress;
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import org.apache.zookeeper.PortAssignment;
    +import org.apache.zookeeper.server.quorum.QuorumPeer;
    +import org.apache.zookeeper.server.quorum.FastLeaderElection.Notification;
    +import org.apache.zookeeper.server.quorum.Vote;
    +import org.apache.zookeeper.server.quorum.QuorumPeer.QuorumServer;
    +import org.apache.zookeeper.server.quorum.QuorumPeer.ServerState;
    +import org.apache.zookeeper.server.util.ZxidUtils;
    +import org.apache.zookeeper.test.ClientBase;
    +import org.apache.zookeeper.test.FLETest;
    +import org.junit.After;
    +import org.junit.Assert;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +/**
    + * Test FastLeaderElection with out of election servers.
    + */
    +public class FLEOutOfElectionTest {
    +
    +    private FastLeaderElection fle;
    +
    +    @Before
    +    public void setUp() throws Exception {
    +        File tmpdir = ClientBase.createTmpDir();
    +        Map<Long, QuorumServer> peers = new HashMap<Long,QuorumServer>();
    +        for(int i = 0; i < 5; i++) {
    +            peers.put(Long.valueOf(i), new QuorumServer(Long.valueOf(i), 
    +                    new InetSocketAddress("127.0.0.1", PortAssignment.unique())));
    +        }
    +        QuorumPeer peer = new QuorumPeer(peers, tmpdir, tmpdir, 
    +                PortAssignment.unique(), 3, 3, 1000, 2, 2);
    +        fle = new FastLeaderElection(peer, peer.createCnxnManager());
    +    }
    +
    +    @Test
    +    public void testIgnoringZxidElectionEpoch() {
    +        Map<Long, Vote> votes = new HashMap<Long, Vote>();
    +        votes.put(0L, new Vote(0x1, 4L, ZxidUtils.makeZxid(1, 1), 1, 2, ServerState.FOLLOWING));
    +        votes.put(1L, new Vote(0x1, 4L, ZxidUtils.makeZxid(1, 2), 1, 2, ServerState.FOLLOWING));
    +        votes.put(3L, new Vote(0x1, 4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.FOLLOWING));
    +        votes.put(4L, new Vote(0x1, 4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.LEADING));
    +
    +        Assert.assertTrue(fle.getVoteTracker(votes, 
    +                new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.FOLLOWING)).hasAllQuorums());
    +    }
    +
    +    @Test
    +    public void testElectionWIthDifferentVersion() {
    +        Map<Long, Vote> votes = new HashMap<Long, Vote>();
    +        votes.put(0L, new Vote(0x1, 4L, ZxidUtils.makeZxid(1, 1), 1, 1, ServerState.FOLLOWING));
    +        votes.put(1L, new Vote(0x1, 4L, ZxidUtils.makeZxid(1, 1), 1, 1, ServerState.FOLLOWING));
    +        votes.put(3L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.FOLLOWING));
    +        votes.put(4L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.LEADING));
    +
    +        Assert.assertTrue(fle.getVoteTracker(votes, 
    +                new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.FOLLOWING)).hasAllQuorums());
    +    }
    +
    +    @Test
    +    public void testLookingNormal() {
    +        Map<Long, Vote> votes = new HashMap<Long, Vote>();
    +        votes.put(0L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LOOKING));
    +        votes.put(1L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LOOKING));
    +        votes.put(3L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LOOKING));
    +        votes.put(4L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LEADING));
    +
    +        Assert.assertTrue(fle.getVoteTracker(votes, 
    +                new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LOOKING)).hasAllQuorums());
    +    }
    +
    +    @Test
    +    public void testLookingDiffRounds() {
    +        HashMap<Long, Vote> votes = new HashMap<Long, Vote>();
    +        votes.put(0L, new Vote(4L, ZxidUtils.makeZxid(1, 1), 1, 1, ServerState.LOOKING));
    +        votes.put(1L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.LOOKING));
    +        votes.put(3L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 3, 2, ServerState.LOOKING));
    +        votes.put(4L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 3, 2, ServerState.LEADING));
    +
    +        Assert.assertFalse(fle.getVoteTracker(votes,
    +                new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.LOOKING)).hasAllQuorums());
    +    }
    +
    +    @Test
    +    public void testOutofElection() {
    +        HashMap<Long,Vote> outofelection = new HashMap<Long,Vote>();
    +
    +        outofelection.put(1L, new Vote(0x0, 5, ZxidUtils.makeZxid(15, 0), 0xa, 0x17, ServerState.FOLLOWING));
    +        outofelection.put(2L, new Vote(0x0, 5, ZxidUtils.makeZxid(15, 0), 0xa, 0x17, ServerState.FOLLOWING));
    +        outofelection.put(4L, new Vote(0x1, 5, ZxidUtils.makeZxid(15, 0), 0xa, 0x18, ServerState.FOLLOWING));
    +        Vote vote = new Vote(0x1, 5, ZxidUtils.makeZxid(15, 0), 0xa, 0x18, ServerState.LEADING);
    +        outofelection.put(5L, vote);
    +
    +        Notification n = new Notification();
    +        n.version = vote.getVersion();
    +        n.leader = vote.getId();
    +        n.zxid = vote.getZxid();
    +        n.electionEpoch = vote.getElectionEpoch();
    +        n.state = vote.getState();
    +        n.peerEpoch = vote.getPeerEpoch();
    +        n.sid = 5L;
    +        
    +        // Set the logical clock to 1 on fle instance of server 3.
    +        fle.logicalclock.set(0x1);
    +
    +        Assert.assertTrue("Termination predicate failed", 
    --- End diff --
    
    Forgot to make this change last time, will update.


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r237367761
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java ---
    @@ -1058,29 +1050,20 @@ else if (validVoter(n.sid) && validVoter(n.leader)) {
                             /*
                              * Before joining an established ensemble, verify that
                              * a majority are following the same leader.
    -                         * Only peer epoch is used to check that the votes come
    -                         * from the same ensemble. This is because there is at
    -                         * least one corner case in which the ensemble can be
    -                         * created with inconsistent zxid and election epoch
    -                         * info. However, given that only one ensemble can be
    -                         * running at a single point in time and that each
    -                         * epoch is used only once, using only the epoch to
    -                         * compare the votes is sufficient.
    -                         *
    -                         * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
                              */
    -                        outofelection.put(n.sid, new Vote(n.leader,
    -                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
    -                        voteSet = getVoteTracker(
    -                                outofelection, new Vote(n.leader,
    -                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
    -                        if (voteSet.hasAllQuorums()
    -                                && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
    +                        outofelection.put(n.sid, new Vote(n.version, n.leader,
    +                                n.zxid, n.electionEpoch, n.peerEpoch, n.state));
    +                        voteSet = getVoteTracker(outofelection, new Vote(n.version, 
    +                                n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
    +
    +                        if (voteSet.hasAllQuorums() &&
    +                                checkLeader(outofelection, n.leader, n.electionEpoch)) {
                                 synchronized(this){
                                     logicalclock.set(n.electionEpoch);
                                     setPeerState(n.leader, voteSet);
                                 }
    -                            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
    +                            Vote endVote = new Vote(n.leader, n.zxid, 
    +                                    n.electionEpoch, n.peerEpoch);
    --- End diff --
    
    nit: indentation


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r238085234
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -1989,6 +1989,38 @@ private boolean updateVote(long designatedLeader, long zxid){
         /**
          * Updates leader election info to avoid inconsistencies when
          * a new server tries to join the ensemble.
    +     *
    +     * Here is the inconsistency scenario we try to solve by updating the peer 
    +     * epoch after following leader:
    +     *
    +     * Let's say we have an ensemble with 3 servers z1, z2 and z3.
    +     *
    +     * 1. z1, z2 were following z3 with peerEpoch to be 0xb8, the new epoch is 
    +     *    0xb9, aka current accepted epoch on disk.
    +     * 2. z2 get restarted, which will use 0xb9 as it's peer epoch when loading
    +     *    the current accept epoch from disk.
    +     * 3. z2 received notification from z1 and z3, which is following z3 with 
    +     *    epoch 0xb8, so it started following z3 again with peer epoch 0xb8.
    +     * 4. before z2 successfully connected to z3, z3 get restarted with new 
    +     *    epoch 0xb9.
    +     * 5. z2 will retry around a few round (default 5s) before giving up, 
    +     *    meanwhile it will report z3 as leader.
    +     * 6. z1 restarted, and looking with peer epoch 0xb9.
    +     * 7. z1 voted z3, and z3 was elected as leader again with peer epoch 0xb9.
    +     * 8. z2 successfully connected to z3 before giving up, but with peer 
    +     *    epoch 0xb8.
    +     * 9. z1 get restarted, looking for leader with peer epoch 0xba, but cannot 
    +     *    join, because z2 is reporting peer epoch 0xb8, while z3 is reporting 
    +     *    0xb9.
    +     *
    +     * By updating the election vote after actually following leader, we can 
    --- End diff --
    
    It's align with the function name which is updating the vote, although we only updated the electionEpoch here.


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    @anmolnar this is the fix for the DONTCARE on trunk, please take a look, I'll port it to 3.5 when it's being reviewed and merged.


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r237368267
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Vote.java ---
    @@ -125,11 +125,47 @@ public boolean equals(Object o) {
                 return false;
             }
             Vote other = (Vote) o;
    -        return (id == other.id
    +
    +        if ((state == ServerState.LOOKING) ||
    +                (other.state == ServerState.LOOKING)) {
    +            return (id == other.id
                         && zxid == other.zxid
                         && electionEpoch == other.electionEpoch
                         && peerEpoch == other.peerEpoch);
    -
    +        } else {
    +            /*
    +             * There are two things going on in the logic below:
    +             * 
    +             * 1. skip comparing the zxid and electionEpoch for votes for servers 
    +             *    out of election. 
    +             *    
    +             *    Need to skip those because they can be inconsistent due to  
    +             *    scenarios described in QuorumPeer.updateElectionVote. 
    +             *
    +             *    And given that only one ensemble can be running at a single point 
    +             *    in time and that each epoch is used only once, using only id and 
    +             *    epoch to compare the votes is sufficient.
    +             *
    --- End diff --
    
    maybe move the reference of {@see https://issues.apache.org/jira/browse/ZOOKEEPER-1805} from bottom to here, save some brain cycles for pattern matching between comments and issues.


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    Committed to master branch. Thanks @lvfangmin !
    Please create another pull request for branch-3.5.


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r238085247
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java ---
    @@ -1023,9 +1016,9 @@ else if (validVoter(n.sid) && validVoter(n.leader)) {
                                  */
                                 if (n == null) {
                                     setPeerState(proposedLeader, voteSet);
    -
                                     Vote endVote = new Vote(proposedLeader,
    -                                        proposedZxid, proposedEpoch);
    +                                        proposedZxid, logicalclock.get(), 
    --- End diff --
    
    @hanm for the general indention after { we use 4 white spaces, but for the statement breaking we usually use 8 white spaces, isn't that true for ZK codebase, I saw the previous code is using this rule.
    
    Let me know if this is not true, I can update those.


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r237367857
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -1989,6 +1989,38 @@ private boolean updateVote(long designatedLeader, long zxid){
         /**
          * Updates leader election info to avoid inconsistencies when
          * a new server tries to join the ensemble.
    +     *
    +     * Here is the inconsistency scenario we try to solve by updating the peer 
    +     * epoch after following leader:
    +     *
    +     * Let's say we have an ensemble with 3 servers z1, z2 and z3.
    +     *
    +     * 1. z1, z2 were following z3 with peerEpoch to be 0xb8, the new epoch is 
    +     *    0xb9, aka current accepted epoch on disk.
    +     * 2. z2 get restarted, which will use 0xb9 as it's peer epoch when loading
    +     *    the current accept epoch from disk.
    +     * 3. z2 received notification from z1 and z3, which is following z3 with 
    +     *    epoch 0xb8, so it started following z3 again with peer epoch 0xb8.
    +     * 4. before z2 successfully connected to z3, z3 get restarted with new 
    +     *    epoch 0xb9.
    +     * 5. z2 will retry around a few round (default 5s) before giving up, 
    +     *    meanwhile it will report z3 as leader.
    +     * 6. z1 restarted, and looking with peer epoch 0xb9.
    +     * 7. z1 voted z3, and z3 was elected as leader again with peer epoch 0xb9.
    +     * 8. z2 successfully connected to z3 before giving up, but with peer 
    +     *    epoch 0xb8.
    +     * 9. z1 get restarted, looking for leader with peer epoch 0xba, but cannot 
    +     *    join, because z2 is reporting peer epoch 0xb8, while z3 is reporting 
    +     *    0xb9.
    +     *
    +     * By updating the election vote after actually following leader, we can 
    --- End diff --
    
    should be 'election epoch' or 'electionEpoch' rather than 'election vote'.


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r237367977
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -1989,6 +1989,38 @@ private boolean updateVote(long designatedLeader, long zxid){
         /**
          * Updates leader election info to avoid inconsistencies when
          * a new server tries to join the ensemble.
    +     *
    +     * Here is the inconsistency scenario we try to solve by updating the peer 
    +     * epoch after following leader:
    +     *
    +     * Let's say we have an ensemble with 3 servers z1, z2 and z3.
    +     *
    +     * 1. z1, z2 were following z3 with peerEpoch to be 0xb8, the new epoch is 
    +     *    0xb9, aka current accepted epoch on disk.
    +     * 2. z2 get restarted, which will use 0xb9 as it's peer epoch when loading
    +     *    the current accept epoch from disk.
    +     * 3. z2 received notification from z1 and z3, which is following z3 with 
    +     *    epoch 0xb8, so it started following z3 again with peer epoch 0xb8.
    +     * 4. before z2 successfully connected to z3, z3 get restarted with new 
    +     *    epoch 0xb9.
    +     * 5. z2 will retry around a few round (default 5s) before giving up, 
    +     *    meanwhile it will report z3 as leader.
    +     * 6. z1 restarted, and looking with peer epoch 0xb9.
    +     * 7. z1 voted z3, and z3 was elected as leader again with peer epoch 0xb9.
    +     * 8. z2 successfully connected to z3 before giving up, but with peer 
    +     *    epoch 0xb8.
    +     * 9. z1 get restarted, looking for leader with peer epoch 0xba, but cannot 
    +     *    join, because z2 is reporting peer epoch 0xb8, while z3 is reporting 
    +     *    0xb9.
    +     *
    +     * By updating the election vote after actually following leader, we can 
    +     * avoid this kind of stuck happened.
    +     *
    +     * Btw, the zxid and electionEpoch could be inconsistent because of the same 
    +     * reason, it's better to update these as well after syncing with leader, but 
    +     * that required protocol change which is non trivial. This problem is worked 
    +     * around by skipping comparing the zxid and electionEpoch when counting for 
    +     * votes for out of election servers during looking for leader.
          * 
          * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
    --- End diff --
    
    would this be better syntax wise: {@see https://issues.apache.org/jira/browse/ZOOKEEPER-1732}


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    @lvfangmin lgtm, thanks!


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    Thanks @lvfangmin for working on this.
    
    To be honest I'm lost following how FLE works and additionally what issue is this patch intended to fix. I only ran a few blind test with manually restarting ZK instances in a 3-node ensemble and haven't seen any issue with joining so far.
    
    I'd like to do some more stress testing on this and get an understanding on how FLE works before committing. Hopefully early next week. First of all, what's the difference between current and accepted epoch and how election round is related?
    
    



---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/zookeeper/pull/703


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r237367990
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -1989,6 +1989,38 @@ private boolean updateVote(long designatedLeader, long zxid){
         /**
          * Updates leader election info to avoid inconsistencies when
          * a new server tries to join the ensemble.
    +     *
    +     * Here is the inconsistency scenario we try to solve by updating the peer 
    --- End diff --
    
    awesome comments.


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r237367684
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java ---
    @@ -1042,14 +1035,13 @@ else if (validVoter(n.sid) && validVoter(n.leader)) {
                              */
                             if(n.electionEpoch == logicalclock.get()){
                                 recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
    -                            voteSet = getVoteTracker(
    -                                    recvset, new Vote(n.leader, n.zxid,
    -                                    n.electionEpoch, n.peerEpoch, n.state));
    -                            if(voteSet.hasAllQuorums()
    -                                            && checkLeader(outofelection, n.leader, n.electionEpoch)) {
    +                            voteSet = getVoteTracker(recvset, new Vote(n.version, 
    +                                      n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
    +                            if (voteSet.hasAllQuorums() && 
    +                                    checkLeader(outofelection, n.leader, n.electionEpoch)) {
                                     setPeerState(n.leader, voteSet);
    -
    -                                Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
    +                                Vote endVote = new Vote(n.leader, 
    +                                        n.zxid, n.electionEpoch, n.peerEpoch);
    --- End diff --
    
    nit: indentation


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    May need an extra +1 from another committer, @hanm do you have time to review this?


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r237368501
  
    --- Diff: zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/FLEOutOfElectionTest.java ---
    @@ -0,0 +1,136 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.zookeeper.server.quorum;
    +
    +import java.io.File;
    +import java.net.InetSocketAddress;
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import org.apache.zookeeper.PortAssignment;
    +import org.apache.zookeeper.server.quorum.QuorumPeer;
    +import org.apache.zookeeper.server.quorum.FastLeaderElection.Notification;
    +import org.apache.zookeeper.server.quorum.Vote;
    +import org.apache.zookeeper.server.quorum.QuorumPeer.QuorumServer;
    +import org.apache.zookeeper.server.quorum.QuorumPeer.ServerState;
    +import org.apache.zookeeper.server.util.ZxidUtils;
    +import org.apache.zookeeper.test.ClientBase;
    +import org.apache.zookeeper.test.FLETest;
    +import org.junit.After;
    +import org.junit.Assert;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +/**
    + * Test FastLeaderElection with out of election servers.
    + */
    +public class FLEOutOfElectionTest {
    +
    +    private FastLeaderElection fle;
    +
    +    @Before
    +    public void setUp() throws Exception {
    +        File tmpdir = ClientBase.createTmpDir();
    +        Map<Long, QuorumServer> peers = new HashMap<Long,QuorumServer>();
    +        for(int i = 0; i < 5; i++) {
    +            peers.put(Long.valueOf(i), new QuorumServer(Long.valueOf(i), 
    +                    new InetSocketAddress("127.0.0.1", PortAssignment.unique())));
    +        }
    +        QuorumPeer peer = new QuorumPeer(peers, tmpdir, tmpdir, 
    +                PortAssignment.unique(), 3, 3, 1000, 2, 2);
    +        fle = new FastLeaderElection(peer, peer.createCnxnManager());
    +    }
    +
    +    @Test
    +    public void testIgnoringZxidElectionEpoch() {
    +        Map<Long, Vote> votes = new HashMap<Long, Vote>();
    +        votes.put(0L, new Vote(0x1, 4L, ZxidUtils.makeZxid(1, 1), 1, 2, ServerState.FOLLOWING));
    +        votes.put(1L, new Vote(0x1, 4L, ZxidUtils.makeZxid(1, 2), 1, 2, ServerState.FOLLOWING));
    +        votes.put(3L, new Vote(0x1, 4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.FOLLOWING));
    +        votes.put(4L, new Vote(0x1, 4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.LEADING));
    +
    +        Assert.assertTrue(fle.getVoteTracker(votes, 
    +                new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.FOLLOWING)).hasAllQuorums());
    +    }
    +
    +    @Test
    +    public void testElectionWIthDifferentVersion() {
    +        Map<Long, Vote> votes = new HashMap<Long, Vote>();
    +        votes.put(0L, new Vote(0x1, 4L, ZxidUtils.makeZxid(1, 1), 1, 1, ServerState.FOLLOWING));
    +        votes.put(1L, new Vote(0x1, 4L, ZxidUtils.makeZxid(1, 1), 1, 1, ServerState.FOLLOWING));
    +        votes.put(3L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.FOLLOWING));
    +        votes.put(4L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.LEADING));
    +
    +        Assert.assertTrue(fle.getVoteTracker(votes, 
    +                new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.FOLLOWING)).hasAllQuorums());
    +    }
    +
    +    @Test
    +    public void testLookingNormal() {
    +        Map<Long, Vote> votes = new HashMap<Long, Vote>();
    +        votes.put(0L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LOOKING));
    +        votes.put(1L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LOOKING));
    +        votes.put(3L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LOOKING));
    +        votes.put(4L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LEADING));
    +
    +        Assert.assertTrue(fle.getVoteTracker(votes, 
    +                new Vote(4L, ZxidUtils.makeZxid(2, 1), 1, 1, ServerState.LOOKING)).hasAllQuorums());
    +    }
    +
    +    @Test
    +    public void testLookingDiffRounds() {
    +        HashMap<Long, Vote> votes = new HashMap<Long, Vote>();
    +        votes.put(0L, new Vote(4L, ZxidUtils.makeZxid(1, 1), 1, 1, ServerState.LOOKING));
    +        votes.put(1L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.LOOKING));
    +        votes.put(3L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 3, 2, ServerState.LOOKING));
    +        votes.put(4L, new Vote(4L, ZxidUtils.makeZxid(2, 1), 3, 2, ServerState.LEADING));
    +
    +        Assert.assertFalse(fle.getVoteTracker(votes,
    +                new Vote(4L, ZxidUtils.makeZxid(2, 1), 2, 2, ServerState.LOOKING)).hasAllQuorums());
    +    }
    +
    +    @Test
    +    public void testOutofElection() {
    +        HashMap<Long,Vote> outofelection = new HashMap<Long,Vote>();
    +
    +        outofelection.put(1L, new Vote(0x0, 5, ZxidUtils.makeZxid(15, 0), 0xa, 0x17, ServerState.FOLLOWING));
    +        outofelection.put(2L, new Vote(0x0, 5, ZxidUtils.makeZxid(15, 0), 0xa, 0x17, ServerState.FOLLOWING));
    +        outofelection.put(4L, new Vote(0x1, 5, ZxidUtils.makeZxid(15, 0), 0xa, 0x18, ServerState.FOLLOWING));
    +        Vote vote = new Vote(0x1, 5, ZxidUtils.makeZxid(15, 0), 0xa, 0x18, ServerState.LEADING);
    +        outofelection.put(5L, vote);
    +
    +        Notification n = new Notification();
    +        n.version = vote.getVersion();
    +        n.leader = vote.getId();
    +        n.zxid = vote.getZxid();
    +        n.electionEpoch = vote.getElectionEpoch();
    +        n.state = vote.getState();
    +        n.peerEpoch = vote.getPeerEpoch();
    +        n.sid = 5L;
    +        
    +        // Set the logical clock to 1 on fle instance of server 3.
    +        fle.logicalclock.set(0x1);
    +
    +        Assert.assertTrue("Termination predicate failed", 
    --- End diff --
    
    I don't think we are testing any predicates here - maybe this was from the old patch / 3.4 where the `termPredicate` was still active. Suggest to rephrase "Termination predicate failed" to something else.


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    @lvfangmin will review and provide feedback tonight. thanks


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    Here is the PR for 3.5: #714, thanks @mkedwards.


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    Thanks @anmolnar, I'll send out the PR for 3.5.


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    @hanm is the latest change looking good to you?


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r239309222
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java ---
    @@ -1023,9 +1016,9 @@ else if (validVoter(n.sid) && validVoter(n.leader)) {
                                  */
                                 if (n == null) {
                                     setPeerState(proposedLeader, voteSet);
    -
                                     Vote endVote = new Vote(proposedLeader,
    -                                        proposedZxid, proposedEpoch);
    +                                        proposedZxid, logicalclock.get(), 
    --- End diff --
    
    sgtm to keep consistent with existing code style.


---

[GitHub] zookeeper pull request #703: [ZOOKEEPER-1818] Correctly handle potential inc...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/703#discussion_r237367661
  
    --- Diff: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java ---
    @@ -1023,9 +1016,9 @@ else if (validVoter(n.sid) && validVoter(n.leader)) {
                                  */
                                 if (n == null) {
                                     setPeerState(proposedLeader, voteSet);
    -
                                     Vote endVote = new Vote(proposedLeader,
    -                                        proposedZxid, proposedEpoch);
    +                                        proposedZxid, logicalclock.get(), 
    --- End diff --
    
    nit: indentations looks a little bit off here?


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    @lvfangmin Thanks for the clarification!
    Still testing your patch, should be done by tomorrow.


---

[GitHub] zookeeper issue #703: [ZOOKEEPER-1818] Correctly handle potential inconsiste...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/703
  
    @anmolnar following are my understanding about the acceptedEpoch, currentEpoch and electionEpoch:
    
    * acceptedEpoch : the previous epoch we accepted so far, usually is the epoch is the highest zxid on that server.
    * currentEpoch  : the current epoch after syncing with the new leader, it's based on the maximum acceptedEpoch in the quorum, and usually it's the max(acceptedEpoch) + 1. The currentEpoch is used as the peerEpoch in the leader election, as we know (sid, zxid, peerEpoch) are the set used to decide a leader.
    * electionEpoch : not part of the factors to decide leader, but it's used as a logical clock to avoid considering a vote delayed from a while ago.
    
    Basically, we know there is a corner case where the learner may not update it's zxid, peerEpoch, and electionEpoch after leader election (check the new comment I added Leader.updateElectionVote), peerEpoch is fixed with a hack solution, but we cannot easily update the zxid and electionEpoch, so we try to ignore it. But IGNOREVALUE introduced will have compatible issue when rolling upgrade ensemble, that's why we introduced version in notification, and only compare id or peerEpoch based on version.


---