You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@zookeeper.apache.org by GitBox <gi...@apache.org> on 2020/09/10 21:17:04 UTC
[GitHub] [zookeeper] breed commented on a change in pull request #1444: ZOOKEEPER-3922: The introduction of the oracle, a failure detector.

breed commented on a change in pull request #1444:
URL: https://github.com/apache/zookeeper/pull/1444#discussion_r486632022



##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java
##########
@@ -977,6 +977,20 @@ public Vote lookForLeader() throws InterruptedException {
                     int tmpTimeOut = notTimeout * 2;
                     notTimeout = Math.min(tmpTimeOut, maxNotificationInterval);
                     LOG.info("Notification time out: {}", notTimeout);
+
+                    /*
+                     * When a leader failure happens on a master, the backup will be supposed to receive the honour from
+                     * Oracle and become a leader, but the honour is likely to be delay. We do a re-check once timeout happens
+                     *
+                     * The leader election algorithm does not provide the ability of electing a leader from a single instance
+                     * which is in a configuration of 2 instances.
+                     * */
+                    if (voteSet != null && voteSet.hasAllQuorums() && notTimeout != minNotificationInterval) {

Review comment:
       marking this as a scary part that i need to think about some more. (i haven't noticed anything wrong, but i want to come back to this.)

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/cli/ReconfigCommand.java
##########
@@ -130,7 +130,7 @@ public CliCommand parse(String[] cmdArgs) throws CliParseException {
                 //check that membership makes sense; leader will make these checks again
                 //don't check for leader election ports since
                 //client doesn't know what leader election alg is used
-                members = QuorumPeerConfig.parseDynamicConfig(dynamicCfg, 0, true, false).toString();
+                members = QuorumPeerConfig.parseDynamicConfig(dynamicCfg, 0, true, false, null).toString();

Review comment:
       this is probably the right way to do it, but it would be nice to generalize how we pass the config to the QuorumVerifier. (probably not a change we want to do in this PR, so what you are doing here is OK i think.)

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java
##########
@@ -240,16 +249,47 @@ public void addLearnerHandler(LearnerHandler learner) {
      * @param peer
      */
     @Override
-    public void removeLearnerHandler(LearnerHandler peer) {
-        synchronized (forwardingFollowers) {
-            forwardingFollowers.remove(peer);
-        }
-        synchronized (learners) {
-            learners.remove(peer);
-        }
-        synchronized (observingLearners) {
-            observingLearners.remove(peer);
-        }
+    public synchronized void removeLearnerHandler(LearnerHandler peer) {
+        forwardingFollowers.remove(peer);
+
+        /*
+        * ZOOKEEPER-3922
+        *
+        * We will need to re-validate the outstandingProposal to maintain the progress of ZooKeeper.
+        * It is likely a proposal is waiting for enough ACKs to be committed. The proposals are sent out, but the
+        * only follower goes away which makes the proposals will not be committed until the follower recovers back.
+        * An earlier proposal which is not committed will block any further proposals. So, We need to re-validate those
+        * outstanding proposal with the help from Oracle. A key point in the process of re-validation is that the proposals
+        * need to be processed in order.
+        *
+        * We make the whole method blocking to avoid any possible race condition on outstandingProposal and lastCommitted
+        * as well as to avoid nested synchronization.
+        *
+        * As a more generic approach, we pass the object of forwardingFollowers to QuorumOracleMaj to determine if we need
+        * the help from Oracle.
+        *
+        *
+        * the size of outstandingProposals can be 1. The only one outstanding proposal is the one waiting for the ACK from
+        * the leader.
+        * */
+        self.getQuorumVerifier().updateNeedOracle(new ArrayList<>(forwardingFollowers));

Review comment:
       this method has a lot of synchronization and ordering. is this exactly where this code needs to be? it would be nice to move this to a method that gets called only if we are using a oracle verifier.

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java
##########
@@ -1114,6 +1138,74 @@ public Vote lookForLeader() throws InterruptedException {
         }
     }
 
+    private Vote followingBehavior(Map<Long, Vote> recvset, Map<Long, Vote> outofelection, SyncedLearnerTracker voteSet, Notification n) {
+        /*
+         * Consider all notifications from the same epoch
+         * together.
+         */
+        if (n.electionEpoch == logicalclock.get()) {
+            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
+            voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
+            if (voteSet.hasAllQuorums() && checkLeader(recvset, n.leader, n.electionEpoch)) {
+                setPeerState(n.leader, voteSet);
+                Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
+                leaveInstance(endVote);
+                return endVote;
+            }
+        }
+
+        /*
+         * Before joining an established ensemble, verify that
+         * a majority are following the same leader.
+         *
+         * Note that the outofelection map also stores votes from the current leader election.
+         * See ZOOKEEPER-1732 for more information.
+         */
+        outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
+        voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
+
+        if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
+            synchronized (this) {
+                logicalclock.set(n.electionEpoch);
+                setPeerState(n.leader, voteSet);
+            }
+            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
+            leaveInstance(endVote);
+            return endVote;
+        }
+
+        return null;
+    }
+
+    private Vote leadingBehavior(Map<Long, Vote> recvset, Map<Long, Vote> outofelection, SyncedLearnerTracker voteSet, Notification n) {
+        /*
+        *
+        * In a two-node configuration, a recovery nodes cannot locate a leader because of the lack of the majority in the voteset.
+        * Therefore, it is the time for Oracle to take place as a tight breaker.
+        *
+        * */
+        Vote result = followingBehavior(recvset, outofelection, voteSet, n);
+        if (result == null) {
+            /*
+            * Ask Oracle to see if it is okay to follow this leader.
+            *
+            * We dont need the CheckLeader() because itself cannot be a leader candidate
+            * */
+            if (self.getQuorumVerifier().getNeedOracle() && !self.getQuorumVerifier().askOracle()) {
+                LOG.info("Oracle indicates to follow");

Review comment:
       is this supposed to be debug? also, i think perhaps it should be "Oracle votes for me" ?

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
##########
@@ -120,6 +122,30 @@ public boolean equals(Object o) {
         }
         return true;
     }
+
+    @Override
+    public boolean askOracle() {
+        LOG.warn("AskOracle is not implemented");

Review comment:
       do we want these warn messages appearing? if it isn't bad, probably not.

##########
File path: zookeeper-server/src/test/java/org/apache/zookeeper/test/QuorumBase.java
##########
@@ -71,13 +79,14 @@
     protected boolean localSessionsEnabled = false;
     protected boolean localSessionsUpgradingEnabled = false;
 
+
     @BeforeEach
     @Override
     public void setUp() throws Exception {
-        setUp(false);
+        setUp(false, true);
     }
 
-    protected void setUp(boolean withObservers) throws Exception {
+    protected void setUp(boolean withObservers, boolean withOracle) throws Exception {

Review comment:
       love the tests! it would be nice if we didn't have to modify every signature with this. (it would have been nice for the withObservers case as well!)

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java
##########
@@ -1114,6 +1138,74 @@ public Vote lookForLeader() throws InterruptedException {
         }
     }
 
+    private Vote followingBehavior(Map<Long, Vote> recvset, Map<Long, Vote> outofelection, SyncedLearnerTracker voteSet, Notification n) {

Review comment:
       i like how you factored this out, but i don't like the name. (i'm terrible with naming.) perhaps followerVoteCalculation?

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java
##########
@@ -470,7 +471,7 @@ protected void pRequest2Txn(int type, long zxid, Request request, Record record,
                     leavingServers = StringUtils.split(leavingServersString, ",");
                 }
 
-                if (!(lastSeenQV instanceof QuorumMaj)) {
+                if (!(lastSeenQV instanceof QuorumMaj) && !(lastSeenQV instanceof QuorumOracleMaj)) {

Review comment:
       is !(lastSeenQV instanceof QuorumMaj || lastSeenQV instanceof QuorumOracleMaj) clearer? (not sure)

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java
##########
@@ -240,16 +249,47 @@ public void addLearnerHandler(LearnerHandler learner) {
      * @param peer
      */
     @Override
-    public void removeLearnerHandler(LearnerHandler peer) {
-        synchronized (forwardingFollowers) {
-            forwardingFollowers.remove(peer);
-        }
-        synchronized (learners) {
-            learners.remove(peer);
-        }
-        synchronized (observingLearners) {
-            observingLearners.remove(peer);
-        }
+    public synchronized void removeLearnerHandler(LearnerHandler peer) {
+        forwardingFollowers.remove(peer);

Review comment:
       you have changed the synchronization. that for the removes. that needs to be fixed.

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java
##########
@@ -1050,43 +1064,53 @@ public Vote lookForLeader() throws InterruptedException {
                     case OBSERVING:
                         LOG.debug("Notification from observer: {}", n.sid);
                         break;
+
+                        /*
+                        * In ZOOKEEPER-3922, we separate the behaviors of FOLLOWING and LEADING.
+                        * To avoid the duplication of codes, we create a method called followingBehavior which was used to
+                        * shared by FOLLOWING and LEADING. This method returns a Vote. When the returned Vote is null, it follows
+                        * the original idea to break swtich statement; otherwise, a valid returned Vote indicates, a leader
+                        * is generated.
+                        *
+                        * The reason why we need to separate these behaviors is to make the algorithm runnable for 2-node
+                        * setting. An extra condition for generating leader is needed. Due to the majority rule, only when
+                        * there is a majority in the voteset, a leader will be generated. However, in a configuration of 2 nodes,
+                        * the number to achieve the majority remains 2, which means a recovered node cannot generate a leader which is
+                        * the existed leader. Therefore, we need the Oracle to kick in this situation. In a two-node configuration, the Oracle
+                        * only grants the permission to maintain the progress to one node. The oracle either grants the permission to the
+                        * remained node and makes it a new leader when there is a faulty machine, which is the case to maintain the progress.
+                        * Otherwise, the oracle does not grant the permission to the remained node, which further causes a service down.
+                        *
+                        * In the former case, when a failed server recovers and participate in the leader election, it would not locate a
+                        * new leader because there does not exist a majority in the voteset. It fails on the containAllQuorum() infinitely due to
+                        * two facts. First one is the fact that it does do not have a majority in the voteset. The other fact is the fact that
+                        * the oracle would not give the permission since the oracle already gave the permission to the existed leader, the healthy machine.
+                        * Logically, when the oracle replies with negative, it implies the existed leader which is LEADING notification comes from is a valid leader.
+                        * To threat this negative replies as a permission to generate the leader is the purpose to separate these two behaviors.
+                        *
+                        *
+                        * */
                     case FOLLOWING:
-                    case LEADING:
                         /*
-                         * Consider all notifications from the same epoch
-                         * together.
-                         */
-                        if (n.electionEpoch == logicalclock.get()) {
-                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
-                            voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
-                            if (voteSet.hasAllQuorums() && checkLeader(recvset, n.leader, n.electionEpoch)) {
-                                setPeerState(n.leader, voteSet);
-                                Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
-                                leaveInstance(endVote);
-                                return endVote;
-                            }
+                        * To avoid duplicate codes
+                        * */
+                        Vote result_fb = followingBehavior(recvset, outofelection, voteSet, n);
+                        if (result_fb == null) {
+                            break;
+                        } else {
+                            return result_fb;
                         }
-
+                    case LEADING:
                         /*
-                         * Before joining an established ensemble, verify that
-                         * a majority are following the same leader.
-                         *
-                         * Note that the outofelection map also stores votes from the current leader election.
-                         * See ZOOKEEPER-1732 for more information.
-                         */
-                        outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
-                        voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
-
-                        if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
-                            synchronized (this) {
-                                logicalclock.set(n.electionEpoch);
-                                setPeerState(n.leader, voteSet);
-                            }
-                            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
-                            leaveInstance(endVote);
-                            return endVote;
+                        * In leadingBehavior(), it performs followingBehvior() first. When followingBehavior() returns
+                        * a null pointer, ask Oracle whether to follow this leader.
+                        * */
+                        Vote result_lb = leadingBehavior(recvset, outofelection, voteSet, n);

Review comment:
       use camelCase not _

##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/flexible/QuorumOracleMaj.java
##########
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zookeeper.server.quorum.flexible;
+
+
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicBoolean;
+import org.apache.zookeeper.server.quorum.LearnerHandler;
+import org.apache.zookeeper.server.quorum.QuorumPeer;
+import org.apache.zookeeper.server.quorum.QuorumPeerConfig;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/*
+ *
+ * QuorumOracleMaj is a subclass of QuorumMaj.
+ *
+ * QuorumOracleMaj is designed to be functional in a 2-nodes configuration. The only method that this class overrides super
+ * class' method is containsQuorum(). Besides the check of oracle, it also checks the number of voting member. Whenever the
+ * number of voting members is greater than 2. QuorumOracleMaj shall function as hook to its super class.
+ * */
+public class QuorumOracleMaj extends QuorumMaj {
+    private static final Logger LOG = LoggerFactory.getLogger(QuorumOracleMaj.class);
+
+    private String oracle;
+
+    private AtomicBoolean needOracle;
+
+    public QuorumOracleMaj(Map<Long, QuorumPeer.QuorumServer> allMembers, String oraclePath) {
+        super(allMembers);
+        needOracle  = new AtomicBoolean(false);
+        oracle = null;
+        setOracle(oraclePath);
+    }
+
+    public QuorumOracleMaj(Properties props, String oraclePath) throws QuorumPeerConfig.ConfigException {
+        super(props);
+        needOracle  = new AtomicBoolean(false);
+        oracle = null;
+        setOracle(oraclePath);
+    }
+
+    private void setOracle(String path) {
+        if (oracle == null) {
+            oracle = path;
+            LOG.info("Oracle is set to {}", path);
+        } else {
+            LOG.warn("Oracle is already set. Ignore:{}", path);
+        }
+    }
+
+    @Override
+    public void updateNeedOracle(List<LearnerHandler> forwardingFollowers) {
+        // We do not have any follower, and our voting member is 2
+        needOracle.set(forwardingFollowers.isEmpty() && super.getVotingMembers().size() == 2);
+    }
+
+    @Override
+    public boolean askOracle() {
+
+        FileReader fr = null;
+        try {
+            int read;
+            fr = new FileReader(oracle);
+            read = fr.read();
+            LOG.debug("Oracle says:{}", (char) read);
+            return (char) read == '1';
+        } catch (Exception e) {
+            if (oracle == null) {
+                e.printStackTrace();
+                LOG.error("Oracle is not set, return false");
+            }
+            return false;
+        } finally {
+            if (fr != null) {
+                try {
+                    fr.close();
+                } catch (IOException e) {
+                    LOG.debug("Fail to close inputStream");
+                }
+            }
+        }
+    }
+
+    @Override
+    public boolean getNeedOracle() {
+        return needOracle.get() && super.getVotingMembers().size() == 2;
+    }
+
+    @Override
+    public String getOraclePath() {
+        return oracle;
+    }
+
+    @Override
+    public boolean containsQuorum(Set<Long> ackSet) {
+        if (oracle == null || getVotingMembers().size() > 2) {
+            return super.containsQuorum(ackSet);
+        } else if (!super.containsQuorum(ackSet)) {
+            if (getNeedOracle()) {
+                LOG.debug("We lose the quorum, but we do not have any valid followers Oracle:{}", askOracle());
+                return askOracle();
+            } else {
+                return false;
+            }
+        } else {
+            return true;
+        }
+    }
+
+    @Override
+    public boolean equals(Object o) {
+        if (o == null || getClass() != o.getClass()) {
+            return false;
+        }
+        QuorumOracleMaj qm = (QuorumOracleMaj) o;
+        if (qm.getVersion() == super.getVersion()) {
+            return true;
+        }
+        if (super.getAllMembers().size() != qm.getAllMembers().size()) {
+            return false;
+        }
+        for (QuorumPeer.QuorumServer qs : super.getAllMembers().values()) {
+            QuorumPeer.QuorumServer qso = qm.getAllMembers().get(qs.id);
+            if (qso == null || !qs.equals(qso)) {
+                return false;
+            }
+        }
+        return true;
+    }
+
+    @Override
+    public int hashCode() {

Review comment:
       why don't we just use the default implementation if it is arbitrary?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org