Posted to commits@kudu.apache.org by to...@apache.org on 2016/03/29 00:20:50 UTC

[1/2] incubator-kudu git commit: Import raft config change design doc

Repository: incubator-kudu
Updated Branches:
  refs/heads/master 457f6fb0d -> c693d878f


Import raft config change design doc

This imports a design doc originally written by Mike from the following
URL (only Cloudera-accessible):

https://docs.google.com/document/d/12y4ZSn5-mLCQtYsM_ccQ6FjsXlVwbxmHVi-ieMBV4Ro/edit#

Change-Id: If38f9893fce54d26c321daa654bca17ab9265bcb
Reviewed-on: http://gerrit.cloudera.org:8080/2635
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/incubator-kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kudu/commit/0f48f5e2
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kudu/tree/0f48f5e2
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kudu/diff/0f48f5e2

Branch: refs/heads/master
Commit: 0f48f5e2fc465b0486b270f7cc5e791cb5cdae41
Parents: 457f6fb
Author: Todd Lipcon <to...@apache.org>
Authored: Fri Mar 25 14:10:32 2016 -0700
Committer: Todd Lipcon <to...@apache.org>
Committed: Mon Mar 28 22:20:18 2016 +0000

----------------------------------------------------------------------
 docs/design-docs/raft-config-change.md | 278 ++++++++++++++++++++++++++++
 1 file changed, 278 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/0f48f5e2/docs/design-docs/raft-config-change.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/raft-config-change.md b/docs/design-docs/raft-config-change.md
new file mode 100644
index 0000000..f6df406
--- /dev/null
+++ b/docs/design-docs/raft-config-change.md
@@ -0,0 +1,278 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Kudu quorum membership change design
+
+For operability and availability reasons, we want to be able to
+dynamically change the set of tablet servers that host a given Kudu
+tablet. The use cases for this functionality include:
+
+* Replacing a failed tablet server to maintain the desired replication
+  factor of tablet data.
+
+* Growing the Kudu cluster over time.
+
+* "Rebalancing" tablet locations to even out the load across tablet
+  servers.
+
+* Increasing the replication factor of one or more tablets of a table
+  if they become hot (e.g. in a time series workload, giving today’s
+  partitions a higher replication factor)
+
+
+## Scope
+This document covers the following topics:
+
+* Design and implementation of quorum membership change at the
+  consensus level.
+
+* Process for adding / removing a tablet server to / from a running
+  tablet quorum.
+
+* Process for moving a tablet replica from one tablet server to
+  another.
+
+* Process for restoring availability while attempting to minimize data
+  loss after catastrophic failure (permanent failure of a majority of
+  the nodes). Since there can be no guarantee or bound on the amount
+  of data that may be lost in such a scenario, we only provide a
+  high-level approach for attempting a manual repair.
+
+## References
+[1] [Raft paper](https://ramcloud.stanford.edu/raft.pdf)
+
+[2] [Raft cluster membership changes (summarizes extensions from
+Raft author’s PhD thesis)](https://docs.google.com/a/cloudera.com/document/d/14LLUHWmr17_7iSrMRmpzgCx9OLQKFxEsPABs0-JBGRU)
+
+[3] [Design review notes](https://docs.google.com/a/cloudera.com/document/d/1q6S7Z3PzZUq8jCgCOJzaPohFzvZyfHhWHLvoQUOxj_U)
+
+We reference [2] a lot in this doc.
+
+
+## Quorum membership change
+In Kudu, we change quorum membership following the one-by-one
+membership change design [2] from Diego Ongaro’s PhD thesis. We
+provide a rough outline of the one-by-one design as described in the
+thesis; however, this doc is mostly concerned with the Kudu-specific
+details and deviations from Raft.
+
+### One-by-one membership change
+
+Only one server may be added to or removed from the quorum at a
+time. Until one such change (i.e. config change transaction) commits
+or aborts, no other change may be started. This is what gives us the
+safety guarantees; the proof is outlined in [2].
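+
+As a minimal illustration of this rule (this is not the actual Kudu
+implementation; the type and method names below are made up), a
+leader-side gate can simply refuse to start a new config change while
+another one is still in flight:
+
+```cpp
+#include <iostream>
+
+// Hypothetical gating flag, not Kudu code: a leader tracks whether a
+// config change transaction is still in flight.
+struct ChangeConfigGate {
+  bool change_pending = false;
+
+  bool TryStart() {                    // called when a new change is requested
+    if (change_pending) return false;  // one-by-one rule: reject
+    change_pending = true;
+    return true;
+  }
+  void Finish() { change_pending = false; }  // called on commit OR abort
+};
+
+int main() {
+  ChangeConfigGate gate;
+  std::cout << gate.TryStart() << "\n";  // 1: change accepted
+  std::cout << gate.TryStart() << "\n";  // 0: rejected, a change is pending
+  gate.Finish();                         // the pending change commits/aborts
+  std::cout << gate.TryStart() << "\n";  // 1: the next change may start
+}
+```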
+
+### Process for adding a new node to the cluster
+
+This process is executed by a driver, which may be a client program or
+the Master. We’ll say the node to be added to the cluster is named
+`new_node`.
+
+1. Driver initiates execution of remote bootstrap procedure of
+`new_node` from the current leader bootstrap_source using an RPC call to
+the `new_node`. Remote bootstrap runs to completion, which means all
+data and logs at the time remote bootstrap was initiated were
+replicated to `new_node`. Driver polls `new_node` for indication that the
+remote bootstrap process is complete.
+
+If the bootstrap_source node crashes before remote bootstrap is
+complete, the bootstrap fails and the driver must start the entire
+process over from the beginning. If the driver or `new_node` crashes and
+the tablet never joins the quorum, the Master should eventually delete
+the abandoned tablet replica from `new_node`.
+
+2. Driver invokes the AddServer() RPC call on the leader to add
+`new_node` as a `PRE_FOLLOWER` to the quorum. This is a new role type,
+which does not have voting rights. This config change is replicated
+through the cluster (it does not change the voting majority). The leader
+will automatically transition a `PRE_FOLLOWER` to a `FOLLOWER` (with voting
+rights, implying a potential majority change) when it detects
+`new_node` has caught up sufficiently to replicate the remaining log
+entries within an election timeout (see [2] section 4.2.1). Several
+nodes may be in `PRE_FOLLOWER` mode at a given time, but when
+transitioning to `FOLLOWER` the one-by-one rules still apply.
+
+Failure to add the node as a `PRE_FOLLOWER` (usually due to a leader
+change or weakness in the quorum) will require a retry later by the
+driver.
+
+3. As soon as a replica receives the ConfigChangeRequest it applies
+the quorum change in-memory. It does not wait for commitment to apply
+the change. See rationale in [2] section 4.1.
+
+4. The remote bootstrap session between `new_node` and
+bootstrap_source is closed once the config change to transition the
+node to `PRE_FOLLOWER` has been committed. This implies releasing an
+anchor on the log. Since `new_node` is already a member of the quorum
+receiving log updates, it should hold a log anchor on the leader
+starting at the as-yet unreplicated data, so this overlap is safe
+[TODO: this may not yet be implemented, need to check].
+
+5. Eventually the ConfigChangeTransaction is committed and the
+membership change is made durable. (The driver-side portion of this
+sequence is sketched below.)
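+
+The following standalone C++ sketch shows the driver-side portion of
+the sequence above (steps 1 and 2). It is illustrative only: the
+`StartRemoteBootstrap`, `RemoteBootstrapComplete` and
+`AddServerAsPreFollower` helpers are hypothetical stand-ins for the
+RPCs a driver would issue, not actual Kudu client APIs.
+
+```cpp
+#include <chrono>
+#include <iostream>
+#include <stdexcept>
+#include <string>
+#include <thread>
+
+// Hypothetical RPC wrappers; the stub bodies here just simulate success.
+static bool StartRemoteBootstrap(const std::string&, const std::string&) { return true; }
+static bool RemoteBootstrapComplete(const std::string&) { return true; }
+static bool AddServerAsPreFollower(const std::string&, const std::string&) { return true; }
+
+// Driver-side sequence for adding `new_node` (steps 1 and 2 above).
+// Steps 3-5 happen inside the consensus implementation on the replicas.
+void AddNodeToQuorum(const std::string& leader, const std::string& new_node) {
+  // Step 1: remote bootstrap from the current leader, then poll until done.
+  if (!StartRemoteBootstrap(new_node, /*bootstrap_source=*/leader)) {
+    throw std::runtime_error("remote bootstrap failed; restart from scratch");
+  }
+  while (!RemoteBootstrapComplete(new_node)) {
+    std::this_thread::sleep_for(std::chrono::seconds(1));
+  }
+  // Step 2: ask the leader to add the node as a non-voting PRE_FOLLOWER;
+  // the leader later promotes it to FOLLOWER once it has caught up.
+  if (!AddServerAsPreFollower(leader, new_node)) {
+    throw std::runtime_error("AddServer failed (leader change?); retry later");
+  }
+  std::cout << new_node << " added as PRE_FOLLOWER" << std::endl;
+}
+
+int main() { AddNodeToQuorum("ts-leader", "ts-new"); }
+```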
+
+### Config change transaction implementation details
+
+When a config change transaction is received indicating a membership
+change, we apply the change as a WIP (pending) config change without
+committing it to disk. Consensus commit of the ChangeConfigTransaction
+causes us to sync ConsensusMeta to disk (Raft relies on log durability,
+but we don’t want to prevent log GC due to config change entries).
+
+This approach allows us to "roll back" to the last-committed quorum
+membership in the case that a change config transaction is aborted and
+replaced by the new leader.
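+
+To make the timing concrete, here is a standalone C++ sketch of this
+behavior (illustrative names only, not Kudu’s actual classes): the
+pending config is applied in memory as soon as the entry is received,
+the consensus metadata is flushed to disk only when the change commits,
+and an abort simply drops the pending config and falls back to the
+last-committed one:
+
+```cpp
+#include <cstdio>
+#include <optional>
+#include <string>
+#include <utility>
+#include <vector>
+
+using Config = std::vector<std::string>;  // just the voting member UUIDs
+
+// Stand-in for the on-disk consensus metadata; a real implementation
+// would serialize and fsync the file here.
+struct ConsensusMetaFile {
+  Config on_disk{"A", "B", "C"};
+  void Flush(const Config& c) { on_disk = c; std::puts("fsync consensus meta"); }
+};
+
+class ReplicaConfigState {
+ public:
+  // A replicated config-change entry was received: apply in memory only.
+  void SetPending(Config c) { pending_ = std::move(c); }
+
+  // Consensus commit: only now is the new config made durable.
+  void CommitConfigChange(ConsensusMetaFile* meta) {
+    meta->Flush(*pending_);
+    committed_ = *pending_;
+    pending_.reset();
+  }
+
+  // Aborted by a new leader: roll back to the last committed config.
+  void AbortConfigChange() { pending_.reset(); }
+
+  const Config& active() const { return pending_ ? *pending_ : committed_; }
+
+ private:
+  Config committed_{"A", "B", "C"};
+  std::optional<Config> pending_;
+};
+
+int main() {
+  ConsensusMetaFile meta;
+  ReplicaConfigState state;
+  state.SetPending({"A", "B", "C", "D"});  // in memory; nothing fsynced yet
+  state.AbortConfigChange();               // replaced by a new leader: roll back
+  state.SetPending({"A", "B", "C", "E"});
+  state.CommitConfigChange(&meta);         // durable only at commit time
+}
+```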
+
+### Process for removing a node from the cluster
+
+Removing a given node (let’s call it `doomed_node`) from the cluster
+follows a lot of the same rules as adding a node. The procedure is
+also run by a "driver" process. Here are the details:
+
+1. Driver invokes a RemoveServer() RPC on the quorum leader indicating
+which server to remove from the quorum.
+
+2. If `doomed_node` is not the quorum leader, the leader pushes the
+membership change through consensus using a ConfigChangeTransaction,
+with a quorum that no longer includes `doomed_node`.
+
+3. If `doomed_node` is the leader, the leader transfers quorum
+ownership to the most up-to-date follower in the quorum using the
+procedure outlined in [2] appendix section 3.10, and returns a
+`STEPPING_DOWN` RPC reply to the client, which means the driver should
+refresh its meta cache and try again later. (The leader-side handling
+of both cases is sketched below.)
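+
+Below is a minimal C++ sketch of the leader-side decision. The enum and
+function names are hypothetical, not Kudu’s actual API: removing a
+follower replicates a config that omits it, while removing the leader
+itself results in a `STEPPING_DOWN` reply:
+
+```cpp
+#include <algorithm>
+#include <iostream>
+#include <string>
+#include <vector>
+
+enum class RemoveResult { OK, STEPPING_DOWN };
+
+// Leader-side handling of a RemoveServer() request (sketch only).
+RemoveResult HandleRemoveServer(const std::string& leader_uuid,
+                                std::vector<std::string>* quorum,
+                                const std::string& doomed_node) {
+  if (doomed_node == leader_uuid) {
+    // Step 3: first transfer leadership to the most up-to-date follower,
+    // and tell the driver to refresh its metadata and retry later.
+    return RemoveResult::STEPPING_DOWN;
+  }
+  // Step 2: build a config that simply omits doomed_node; a real
+  // implementation would replicate this config through consensus.
+  quorum->erase(std::remove(quorum->begin(), quorum->end(), doomed_node),
+                quorum->end());
+  return RemoveResult::OK;
+}
+
+int main() {
+  std::vector<std::string> quorum = {"A", "B", "C"};
+  std::cout << (HandleRemoveServer("A", &quorum, "C") == RemoveResult::OK) << "\n";
+  std::cout << (HandleRemoveServer("A", &quorum, "A") ==
+                RemoveResult::STEPPING_DOWN) << "\n";
+}
+```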
+
+### Preventing disruptive servers when removing a quorum member
+
+According to [2] section 4.2.3, we cannot use a "pre-vote check" that
+does log matching to prevent disruptive servers; however, a pre-vote
+check that asks whether the recipient has heard from the leader within
+the past heartbeat period should work. An additional benefit of this
+is that the potential candidate will not continuously increment its
+term number if the pre-vote check fails, so we will use such an
+approach instead of the one suggested there.
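+
+A minimal sketch of this check (illustrative only, not Kudu code): a
+peer grants a pre-vote only if it has not heard from a leader within
+the timeout window, so a candidate that fails the check never bumps
+its term:
+
+```cpp
+#include <chrono>
+#include <iostream>
+
+using Clock = std::chrono::steady_clock;
+
+// A peer grants a pre-vote only if it has NOT heard from a leader
+// within the timeout window; a rejected candidate does not bump its term.
+struct PreVoteState {
+  Clock::time_point last_leader_contact = Clock::now();
+  std::chrono::milliseconds timeout{1500};
+
+  bool GrantPreVote(Clock::time_point now) const {
+    return (now - last_leader_contact) > timeout;
+  }
+};
+
+int main() {
+  PreVoteState peer;
+  // A leader heartbeat arrived just now: deny the pre-vote.
+  std::cout << peer.GrantPreVote(Clock::now()) << "\n";  // 0
+  // No leader contact for two seconds: grant the pre-vote.
+  std::cout << peer.GrantPreVote(Clock::now() + std::chrono::seconds(2))
+            << "\n";                                     // 1
+}
+```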
+
+## Moving a tablet from one server to another
+
+Moving a tablet replica from one tablet server to another is always
+done as a series of steps:
+
+1. Add new server, wait for commit.
+
+2. Remove old server, wait for commit.
+
+This may require more design on the Master side. We’ll address that later.
+
+## Restoring availability after catastrophic data loss
+
+In the case of a permanent loss of a majority of a tablet quorum, all
+durability and consistency guarantees are lost. Assuming there is at
+least one remaining member of the quorum, we may be able to recover
+some data and regain quorum availability by replicating the remaining
+data. However, this is highly dangerous, and there is no way back once
+such a manual process has been carried out.
+
+TODO: This is somewhat orthogonal to online quorum changes; maybe move
+it to another doc.
+
+### Steps:
+
+1. Run a tool to determine the most up-to-date remaining replica (see
+the sketch following this list).
+
+2. Remote bootstrap additional nodes from the most up-to-date
+remaining node. Wait for remote bootstrap to complete on all the
+nodes.
+
+3. Bring all tablet servers hosting the affected tablet offline. (TODO:
+This is possible to implement per-tablet but is not currently supported.)
+
+4. Run a tool to rewrite the ConsensusMetadata file on each tablet
+server to forcefully update the quorum membership, adding the remotely
+bootstrapped nodes as followers. TODO: Not appending an entry to the
+log violates Raft; do we also need to do that?
+
+5. Bring the affected tablets / tablet servers back online.
+
+6. Pray?
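+
+As referenced in step 1, here is a sketch of how a tool could pick the
+most up-to-date remaining replica (illustrative only; not an existing
+Kudu tool): compare the last-logged OpId of each survivor, ordering by
+term and then by index, as in standard Raft:
+
+```cpp
+#include <algorithm>
+#include <cstdint>
+#include <iostream>
+#include <string>
+#include <tuple>
+#include <vector>
+
+struct ReplicaInfo {
+  std::string uuid;
+  int64_t last_term;   // term of the last entry in the replica's log
+  int64_t last_index;  // index of the last entry in the replica's log
+};
+
+// Standard Raft up-to-date comparison: higher term wins, then higher index.
+ReplicaInfo MostUpToDate(const std::vector<ReplicaInfo>& survivors) {
+  return *std::max_element(
+      survivors.begin(), survivors.end(),
+      [](const ReplicaInfo& a, const ReplicaInfo& b) {
+        return std::tie(a.last_term, a.last_index) <
+               std::tie(b.last_term, b.last_index);
+      });
+}
+
+int main() {
+  std::vector<ReplicaInfo> survivors = {{"ts-1", 4, 120}, {"ts-2", 5, 97}};
+  std::cout << MostUpToDate(survivors).uuid << "\n";  // ts-2: higher term wins
+}
+```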
+
+
+## Appendix: idea to add a new quorum member before it has bootstrapped all data
+
+The idea here is to take advantage of the fact that nodes can
+participate in Raft consensus without actually applying operations to
+their "state machine" (database). In other words, a node doesn’t need
+to have any actual tablet data on it in order to add useful fault
+tolerance and latency-leveling properties. HydraBase calls this mode
+of follower a "WITNESS".
+
+For example, consider a three-node quorum experiencing a failure:
+
+**key**: L = logging, V = voting, E = electable (has up-to-date tablet data), X = down
+
+**t=1**: [LVE] [LVE] [LVE]
+
+Initially, all replicas are logging, voting, and electable. At this
+point they can handle a fault of any node.
+
+**t=2**: **[LVE X]** [LVE] [LVE] (majority=2)
+
+If the first replica fails, now we have no further fault tolerance,
+since the majority is 2 and only 2 nodes are live. To solve this, we
+can add a new replica which is only logging and voting (but would
+never start an election). This proceeds in two steps:
+
+**t=3**: [LVE X] [LVE] [LVE] **[LV]** (majority=3)
+
+First, we add the new replica as voting. To add the node, we need a
+majority of 3/4, so fault tolerance is not improved.
+
+**t=4**:  **[L X]** [LVE] [LVE] [LV] (majority = 2, handle 1 fault)
+
+Next, we demote the dead replica from LVE to L, so it no longer
+participates in voting. For a server that has just failed, it’s
+preferable to demote to "L" and not completely remove from the quorum,
+because it’s possible (even likely!) it would actually restart before
+the new replica has finished bootstrapping. If it does, we have the
+option of adding it back to the quorum and cancelling the bootstrap.
+
+Because we now have three voting replicas, the majority is 2, so we
+can handle a fault of any of the remaining three nodes. After reaching
+this state, we can take our time to copy the tablet data to the new
+replica. At some point, the new replica has finished copying its data
+snapshot, and then replays its own log (as it would during bootstrap)
+until it is acting like a normal replica. Once it is a normal replica,
+it is now allowed to start elections.
+
+**t=5**:  [L X] [LVE] [LVE] **[LVE]** (majority = 2, handle 1 fault)
+
+At this point we are fully healed.
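+
+For reference, the majority sizes quoted above follow from the usual
+Raft rule: majority = floor(voters / 2) + 1, and the quorum tolerates
+(live voters - majority) additional failures. A small C++ sketch of the
+arithmetic for each state (illustrative only):
+
+```cpp
+#include <iostream>
+
+int main() {
+  struct State { const char* name; int voters; int live_voters; };
+  const State states[] = {
+      {"t=1", 3, 3},  // [LVE] [LVE] [LVE]
+      {"t=2", 3, 2},  // first replica down
+      {"t=3", 4, 3},  // new [LV] replica added; dead node still a voter
+      {"t=4", 3, 3},  // dead replica demoted to [L], no longer voting
+      {"t=5", 3, 3},  // new replica finishes bootstrapping (now electable)
+  };
+  for (const State& s : states) {
+    int majority = s.voters / 2 + 1;
+    std::cout << s.name << ": majority=" << majority
+              << ", extra faults tolerated=" << (s.live_voters - majority)
+              << "\n";
+  }
+}
+```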
+
+### Advantages:
+
+The important advantage of this idea is that, when a node fails, we
+can very quickly regain our fault tolerance (on the order of two
+round-trips to perform two config changes). If we instead have to
+wait for the new replica to bootstrap and replay all of the data, it
+may be tens of minutes or even hours before we regain fault tolerance.
+
+As an example, consider the case of a four-node cluster, each node
+having 1TB of replica data. If a node fails, then its 1TB worth of
+data must be transferred among the remaining three nodes, so each of
+them must receive roughly 333GB (1TB / 3). That means we need to wait
+for 300+GB of data to transfer, which could take up to an hour. During
+that hour, we would have no latency-leveling on writes unless we did
+something like the above.
+
+### Disadvantages:
+Is this more complex?
+


[2/2] incubator-kudu git commit: KUDU-1338. Pending raft config should be cleared when CHANGE_CONFIG is aborted

Posted by to...@apache.org.
KUDU-1338. Pending raft config should be cleared when CHANGE_CONFIG is aborted

This fixes the issue described in KUDU-1338: when a config-change operation
is aborted on a replica, it's important to clear the pending configuration
state. Otherwise, we can hit either of two issues:

1) The replica will become 'stuck' in the case that a different leader
proposes a new change-config. The replica won't accept the new config
change since a change is already pending, and so never makes further
progress.

2) If the replica manages to elect itself leader, it could result
in Raft divergence when it continues to operate with a stale
configuration.

The patch contains a simple test modification which reliably reproduced the
first of the two issues.

Change-Id: Id2e99e4e67e2d6324c8123f79bea84523581b78b
Reviewed-on: http://gerrit.cloudera.org:8080/2483
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/incubator-kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kudu/commit/c693d878
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kudu/tree/c693d878
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kudu/diff/c693d878

Branch: refs/heads/master
Commit: c693d878fdebae81c987466a1b38327cb015e8c3
Parents: 0f48f5e
Author: Todd Lipcon <to...@apache.org>
Authored: Tue Mar 8 09:40:11 2016 -0800
Committer: Todd Lipcon <to...@apache.org>
Committed: Mon Mar 28 22:20:31 2016 +0000

----------------------------------------------------------------------
 docs/design-docs/raft-config-change.md             |  7 +++----
 src/kudu/consensus/raft_consensus.cc               |  3 ++-
 src/kudu/consensus/raft_consensus_state.cc         | 17 +++++++++++++++--
 src/kudu/integration-tests/raft_consensus-itest.cc |  9 +++++++++
 4 files changed, 29 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c693d878/docs/design-docs/raft-config-change.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/raft-config-change.md b/docs/design-docs/raft-config-change.md
index f6df406..7f02644 100644
--- a/docs/design-docs/raft-config-change.md
+++ b/docs/design-docs/raft-config-change.md
@@ -59,7 +59,6 @@ Raft author’s PhD thesis)](https://docs.google.com/a/cloudera.com/document/d/1
 
 We reference [2] a lot in this doc.
 
-
 ## Quorum membership change
 In Kudu, we change quorum membership following the one-by-one
 membership change design [2] from Diego Ongaro’s PhD thesis. We
@@ -81,13 +80,13 @@ the Master. We’ll say the node to be added to the cluster is named
 `new_node`.
 
 1. Driver initiates execution of remote bootstrap procedure of
-`new_node` from the current leader bootstrap_source using an RPC call to
+`new_node` from the current leader `bootstrap_source` using an RPC call to
 the `new_node`. Remote bootstrap runs to completion, which means all
 data and logs at the time remote bootstrap was initiated were
 replicated to `new_node`. Driver polls `new_node` for indication that the
 remote bootstrap process is complete.
 
-If the bootstrap_source node crashes before remote bootstrap is
+If the `bootstrap_source` node crashes before remote bootstrap is
 complete, the bootstrap fails and the driver must start the entire
 process over from the beginning. If the driver or `new_node` crashes and
 the tablet never joins the quorum, the Master should eventually delete
@@ -113,7 +112,7 @@ the quorum change in-memory. It does not wait for commitment to apply
 the change. See rationale in [2] section 4.1.
 
 4. The remote bootstrap session between `new_node` and
-bootstrap_source is closed once the config change to transition the
+`bootstrap_source` is closed once the config change to transition the
 node to `PRE_FOLLOWER` has been committed. This implies releasing an
 anchor on the log. Since `new_node` is already a member of the quorum
 receiving log updates, it should hold a log anchor on the leader

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c693d878/src/kudu/consensus/raft_consensus.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/raft_consensus.cc b/src/kudu/consensus/raft_consensus.cc
index 473d91e..d046a6d 100644
--- a/src/kudu/consensus/raft_consensus.cc
+++ b/src/kudu/consensus/raft_consensus.cc
@@ -1873,7 +1873,8 @@ void RaftConsensus::NonTxRoundReplicationFinished(ConsensusRound* round,
   string op_type_str = OperationType_Name(op_type);
   CHECK(IsConsensusOnlyOperation(op_type)) << "Unexpected op type: " << op_type_str;
   if (!status.ok()) {
-    // TODO: Do something with the status on failure?
+    // In the case that a change-config operation is aborted, RaftConsensusState
+    // already handled clearing the pending config state.
     LOG(INFO) << state_->LogPrefixThreadSafe() << op_type_str << " replication failed: "
               << status.ToString();
     client_cb.Run(status);

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c693d878/src/kudu/consensus/raft_consensus_state.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/raft_consensus_state.cc b/src/kudu/consensus/raft_consensus_state.cc
index e9f907e..240ae4f 100644
--- a/src/kudu/consensus/raft_consensus_state.cc
+++ b/src/kudu/consensus/raft_consensus_state.cc
@@ -420,8 +420,21 @@ Status ReplicaState::AbortOpsAfterUnlocked(int64_t new_preceding_idx) {
 
   for (; iter != pending_txns_.end();) {
     const scoped_refptr<ConsensusRound>& round = (*iter).second;
-    LOG_WITH_PREFIX_UNLOCKED(INFO) << "Aborting uncommitted operation due to leader change: "
-                                   << round->replicate_msg()->id();
+    auto op_type = round->replicate_msg()->op_type();
+    LOG_WITH_PREFIX_UNLOCKED(INFO)
+        << "Aborting uncommitted " << OperationType_Name(op_type)
+        << " operation due to leader change: " << round->replicate_msg()->id();
+
+    // When aborting a config-change operation, go back to using the committed
+    // configuration.
+    if (PREDICT_FALSE(op_type == CHANGE_CONFIG_OP)) {
+      CHECK(IsConfigChangePendingUnlocked())
+          << LogPrefixUnlocked() << "Aborting CHANGE_CONFIG_OP but "
+          << "there was no pending config set. Op: "
+          << round->replicate_msg()->ShortDebugString();
+      ClearPendingConfigUnlocked();
+    }
+
     round->NotifyReplicationFinished(Status::Aborted("Transaction aborted by new leader"));
     // Erase the entry from pendings.
     pending_txns_.erase(iter++);

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c693d878/src/kudu/integration-tests/raft_consensus-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/raft_consensus-itest.cc b/src/kudu/integration-tests/raft_consensus-itest.cc
index 6c25c8e..ab72b52 100644
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -1656,6 +1656,7 @@ TEST_F(RaftConsensusITest, TestReplaceChangeConfigOperation) {
 
   // Elect one of the other servers.
   ASSERT_OK(StartElection(tservers[1], tablet_id_, MonoDelta::FromSeconds(10)));
+  leader_tserver = tservers[1];
 
   // Resume the original leader. Its change-config operation will now be aborted
   // since it was never replicated to the majority, and the new leader will have
@@ -1665,6 +1666,14 @@ TEST_F(RaftConsensusITest, TestReplaceChangeConfigOperation) {
   // Insert some data and verify that it propagates to all servers.
   NO_FATALS(InsertTestRowsRemoteThread(0, 10, 1, vector<CountDownLatch*>()));
   ASSERT_ALL_REPLICAS_AGREE(10);
+
+  // Try another config change.
+  // This acts as a regression test for KUDU-1338, in which aborting the original
+  // config change didn't properly unset the 'pending' configuration.
+  ASSERT_OK(RemoveServer(leader_tserver, tablet_id_, tservers[2],
+                         -1, MonoDelta::FromSeconds(5),
+                         &error_code));
+  NO_FATALS(InsertTestRowsRemoteThread(10, 10, 1, vector<CountDownLatch*>()));
 }
 
 // Test the atomic CAS arguments to ChangeConfig() add server and remove server.