Posted to commits@kudu.apache.org by mp...@apache.org on 2016/06/17 19:39:42 UTC
incubator-kudu git commit: Reformat raft-config-change.md to clean it up
Repository: incubator-kudu
Updated Branches:
refs/heads/master 1fe49562d -> e18f6ab21
Reformat raft-config-change.md to clean it up
It needed a bit of markdown love.
Also, s/quorum/configuration/ when it makes sense. Sometimes we use the
word "membership" when it seems clearer to do so.
Change-Id: Iacbd37b8b22ae02dba6d3fd2c389dd97eafe61ed
Reviewed-on: http://gerrit.cloudera.org:8080/3399
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Project: http://git-wip-us.apache.org/repos/asf/incubator-kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kudu/commit/e18f6ab2
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kudu/tree/e18f6ab2
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kudu/diff/e18f6ab2
Branch: refs/heads/master
Commit: e18f6ab217fcc9c910284021c6ed998e5ce70ee4
Parents: 1fe4956
Author: Mike Percy <mp...@apache.org>
Authored: Thu Jun 16 19:58:36 2016 -0700
Committer: Mike Percy <mp...@apache.org>
Committed: Fri Jun 17 19:39:25 2016 +0000
----------------------------------------------------------------------
docs/design-docs/raft-config-change.md | 327 +++++++++++++---------------
1 file changed, 148 insertions(+), 179 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/e18f6ab2/docs/design-docs/raft-config-change.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/raft-config-change.md b/docs/design-docs/raft-config-change.md
index 7f02644..d67c6cc 100644
--- a/docs/design-docs/raft-config-change.md
+++ b/docs/design-docs/raft-config-change.md
@@ -12,7 +12,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
-# Kudu quorum membership change design
+# Kudu Raft configuration change design
For operability and availability reasons, we want to be able to
dynamically change the set of tablet servers that host a given Kudu
@@ -20,258 +20,227 @@ tablet. The use cases for this functionality include:
* Replacing a failed tablet server to maintain the desired replication
factor of tablet data.
-
-* Growing the Kudu cluster over time.
-
-* "Rebalancing" tablet locations to even out the load across tablet
+* Growing the Kudu cluster over time.
+* "Rebalancing" tablet locations to even out the load across tablet
servers.
-
* Increasing the replication of one or more tablets of a table if they
- become hot (eg in a time series workload, making today’s partitions
- have a higher replication)
-
+ become hot (e.g. in a time series workload, making today’s partitions have a
+ higher replication)
## Scope
This document covers the following topics:
-* Design and implementation of quorum membership change at the
+* Design and implementation of membership change at the
consensus level.
-
* Process for adding / removing a tablet server to / from a running
- tablet quorum.
-
+ tablet configuration.
* Process for moving a tablet replica from one tablet server to
another.
-
* Process for restoring availability while attempting to minimize data
- loss after catastrophic failure (permanent failure of a majority of
- the nodes). Since there can be no guarantee or bound on the amount
- of data that may be lost in such a scenario, we only provide a high
- level approach to allow for attempting a manual repair.
+ loss after catastrophic failure (permanent failure of a majority of the
+ nodes). Since there can be no guarantee or bound on the amount of data that
+ may be lost in such a scenario, we only provide a high level approach to
+ allow for attempting a manual repair.
## References
-[1] [Raft paper](https://ramcloud.stanford.edu/raft.pdf)
-
-[2] [Raft cluster membership changes (summarizes extensions from
-Raft author’s PhD thesis)](https://docs.google.com/a/cloudera.com/document/d/14LLUHWmr17_7iSrMRmpzgCx9OLQKFxEsPABs0-JBGRU)
-[3] [Design review notes](https://docs.google.com/a/cloudera.com/document/d/1q6S7Z3PzZUq8jCgCOJzaPohFzvZyfHhWHLvoQUOxj_U)
+[1] [Raft paper](https://raft.github.io/raft.pdf)<br>
+[2] [Diego Ongaro's Ph.D. dissertation](https://github.com/ongardie/dissertation#readme)
We reference [2] a lot in this doc.
-## Quorum membership change
-In Kudu, we change quorum membership following the one-by-one
-membership change design [2] from Diego Ongaro’s PhD thesis. We
-provide a rough outline of the one-by-one design as outlined in the
-thesis, however this doc is mostly concerned with the Kudu-specific
-details and deviations from Raft.
+## Membership change
+
+In Kudu, we change membership following the one-by-one membership change design
+[2] from Diego Ongaro’s PhD dissertation. We provide a rough outline of the
+one-by-one design as outlined in the dissertation; however, this doc is mostly
+concerned with the Kudu-specific details and deviations from Raft.
### One-by-one membership change
-We can only make one addition or subtraction to the quorum
-atomically. Until one such change (i.e. change config transaction)
-commits or aborts, no others may be started. This gives us safety
-guarantees. The proof is outlined in [2].
+We can only make one addition or subtraction to the configuration atomically.
+Until one such change (i.e. change config transaction) commits or aborts, no
+others may be started. This gives us safety guarantees. The proof is outlined
+in [2].
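As a minimal illustration of the one-by-one rule (a sketch only, not Kudu's implementation), a leader-side guard might track a single pending configuration at a time; `LeaderConfigState` and its method names here are hypothetical:

```python
# Hypothetical sketch: enforce at most one uncommitted config change,
# and restrict each change to a single addition or removal.

class ConfigChangeInProgress(Exception):
    """Raised when a change is proposed while another is still in flight."""

class LeaderConfigState:
    def __init__(self, committed_config):
        self.committed_config = set(committed_config)  # last-committed members
        self.pending_config = None                     # at most one in flight

    def propose_config_change(self, new_config):
        new_config = set(new_config)
        if self.pending_config is not None:
            raise ConfigChangeInProgress("previous change not yet committed/aborted")
        # Symmetric difference of size 1 == exactly one addition or removal.
        if len(self.committed_config ^ new_config) != 1:
            raise ValueError("only one member may be added or removed at a time")
        self.pending_config = new_config

    def on_commit(self):
        self.committed_config, self.pending_config = self.pending_config, None

    def on_abort(self):
        # Roll back to the last-committed membership.
        self.pending_config = None
```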
### Process for adding a new node to the cluster
-This process is executed by a driver, which may be a client program or
-the Master. We’ll say the node to be added to the cluster is named
-`new_node`.
-
-1. Driver initiates execution of remote bootstrap procedure of
-`new_node` from the current leader `bootstrap_source` using an RPC call to
-the `new_node`. Remote bootstrap runs to completion, which means all
-data and logs at the time remote bootstrap was initiated were
-replicated to `new_node`. Driver polls `new_node` for indication that the
-remote bootstrap process is complete.
-
-If the `bootstrap_source` node crashes before remote bootstrap is
-complete, the bootstrap fails and the driver must start the entire
-process over from the beginning. If the driver or `new_node` crashes and
-the tablet never joins the quorum, the Master should eventually delete
-the abandoned tablet replica from `new_node`.
-
-2. Driver invokes the AddServer() RPC call on the leader to add
-`new_node` as a `PRE_FOLLOWER` to the quorum. This is a new role type,
-which does not have voting rights. Replicate this config change
-through the cluster (does not change voting majority). The leader will
-automatically transition a `PRE_FOLLOWER` to a `FOLLOWER` (with voting
-rights, implying a potential majority change) when it detects
-`new_node` has caught up sufficiently to replicate the remaining log
-entries within an election timeout (see [2] section 4.2.1). Several
-nodes may be in `PRE_FOLLOWER` mode at a given time, but when
-transitioning to `FOLLOWER` the one-by-one rules still apply.
-
-Failure to add the node as a `PRE_FOLLOWER` (usually due to a leader
-change or weakness in the quorum) will require a retry later by the
-driver.
-
-3. As soon as a replica receives the ConfigChangeRequest it applies
-the quorum change in-memory. It does not wait for commitment to apply
-the change. See rationale in [2] section 4.1.
-
-4. The remote bootstrap session between `new_node` and
-`bootstrap_source` is closed once the config change to transition the
-node to `PRE_FOLLOWER` has been committed. This implies releasing an
-anchor on the log. Since `new_node` is already a member of the quorum
-receiving log updates, it should hold a log anchor on the leader
-starting at the as-yet unreplicated data, so this overlap is safe
-[TODO: this may not yet be implemented, need to check].
-
-5. Eventually the ConfigChangeTransaction is committed and the
-membership change is made durable.
+This process is executed by a driver, which may be a client program or the
+Master. We’ll say the node to be added to the cluster is named `new_node`.
+
+1. Driver initiates execution of remote bootstrap procedure of `new_node` from
+ the current leader `bootstrap_source` using an RPC call to the `new_node`.
+ Remote bootstrap runs to completion, which means all data and logs at the
+ time remote bootstrap was initiated were replicated to `new_node`. Driver
+ polls `new_node` for indication that the remote bootstrap process is
+ complete.
+ <br>
+ If the `bootstrap_source` node crashes before remote bootstrap is complete,
+ the bootstrap fails and the driver must start the entire process over from
+ the beginning. If the driver or `new_node` crashes and the tablet never
+ joins the configuration, the Master should eventually delete the abandoned
+ tablet replica from `new_node`.
+2. Driver invokes the AddServer() RPC call on the leader to add `new_node` as a
+ `PRE_FOLLOWER` to the configuration. This is a new role type, which does not
+ have voting rights. Replicate this config change through the cluster (does
+ not change voting majority). The leader will automatically transition a
+ `PRE_FOLLOWER` to a `FOLLOWER` (with voting rights, implying a potential
+ majority change) when it detects `new_node` has caught up sufficiently to
+ replicate the remaining log entries within an election timeout (see [2]
+ section 4.2.1). Several nodes may be in `PRE_FOLLOWER` mode at a given time,
+ but when transitioning to `FOLLOWER` the one-by-one rules still apply.
+ <br>
+ Failure to add the node as a `PRE_FOLLOWER` (usually due to a leader change
+ or weakness in the configuration) will require a retry later by the driver.
+3. As soon as a replica receives the ConfigChangeRequest it applies the
+ configuration change in-memory. It does not wait for commitment to apply the
+ change. See rationale in [2] section 4.1.
+4. The remote bootstrap session between `new_node` and `bootstrap_source` is
+ closed once the config change to transition the node to `PRE_FOLLOWER` has
+ been committed. This implies releasing an anchor on the log. Since
+ `new_node` is already a member of the configuration receiving log updates,
+ it should hold a log anchor on the leader starting at the as-yet
+ unreplicated data, so this overlap is safe [TODO: this may not yet be
+ implemented, need to check].
+5. Eventually the ConfigChangeTransaction is committed and the membership
+ change is made durable.
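The driver's side of steps 1-2 could be sketched as the loop below. `AddServer()` is the RPC named in this doc; everything else (`start_remote_bootstrap`, `bootstrap_complete`, the Python interface itself) is a hypothetical stand-in, and steps 3-5 happen inside the consensus implementation rather than in the driver:

```python
import time

def add_node(leader, new_node, poll_interval_secs=1.0, timeout_secs=600.0):
    # Step 1: remote bootstrap of new_node from the current leader, then
    # poll until all data/logs present at initiation have been replicated.
    new_node.start_remote_bootstrap(bootstrap_source=leader)
    deadline = time.monotonic() + timeout_secs
    while not new_node.bootstrap_complete():
        if time.monotonic() > deadline:
            # e.g. bootstrap_source crashed: restart the whole procedure.
            raise TimeoutError("remote bootstrap did not finish; retry from step 1")
        time.sleep(poll_interval_secs)
    # Step 2: add new_node as a non-voting PRE_FOLLOWER. The leader itself
    # later promotes it to FOLLOWER once it has caught up sufficiently.
    if not leader.add_server(new_node, role="PRE_FOLLOWER"):
        # Leader change or configuration weakness: retry later.
        raise RuntimeError("AddServer failed; retry later")
```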
### Config change transaction implementation details
-When a config change transaction is received indicating a membership
-change, we apply the change as WIP config change without committing it
-to disk. Consensus commit of the ChangeConfigTransaction causes us to
-sync ConsensusMeta to disk (Raft relies on the log durability but we
-don’t want to prevent log GC due to config change entries).
+When a config change transaction is received indicating a membership change, we
+apply the change as WIP config change without committing it to disk. Consensus
+commit of the ChangeConfigTransaction causes us to sync ConsensusMeta to disk
+(Raft relies on log durability, but we don’t want to prevent log GC due to
+config change entries).
-This approach allows us to "roll back" to the last-committed quorum
-membership in the case that a change config transaction is aborted and
-replaced by the new leader.
+This approach allows us to "roll back" to the last-committed configuration
+membership in the case that a change config transaction is aborted and replaced
+by the new leader.
### Process for removing a node from the cluster
-Removing a given node (let’s call it `doomed_node`) from the cluster
-follows a lot of the same rules as adding a node. The procedure is
-also run by a "driver" process. Here are the details:
-
-1. Driver invokes a RemoveServer() RPC on the quorum leader indicating
-which server to remove from the quorum.
-
-2. If `doomed_node` is not the quorum leader, the leader pushes the
-membership change through consensus using a ConfigChangeTransaction,
-with a quorum that no longer includes `doomed_node`.
-
-3. If `doomed_node` is the leader, the leader transfers quorum
-ownership to the most up-to-date follower in the quorum using the
-procedure outlined in [2] appendix section 3.10 and returns an RPC
-reply to the client `STEPPING_DOWN`, which means the driver should
-refresh its meta cache and try again later.
-
-### Preventing disruptive servers when removing a quorum member
-
-According to [2] section 4.2.3 we cannot use a "pre-vote check" that
-does log matching to prevent disruptive servers, however a pre-vote
-check that checks whether the recipient has heard from the leader in
-the past heartbeat period should work. An additional benefit to this
-is that the potential sender will not continuously increment their
-term number if the pre-vote check fails. So we will use such an
-approach instead of the suggested one.
+Removing a given node (let’s call it `doomed_node`) from the cluster follows a
+lot of the same rules as adding a node. The procedure is also run by a "driver"
+process. Here are the details:
+
+1. Driver invokes a RemoveServer() RPC on the configuration leader indicating
+ which server to remove from the configuration.
+2. If `doomed_node` is not the configuration leader, the leader pushes the
+ membership change through consensus using a ConfigChangeTransaction, with a
+ configuration that no longer includes `doomed_node`.
+3. If `doomed_node` is the leader, the leader transfers configuration ownership
+ to the most up-to-date follower in the configuration using the procedure
+ outlined in [2] appendix section 3.10 and returns an RPC reply to the client
+ `STEPPING_DOWN`, which means the driver should refresh its meta cache and
+ try again later.
+
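A driver-side sketch of this removal procedure, including the `STEPPING_DOWN` retry from step 3; the Python interface and the `remove_server` stub are hypothetical stand-ins for the RemoveServer() RPC:

```python
import time

def remove_node(get_leader, doomed_node, retry_delay_secs=1.0, max_attempts=10):
    for _ in range(max_attempts):
        leader = get_leader()  # re-resolve the leader via the meta cache
        reply = leader.remove_server(doomed_node)
        if reply == "OK":
            return
        if reply == "STEPPING_DOWN":
            # doomed_node was the leader and is transferring leadership;
            # refresh the meta cache and try again later (step 3).
            time.sleep(retry_delay_secs)
            continue
        raise RuntimeError("RemoveServer failed: " + reply)
    raise TimeoutError("gave up removing " + str(doomed_node))
```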
+### Preventing disruptive servers when removing a member
+
+According to [2] section 4.2.3 we cannot use a "pre-vote check" that does log
+matching to prevent disruptive servers, however a pre-vote check that checks
+whether the recipient has heard from the leader in the past heartbeat period
+should work. An additional benefit to this is that the potential sender will
+not continuously increment their term number if the pre-vote check fails. So we
+will use such an approach instead of the suggested one.
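The recency-based pre-vote check might look like the following sketch (hypothetical names, not Kudu code): grant a pre-vote only if this node has not heard from a leader within the election timeout, and note that a failed pre-vote does not bump the would-be candidate's term.

```python
import time

class PreVoteChecker:
    def __init__(self, election_timeout_secs, clock=time.monotonic):
        self.election_timeout_secs = election_timeout_secs
        self.clock = clock
        self.last_heard_from_leader = clock()

    def on_leader_heartbeat(self):
        self.last_heard_from_leader = self.clock()

    def grant_pre_vote(self):
        # Grant only if the current leader seems dead; this is what stops a
        # removed or partitioned server from disrupting the configuration.
        return (self.clock() - self.last_heard_from_leader
                >= self.election_timeout_secs)
```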
## Moving a tablet from one server to another
Replacing a tablet server is always done as a series of steps:
1. Add new server, wait for commit.
-
2. Remove old server, wait for commit.
This may require more design on the Master side. We’ll address that later.
## Restoring availability after catastrophic data loss
-In the case of a permanent loss of a majority of a tablet quorum, all
-durability and consistency guarantees are lost. Assuming there is at
-least one remaining member of the quorum, we may be able to recover
-some data and regain quorum availability by replicating the remaining
-data. However this is highly dangerous and there is no way back once a
-manual process such as this is done.
+In the case of a permanent loss of a majority of a tablet configuration, all
+durability and consistency guarantees are lost. Assuming there is at least one
+remaining member of the configuration, we may be able to recover some data and
+regain configuration availability by replicating the remaining data. However
+this is highly dangerous and there is no way back once a manual process such as
+this is done.
-TODO: This somewhat orthogonal to online quorum changes, maybe move to
+TODO: This is somewhat orthogonal to online configuration changes, maybe move to
another doc.
### Steps:
1. Run a tool to determine the most up-to-date remaining replica.
-
-2. Remote bootstrap additional nodes from the most up-to-date
-remaining node. Wait for remote bootstrap to complete on all the
-nodes.
-
-3. Bring all tablet servers hosting the affected tablet offline (TODO:
-This is possible to implement per-tablet but not currently supported)
-
+2. Remote bootstrap additional nodes from the most up-to-date remaining node.
+ Wait for remote bootstrap to complete on all the nodes.
+3. Bring all tablet servers hosting the affected tablet offline (TODO: This is
+ possible to implement per-tablet but not currently supported)
4. Run tool to rewrite the ConsensusMetadata file per-tablet server to
-forcefully update the quorum membership to add remotely bootstrapped
-nodes as followers. TODO: Violates Raft not to append to the log, do
-we also need to do that?
-
+ forcefully update the configuration membership to add remotely bootstrapped
+ nodes as followers. TODO: Violates Raft not to append to the log, do we also
+ need to do that?
5. Bring the affected tablets / tablet servers back online.
-
6. Pray?
-## Appendix: idea to add a new quorum member before it has bootstrapped all data
-
+## Appendix: Idea to add a new member before it has bootstrapped all data
-The idea here is to take advantage of the fact that nodes can
-participate in Raft consensus without actually applying operations to
their "state machine" (database). In other words, a node doesn’t need
-to have any actual tablet data on it in order to add useful fault
-tolerance and latency-leveling properties. HydraBase calls this mode
-of follower a "WITNESS".
+The idea here is to take advantage of the fact that nodes can participate in
+Raft consensus without actually applying operations to their "state machine"
+(database). In other words, a node doesn’t need to have any actual tablet data
+on it in order to add useful fault tolerance and latency-leveling properties.
+HydraBase calls this mode of follower a "WITNESS".
-For example, consider a three node quorum experiencing a failure:
+For example, consider a three node configuration experiencing a failure:
**key**: L = logging, V = voting, E = electable (has up-to-date tablet data), X = down
-**t=1**: [LVE] [LVE] [LVE]
+**t=1**: {LVE} {LVE} {LVE}
Initially, all replicas are logging, voting, and electable. At this
point they can handle a fault of any node.
-**t=2**: **[LVE X]** [LVE] [LVE] (majority=2)
+**t=2**: **{LVE X}** {LVE} {LVE} (majority=2)
-If the first replica fails, now we have no further fault tolerance,
-since the majority is 2 and only 2 nodes are live. To solve this, we
-can add a new replica which is only logging and voting (but would
-never start an election). This proceeds in two steps:
+If the first replica fails, now we have no further fault tolerance, since the
+majority is 2 and only 2 nodes are live. To solve this, we can add a new
+replica which is only logging and voting (but would never start an election).
+This proceeds in two steps:
-**t=3**: [LVE X] [LVE] [LVE] **[LV]** (majority=3)
+**t=3**: {LVE X} {LVE} {LVE} **{LV}** (majority=3)
-First, we add the new replica as voting. To add the node, we need a
-majority of 3/4, so fault tolerance is not improved.
+First, we add the new replica as voting. To add the node, we need a majority of
+3/4, so fault tolerance is not improved.
-**t=4**: **[L X]** [LVE] [LVE] [LV] (majority = 2, handle 1 fault)
+**t=4**: **{L X}** {LVE} {LVE} {LV} (majority = 2, handle 1 fault)
-Next, we demote the dead replica from LVE to L, so it no longer
-participates in voting. For a server that has just failed, it’s
-preferable to demote to "L" and not completely remove from the quorum,
-because it\u2019s possible (even likely!) it would actually restart before
-the new replica has finished bootstrapping. If it does, we have the
-option of adding it back to the quorum and cancelling the bootstrap.
+Next, we demote the dead replica from LVE to L, so it no longer participates in
+voting. For a server that has just failed, it’s preferable to demote to "L" and
+not completely remove from the configuration, because it’s possible (even
+likely!) it would actually restart before the new replica has finished
+bootstrapping. If it does, we have the option of adding it back to the
+configuration and cancelling the bootstrap.
-Because we now have three voting replicas, the majority is 2, so we
-can handle a fault of any of the remaining three nodes. After reaching
-this state, we can take our time to copy the tablet data to the new
-replica. At some point, the new replica has finished copying its data
-snapshot, and then replays its own log (as it would during bootstrap)
-until it is acting like a normal replica. Once it is a normal replica,
-it is now allowed to start elections.
+Because we now have three voting replicas, the majority is 2, so we can handle
+a fault of any of the remaining three nodes. After reaching this state, we can
+take our time to copy the tablet data to the new replica. At some point, the
+new replica has finished copying its data snapshot, and then replays its own
+log (as it would during bootstrap) until it is acting like a normal replica.
+Once it is a normal replica, it is now allowed to start elections.
-**t=5**: [L X] [LVE] [LVE] **[LVE]** (majority = 2, handle 1 fault)
+**t=5**: {L X} {LVE} {LVE} **{LVE}** (majority = 2, handle 1 fault)
At this point we are fully healed.
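The majority and fault-tolerance counts in the timeline above can be checked with a few lines of arithmetic (illustrative only, using the L/V/E/X flags from the key):

```python
# Check the majority / fault-tolerance arithmetic from the timeline above.
# Each replica is a set of flags: L=logging, V=voting, E=electable, X=down.

def majority(num_voters):
    return num_voters // 2 + 1

def tolerated_faults(replicas):
    voters = sum(1 for r in replicas if "V" in r)
    live_voters = sum(1 for r in replicas if "V" in r and "X" not in r)
    # Live voters beyond the majority are the additional faults we can absorb.
    return max(0, live_voters - majority(voters))

t2 = [{"L","V","E","X"}, {"L","V","E"}, {"L","V","E"}]              # majority=2
t3 = [{"L","V","E","X"}, {"L","V","E"}, {"L","V","E"}, {"L","V"}]   # majority=3
t4 = [{"L","X"}, {"L","V","E"}, {"L","V","E"}, {"L","V"}]           # majority=2
t5 = [{"L","X"}, {"L","V","E"}, {"L","V","E"}, {"L","V","E"}]       # majority=2
```

This reproduces the timeline: zero tolerated faults at t=2 and t=3, one tolerated fault at t=4 and t=5.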
### Advantages:
-The important advantage to this idea is that, when a node fails, we
-can very quickly regain our fault tolerance (on the order of two
-round-trips in order to perform two config changes). If we have to
-wait for the new tablet to bootstrap and replay all data, it may be
-tens of minutes or even hours before regaining fault tolerance.
+The important advantage to this idea is that, when a node fails, we can very
+quickly regain our fault tolerance (on the order of two round-trips in order to
+perform two config changes). If we have to wait for the new tablet to bootstrap
+and replay all data, it may be tens of minutes or even hours before regaining
+fault tolerance.
-As an example, consider the case of a four-node cluster, each node
-having 1TB of replica data. If a node fails, then its 1TB worth of
-data must be transfered among the remaining nodes, so we need to wait
-for 300+GB of data to transfer, which could take up to an hour. During
-that hour, we would have no latency-leveling on writes unless we did
-something like the above.
+As an example, consider the case of a four-node cluster, each node having 1TB
+of replica data. If a node fails, then its 1TB worth of data must be transferred
+among the remaining nodes, so we need to wait for 300+GB of data to transfer,
+which could take up to an hour. During that hour, we would have no
+latency-leveling on writes unless we did something like the above.
### Disadvantages:
-is this more complex?
+Is this more complex?