Posted to commits@kudu.apache.org by ad...@apache.org on 2016/04/27 03:10:58 UTC

[3/3] incubator-kudu git commit: design-docs: multi-master for 1.0 release

design-docs: multi-master for 1.0 release

Here's the design doc I wrote for multi-master. It describes where we are
now and what we need to do to feel comfortable with the feature in Kudu 1.0.
Notably, client-related discussion is omitted. That's because I'm expecting
to fix it in the next release, so I haven't thought about it much. For now
I'm focusing on the master-side of the feature.

Change-Id: Iad76012977a45370b72a04d608371cecf90442ef
Reviewed-on: http://gerrit.cloudera.org:8080/2527
Tested-by: Adar Dembo <ad...@cloudera.com>
Reviewed-by: Todd Lipcon <to...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/incubator-kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kudu/commit/73b8cc07
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kudu/tree/73b8cc07
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kudu/diff/73b8cc07

Branch: refs/heads/master
Commit: 73b8cc07ec0fda38a2627d5b7de70b07b7943c1e
Parents: 6e7a04a
Author: Adar Dembo <ad...@cloudera.com>
Authored: Thu Mar 10 20:17:17 2016 -0800
Committer: Adar Dembo <ad...@cloudera.com>
Committed: Wed Apr 27 01:09:52 2016 +0000

----------------------------------------------------------------------
 docs/design-docs/README.md           |   1 +
 docs/design-docs/multi-master-1.0.md | 614 ++++++++++++++++++++++++++++++
 2 files changed, 615 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/73b8cc07/docs/design-docs/README.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/README.md b/docs/design-docs/README.md
index 24566e6..9475641 100644
--- a/docs/design-docs/README.md
+++ b/docs/design-docs/README.md
@@ -35,3 +35,4 @@ made.
 | [C++ client design and impl. details](cpp-client.md) | Client | N/A |
 | [(old) Heartbeating between tservers and multiple masters](old-multi-master-heartbeating.md) | Master | [gerrit](http://gerrit.cloudera.org:8080/2495) |
 | [Scan Token API](scan-tokens.md) | Client | [gerrit](http://gerrit.cloudera.org:8080/2443) |
+| [Full multi-master support for Kudu 1.0](multi-master-1.0.md) | Master, Client | [gerrit](http://gerrit.cloudera.org:8080/2527) |

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/73b8cc07/docs/design-docs/multi-master-1.0.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/multi-master-1.0.md b/docs/design-docs/multi-master-1.0.md
new file mode 100644
index 0000000..11101dc
--- /dev/null
+++ b/docs/design-docs/multi-master-1.0.md
@@ -0,0 +1,614 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Multi-master support for Kudu 1.0
+
+## Background
+
+Kudu's design avoids a single point of failure via multiple Kudu masters.
+Just as with tablets, master metadata is persisted to disk and replicated
+via Raft consensus, and so a deployment of **2N+1** masters can tolerate up
+to **N** failures.
+
+By the time Kudu's first beta launched, support for multiple masters had
+been implemented but was too fragile to be anything but experimental. The
+rest of this document describes the gaps that must be filled before
+multi-master support is ready for production, and lays out a plan for how
+to fill them.
+
+## Gaps in the master
+
+### Current design
+
+At startup, the master Raft configuration elects a leader master. The
+leader master is responsible for servicing both tserver heartbeats and
+client requests. The follower masters participate in Raft consensus and
+replicate metadata, but are otherwise idle; any heartbeats or client
+requests they receive are rejected.
+
+All persistent master metadata is stored in a single replicated tablet.
+Every row in this tablet represents either a table or a tablet. Table
+records include unique table identifiers, the table's schema, and other bits
+of information. Tablet records include a unique identifier, the tablet's
+Raft configuration, and other information.
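+
+As a rough illustration (the real master persists protobuf records in its
+system tablet; the names and fields below are invented for clarity), the two
+kinds of records look something like this:
+
+```cpp
+// Simplified sketch of the two kinds of records in the master tablet.
+#include <cstdint>
+#include <string>
+#include <vector>
+
+struct TableRecord {
+  std::string table_id;      // unique table identifier
+  std::string table_name;
+  std::string schema;        // serialized schema
+  int32_t schema_version;
+  std::string state;         // e.g. RUNNING, ALTERING, REMOVED
+};
+
+struct TabletRecord {
+  std::string tablet_id;     // unique tablet identifier
+  std::string table_id;      // owning table
+  std::vector<std::string> raft_peer_uuids;  // current Raft configuration
+  std::string state;         // e.g. PREPARING, CREATING, RUNNING, DELETED
+};
+```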
+
+What master metadata is replicated?
+
+1. Table and tablet existence, via **CreateTable()** and **DeleteTable()**.
+   Every new tablet record also includes an initial Raft configuration.
+2. Schema changes, via **AlterTable()** and tserver heartbeats.
+3. Tablet server Raft configuration changes, via tserver heartbeats. These
+   include both the list of peers (which may have changed due to
+   under-replication) and the current leader (which may have changed due to
+   an election).
+
+Scanning the master tablet for every heartbeat or request would be slow,
+so the leader master caches all master metadata in memory. The caches are
+only updated after a metadata change is successfully replicated; in this
+way they are always consistent with the on-disk tablet. When a new leader
+master is elected, it scans the entire master tablet and uses the metadata
+to rebuild its in-memory caches.
+
+To understand how the in-memory caches work, let's start with the different
+kinds of information that are cached:
+
+1. Tablet server instance. Is uniquely identified by the tserver's UUID and
+   includes general metadata about the tserver, how recently this master
+   received a heartbeat from the tserver, and proxy objects that the master
+   uses to communicate with the tserver.
+2. Table instance. Is uniquely identified by the table's ID and includes
+   general metadata about the table.
+3. Tablet instance. Is uniquely identified by the tablet's ID and includes
+   general metadata about the tablet along with its current replicas.
+
+Now, let's describe the various data structures that store this information:
+
+1. Global: map of tserver UUID to tserver instance.
+2. Global: map of table ID to table instance. All tables are present, even
+   deleted tables.
+3. Global: map of table name to table instance. Deleted tables are omitted;
+   otherwise a new table's name could collide with a deleted one's.
+4. Global: map of tablet ID to tablet instance. All tablets are present,
+   even deleted ones.
+5. Per-table instance: map of tablet ID to tablet instance. Deleted tablets
+   are retained here but are not inserted the next time master metadata is
+   reloaded from disk.
+6. Per-tablet instance: map of tserver UUID to tablet replica.
+
+With few exceptions (detailed below), these caches behave more or less as
+one would expect. For example, a successful **CreateTable()** yields table,
+tablet, and replica instances, all of which are mapped accordingly. A
+heartbeat from a tserver may update a tablet's replica map if that tablet
+elected a new leader replica. And so on.
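+
+As a rough illustration, the cache structures enumerated above might look
+like the following sketch. Names and types are invented for clarity and do
+not match the actual master code:
+
+```cpp
+// Illustrative sketch of the leader master's in-memory caches.
+#include <cstdint>
+#include <map>
+#include <memory>
+#include <string>
+#include <unordered_map>
+
+struct TSDescriptor {
+  std::string uuid;
+  int64_t last_heartbeat_ms;  // how recently this tserver heartbeated
+};
+
+struct TabletReplica {
+  std::shared_ptr<TSDescriptor> ts;  // which tserver hosts the replica
+  std::string role;                  // LEADER or FOLLOWER
+};
+
+struct TabletInfo {
+  std::string tablet_id;
+  // 6. Per-tablet: tserver UUID -> replica.
+  std::unordered_map<std::string, TabletReplica> replica_locations;
+};
+
+struct TableInfo {
+  std::string table_id;
+  // 5. Per-table: tablet ID -> tablet instance (deleted tablets retained).
+  std::map<std::string, std::shared_ptr<TabletInfo>> tablet_map;
+};
+
+struct CatalogCaches {
+  // 1. tserver UUID -> tserver instance.
+  std::unordered_map<std::string, std::shared_ptr<TSDescriptor>> ts_by_uuid;
+  // 2. table ID -> table instance (deleted tables included).
+  std::unordered_map<std::string, std::shared_ptr<TableInfo>> table_by_id;
+  // 3. table name -> table instance (deleted tables omitted).
+  std::unordered_map<std::string, std::shared_ptr<TableInfo>> table_by_name;
+  // 4. tablet ID -> tablet instance (deleted tablets included).
+  std::unordered_map<std::string, std::shared_ptr<TabletInfo>> tablet_by_id;
+};
+```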
+
+All tservers start up with location information for the entire master Raft
+configuration, but are only responsible for heartbeating to the leader
+master. Prior to the first heartbeat, a tserver must determine which of the
+masters is the leader. After that, the tserver sends heartbeats to that
+master until it fails or steps down, at which point the tserver must
+determine the new leader master and the cycle repeats. Follower masters
+ignore all heartbeats.
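+
+In rough pseudo-C++, the tserver side of this protocol looks something like
+the sketch below. The function names are placeholders, not real Kudu APIs:
+
+```cpp
+// Sketch of the tserver heartbeat loop under the current design.
+#include <string>
+#include <vector>
+
+struct MasterAddr { std::string host; int port; };
+
+// Placeholder: ask each master who the leader is; returns its index.
+int FindLeaderMaster(const std::vector<MasterAddr>& masters);
+// Placeholder: send one heartbeat; returns false if the master is
+// unreachable or is no longer the leader (it failed or stepped down).
+bool HeartbeatTo(const MasterAddr& master);
+void SleepForHeartbeatInterval();
+
+void HeartbeatLoop(const std::vector<MasterAddr>& masters) {
+  int leader = FindLeaderMaster(masters);
+  while (true) {
+    if (!HeartbeatTo(masters[leader])) {
+      // The leader failed or stepped down; find the new one and retry.
+      leader = FindLeaderMaster(masters);
+      continue;
+    }
+    SleepForHeartbeatInterval();
+  }
+}
+```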
+
+The information communicated in a tserver's heartbeat varies. Normally the
+enclosed tablet report is "incremental" in that it only includes tablets
+that have undergone Raft configuration or role changes. In rarer
+circumstances the tablet report will be "full" and include every
+tablet. Importantly, incremental reports are "edge" triggered; that is,
+after a tablet is included in an incremental report **N**, it is omitted
+from incremental report **N+1**. Full tablet reports are sent when one of
+the following conditions is met:
+
+1. The master requests it in the previous heartbeat response, because the
+   master is seeing this tserver for the first time.
+2. The tserver has just started.
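+
+A minimal sketch of the edge-triggered behavior (illustrative only): a
+tablet is marked "dirty" when its Raft configuration or role changes, and
+the dirty set is cleared once those tablets have been reported.
+
+```cpp
+// Sketch of edge-triggered incremental tablet reports.
+#include <string>
+#include <unordered_set>
+#include <vector>
+
+class TabletReportBuilder {
+ public:
+  // Called when a tablet undergoes a Raft config or role change.
+  void MarkDirty(const std::string& tablet_id) {
+    dirty_tablets_.insert(tablet_id);
+  }
+
+  // Builds incremental report N and clears the dirty set, so a tablet
+  // included in report N is omitted from report N+1 unless it changes
+  // again. This is what makes incremental reports "edge" triggered.
+  std::vector<std::string> BuildIncrementalReport() {
+    std::vector<std::string> report(dirty_tablets_.begin(),
+                                    dirty_tablets_.end());
+    dirty_tablets_.clear();
+    return report;
+  }
+
+ private:
+  std::unordered_set<std::string> dirty_tablets_;
+};
+```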
+
+Clients behave much like tservers: they are configured a priori with the
+entire master Raft configuration, must always communicate with the leader
+master, and will retry their requests until the new leader master is found.
+
+### Known issues
+
+#### KUDU-1358: **CreateTable()** may fail following a master election
+
+One of the aforementioned in-memory caches keeps track of all known tservers
+and their "liveness" (i.e. how likely they are to be alive). This cache is
+*NOT* rebuilt using persistent master metadata; instead, it is updated
+whenever an unknown tserver heartbeats to the leader master.
+**CreateTable()** requests use this cache to determine whether a new table
+can be satisfied using the current cluster size; if not, the request is
+rejected.
+
+Right after a master election, the new leader master may have a cold tserver
+cache if it's never seen any heartbeats before. Until an entire heartbeat
+interval has elapsed, this cold cache may harm **CreateTable()** requests:
+
+1. If insufficient tservers are known to the master, the request will fail.
+2. If not all tservers are known to the master, the cluster may become
+   unbalanced as tablets pile up on a particular subset of tservers.
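+
+To make the first failure mode concrete, the check that suffers from a cold
+cache looks roughly like the sketch below (names are invented and the check
+is simplified):
+
+```cpp
+// Sketch of the CreateTable() precondition that depends on the in-memory
+// tserver cache.
+#include <cstddef>
+#include <string>
+
+struct CreateTableError { std::string message; };
+
+// num_live_tservers comes from the in-memory tserver cache, which a newly
+// elected leader only fills in as heartbeats arrive. Right after an
+// election it may report far fewer tservers than actually exist.
+bool CheckEnoughTabletServers(size_t num_live_tservers,
+                              int num_replicas,
+                              CreateTableError* error) {
+  if (num_live_tservers < static_cast<size_t>(num_replicas)) {
+    error->message = "not enough live tablet servers to host the "
+                     "requested number of replicas";
+    return false;  // the CreateTable() request is rejected
+  }
+  return true;
+}
+```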
+
+#### KUDU-1353: **AlterTable()** may get stuck
+
+Another in-memory cache that is *NOT* rebuilt using persistent master
+metadata is the per-tablet replica map. These maps are used to satisfy
+client **GetTableLocations()** and **GetTabletLocations()**
+requests. They're also used to determine the leader replica for various
+master to tserver RPC requests, such as in response to a client's
+**AlterTable()**. Each of these maps is updated during a tserver heartbeat
+using a tablet's latest Raft configuration.
+
+If a new leader master is elected and a tserver is already known to it
+(perhaps because the master had been leader before), every heartbeat it
+receives from that tserver will include an empty tablet report. This is
+problematic for the per-tablet replica maps, which will remain empty until
+tablets show up in a tablet report.
+
+Empty per-tablet replica maps in an otherwise healthy leader master can
+cause a variety of failures:
+
+1. **AlterTable()** requests may time out.
+2. **GetTableLocations()** and **GetTabletLocations()** requests may yield
+   no replica location information.
+3. **DeleteTable()** requests will succeed but won't delete any tablets
+   from tservers until those tablets find their way into tablet reports.
+
+#### KUDU-1374: Operations triggered by heartbeats may go unperformed
+
+As described earlier, the inclusion or exclusion of a tablet in an
+incremental tablet report is edge-triggered and may result in a
+state-changing operation on the tserver, communicated via out-of-band RPC.
+This RPC is retried until it is successful. However, if the leader master
+dies *after* it responds to the tserver's heartbeat but *before* the
+out-of-band RPC is sent, the edge-triggered tablet report may be missed, and
+the state-changing operation will not be performed until the next time the
+tablet is included in a tablet report. As the tablet report inclusion
+criteria are narrow, operations may be "missed" for quite some time.
+
+These operations include:
+
+1. Some tablet deletions, such as tablets belonging to orphaned tables, or
+   tablets whose deletion RPCs were sent and failed during an earlier
+   **DeleteTable()** request.
+2. Some tablet alters, such as tablets whose alter RPCs were sent and failed
+   during an earlier **AlterTable()** request.
+3. Config changes sent due to under-replicated tablets.
+
+#### KUDU-495: Masters may abort if replication fails
+
+Some master operations will crash the master when replication fails. That's
+because they were implemented with local consensus in mind, wherein a
+replication failure is indicative of a disk failure and recovery is
+unlikely. With multiple masters and Raft consensus, replication may fail if
+the current leader master is no longer the leader master (e.g. it was
+partitioned from the rest of its Raft configuration, which promptly elected
+a new leader), raising the likelihood of a master crash.
+
+### Unimplemented features
+
+#### Master Raft configuration changes
+
+It's not currently possible to add or remove a master from an active Raft
+configuration.
+
+It would be nice to implement this for Kudu 1.0, but it's not a strict
+requirement.
+
+#### KUDU-500: allow followers to handle read-only client operations
+
+Currently, followers reject all client operations, as the expectation is
+that clients communicate with the leader master. As a performance
+optimization, followers could handle certain "safe" operations (i.e.
+read-only requests), but follower masters must do a better job of keeping
+their in-memory caches up-to-date before this change is made. Moreover,
+operations should include an indication of how stale the follower master's
+information may be while still being acceptable to the client.
+
+It would be nice to implement this for Kudu 1.0, but it's not a strict
+requirement.
+
+## Gaps in the clients
+
+(TBD)
+
+TODO: JD says the code that detects the current master was partially removed
+from the Java client because it was buggy. Needs further investigation.
+
+## Goals
+
+1. The plan must address all of the aforementioned known issues.
+2. If possible, the plan should implement the missing features too.
+3. The plan's scope should be minimized, provided that doesn't complicate
+   future implementations. In other words, the plan should not force
+   backwards incompatibilities down the line.
+
+## Plan
+
+### Heartbeat to all masters
+
+This is probably the most effective way to address KUDU-1358, and, as a side
+benefit, helps implement the "verify cluster connectivity" feature. [This
+old design document](old-multi-master-heartbeating.md) describes KUDU-1358
+and its solutions in more detail.
+
+With this change, tservers no longer need to "follow the leader", as they
+will heartbeat to every master. However, a couple of things need to change
+for this to work correctly:
+
+#### Follower masters must process heartbeats, at least in part
+
+Heartbeats sent to follower masters are only intended to refresh the
+master's notion of the tserver's liveness; table and tablet information is
+still replicated from the leader master and should be ignored if found in
+the heartbeat.
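+
+A minimal sketch of what follower-side heartbeat handling might look like
+under this proposal (the types and names are hypothetical):
+
+```cpp
+// Sketch of follower-master heartbeat handling under this proposal.
+#include <cstdint>
+#include <string>
+
+struct TabletReport { /* tablet state changes */ };
+struct Heartbeat {
+  std::string tserver_uuid;
+  TabletReport report;
+};
+
+class TSLivenessCache {
+ public:
+  // Placeholder: records that this tserver was alive at time now_ms.
+  void RecordHeartbeat(const std::string& uuid, int64_t now_ms);
+};
+
+void HandleHeartbeatAsFollower(const Heartbeat& hb, int64_t now_ms,
+                               TSLivenessCache* cache) {
+  // Refresh this master's notion of the tserver's liveness...
+  cache->RecordHeartbeat(hb.tserver_uuid, now_ms);
+  // ...but ignore the tablet report: table and tablet state is still
+  // replicated from the leader master, not learned from heartbeats.
+}
+```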
+
+#### All state changes taken by a tserver must be fenced
+
+That is, the master and/or tserver must enforce that an action takes effect
+only if it was sent by the master that is currently the leader.
+
+After an exhaustive audit of all master state changes (see appendix A), it
+was determined that the current protection mechanisms built into each RPC
+are sufficient to provide fencing. The one exception is orphaned replica
+deletion done in response to a heartbeat. To protect against that, true
+orphans (i.e. tablets for which no persistent record exists) will not be
+deleted at all. As the master retains deleted table/tablet metadata in
+perpetuity, this should ensure that true orphans appear only under drastic
+circumstances, such as a tserver that heartbeats to the wrong cluster.
+
+The following protection mechanisms are described here for the historical
+record; they will not be implemented.
+
+##### Alternative fencing mechanisms
+
+One way to do this is to include the current term in every requested state
+change and heartbeat response. Each tserver maintains the current term in
+memory, reset whenever a heartbeat response or RPC includes a later term
+(thus also serving as a "there is a new leader master" notification). If a
+state change request includes an older term, it is rejected. When a tserver
+first starts up, it initializes the current term with whatever term is
+indicated in the majority of the heartbeat responses. In this way it
+can protect itself from a "rogue master" at startup without having to
+persist the current term to disk.
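+
+A rough sketch of that term check on the tserver side (purely illustrative;
+this alternative is not being implemented):
+
+```cpp
+// Sketch of term-based fencing on the tserver.
+// (Startup initialization from a majority of heartbeat responses is
+// omitted for brevity.)
+#include <cstdint>
+
+class MasterTermTracker {
+ public:
+  // Called for every heartbeat response or master RPC that carries a term.
+  // A later term also serves as a "there is a new leader master" signal.
+  void ObserveTerm(int64_t term) {
+    if (term > current_term_) {
+      current_term_ = term;
+    }
+  }
+
+  // A state change request carrying an older term must have come from a
+  // deposed leader master, so it is rejected.
+  bool ShouldAcceptStateChange(int64_t request_term) const {
+    return request_term >= current_term_;
+  }
+
+ private:
+  int64_t current_term_ = 0;
+};
+```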
+
+An alternative to the above fencing protocol is to ensure that the leader
+master replicates via Raft before triggering a state change. It doesn't
+matter what is replicated; a successful replication asserts that this master
+is still the leader. However, our Raft implementation doesn't currently
+allow for replicating no-ops (i.e. log entries that needn't be persisted).
+Moreover, this is effectively an implementation of "leader leases" (in that
+a successful replication grants the leader a "lease" to remain leader for at
+least one Raft replication interval), but one that the rest of Kudu must be
+made aware of in order to be fully robust.
+
+### Send full heartbeats to newly elected leader masters
+
+To address KUDU-1374, when a tserver passively detects that there's a new
+leader master (i.e. step #2 above), it should send it a full heartbeat. This
+will ensure that any heartbeat-triggered actions intended but not taken by
+the old leader master are reinitiated by the new one.
+
+### Ensure each logical operation is replicated as one batch
+
+Kudu doesn't yet support atomic multi-row transactions, but all row
+operations bound for one tablet and batched into one Write RPC are combined
+into one logical transaction. This property is useful for multi-master
+support as all master metadata is encapsulated into a single tablet. With
+some refactoring, it is possible to ensure that any logical operation
+(e.g. creating a table) is encapsulated into a single RPC. Doing so would
+obviate the need for "roll forward" repair of partially replicated
+operations during metadata load and is necessary to address KUDU-495.
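+
+A sketch of the batching idea follows; the types and function names are
+hypothetical (the real master replicates its metadata writes as a single
+WriteRequestPB, discussed below):
+
+```cpp
+// Sketch of batching a logical operation (here, a table creation) into a
+// single replicated write.
+#include <string>
+#include <vector>
+
+struct RowMutation { std::string row_key; std::string payload; };
+
+struct WriteBatch {
+  std::vector<RowMutation> mutations;
+};
+
+// Placeholder for replicating one batch through Raft; returns false if
+// replication fails (e.g. this master is no longer the leader).
+bool ReplicateViaRaft(const WriteBatch& batch);
+
+bool ReplicateCreateTable(const RowMutation& table_record,
+                          const std::vector<RowMutation>& tablet_records) {
+  WriteBatch batch;
+  batch.mutations.push_back(table_record);
+  batch.mutations.insert(batch.mutations.end(),
+                         tablet_records.begin(), tablet_records.end());
+  // One replication for the whole logical operation: it either commits
+  // in full or fails in full, so no "roll forward" repair is needed.
+  return ReplicateViaRaft(batch);
+}
+```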
+
+Some repair is still necessary for table-wide operations. These complete on
+a tablet-by-tablet basis, and thus partially created, altered, or deleted
+tables may exist at any point in time. However, the repair is a natural
+course of action taken by the leader master:
+
+1. During a tablet report: for example, if the master sees a tablet without
+   the latest schema, it'll send that tablet an idempotent alter RPC.
+   Further, thanks to the above full heartbeat change, the new leader master
+   will have an opportunity to roll forward such tables on master failover.
+2. In the background: the master will periodically scan its in-memory state
+   looking for tablets that have yet to be reported. If it finds one, it
+   will be given a "nudge"; the master will send a create tablet RPC to the
+   appropriate tserver. This scanning continues in a new leader master
+   following master failover.
+
+All batched logical operations include one table entry and *N* tablet
+entries, where *N* is the number of tablets in the table. These entries are
+encapsulated in a WriteRequestPB that is replicated by the leader master to
+follower masters. When *N* is very large, it is conceivable for the
+WriteRequestPB to exceed the maximum size of a Kudu RPC. To determine just
+how likely this is, replication RPC sizes were measured in the creation of a
+table with 1000 tablets and a simple three-column schema. The results: the
+replication RPC clocked in at ~117 KB, a far cry from the 8 MB maximum RPC
+size. Thus, a batch-based approach should not be unreasonable for today's
+scale targets.
+
+### Remove some unnecessary in-memory caches
+
+To fix KUDU-1353, the per-tablet replica locations could be removed entirely.
+The same information is already present in each tablet instance, just not in
+an easy-to-use map form. The only downside is that operations that previously
+used the cached locations would need to perform more lookups into the tserver
+map to resolve tserver UUIDs into instances. We think this is a reasonable
+trade-off, however, as the tserver map should be hot.
+
+An alternative is to rebuild the per-tablet replica locations on metadata
+load, but the outright removal of that cached data is a simpler solution.
+
+The tserver cache could also be rebuilt, but:
+
+1. It will be incomplete, as only the last known RPC address (and none of
+   the HTTP addresses) is persisted. Additionally, addressing KUDU-418 may
+   require the removal of the last known RPC address from the persisted
+   state, at which point there's nothing worth rebuilding.
+2. The tserver cache is expected to be warm from the moment any master
+   (including followers) starts up due to "Heartbeat to all masters" above.
+
+Note: in-memory caches belonging to a former leader master will, by
+definition, contain stale information. These caches could be
+cleared following an election, but it shouldn't matter either way as this
+master is no longer servicing client requests or tserver heartbeats.
+
+### Ensure strict ordering for all state changes
+
+Master state change operations should adhere to the following contract
+(a rough code sketch follows the list):
+
+1. Acquire locks on the relevant table and/or tablets. Which locks and
+   whether the locks are held for reading or writing depends on the
+   operation. For example, **DeleteTable()** acquires locks for writing on
+   the table and all of its tablets. Table locks must be acquired before
+   tablet locks.
+2. Mutate in-memory state belonging to the table and/or tablets. These
+   mutations are made via COW so that concurrent readers continue to see
+   only "committed" data.
+3. Replicate all of the mutations via Raft consensus in a single batch. If
+   the replication fails, the overall operation fails, the mutations are
+   discarded, and the locks are released.
+4. Replication has succeeded; the operation may not fail beyond this point.
+5. Commit the mutations. If both table and tablet mutations are committed,
+   tablet mutations must come first. Any other in-memory changes must be
+   performed now.
+6. The success of the operation is now consistent on this master in both
+   on-disk state as well as in-memory state.
+7. Send RPCs to the tservers (e.g. **DeleteTable()** will now send a delete
+   tablet RPC for each tablet in the table). The work done by an RPC must
+   eventually take place, either by retrying the RPC until it succeeds or
+   through some other mechanism, such as sending a new RPC in response to a
+   heartbeat.
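+
+A rough skeleton of an operation following this contract (all names are
+placeholders; error handling is reduced to bools for brevity):
+
+```cpp
+// Skeleton of the state-change contract described above.
+struct TableLock {};      // released when it goes out of scope
+struct TabletLocks {};    // released when they go out of scope
+struct PendingMutations {};
+
+// Placeholders for the individual steps in the contract.
+TableLock LockTableForWrite();
+TabletLocks LockTabletsForWrite();
+PendingMutations MutateCopyOnWrite();  // readers still see committed data
+bool ReplicateAsOneBatch(const PendingMutations& m);
+void CommitTabletMutations(const PendingMutations& m);
+void CommitTableMutation(const PendingMutations& m);
+void SendTServerRpcs();                // retried until the work is done
+
+bool RunStateChange() {
+  TableLock table_lock = LockTableForWrite();      // 1. table lock first...
+  TabletLocks tablet_locks = LockTabletsForWrite();//    ...then tablets
+  PendingMutations m = MutateCopyOnWrite();        // 2. COW in-memory state
+  if (!ReplicateAsOneBatch(m)) {
+    // 3. Replication failed: the mutations are discarded and the locks are
+    //    released as the local variables go out of scope.
+    return false;
+  }
+  // 4. Replication succeeded; the operation may not fail past this point.
+  CommitTabletMutations(m);                        // 5. tablets first...
+  CommitTableMutation(m);                          //    ...then the table
+  // 6. On-disk and in-memory state now agree on this master.
+  SendTServerRpcs();                               // 7. e.g. DeleteTablet()
+  return true;
+}
+```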
+
+This contract is generally upheld. However, a detailed audit (see appendix
+B) of the master revealed a few exceptions:
+
+1. During **CreateTable()**, intermediate table and tablet state is made
+   visible prior to replication.
+2. During **DeleteTable()**, RPCs are sent prior to committing the
+   mutations.
+3. When background scanning for newly created tablets, the logic that
+   "replaces" timed out tablets makes the replacements visible prior to
+   replication.
+
+To prevent clients from seeing intermediate state, and to avoid other
+potential issues, these operations must be made to adhere to the above
+contract.
+
+## Appendix A: Fencing audit and discussion
+
+To understand which master operations need to be fenced, we ask the following
+key questions:
+
+1. What decisions can a master make unilaterally (i.e. without first
+   replicating so as to establish consensus)?
+2. When making such decisions, does the master at least consult past replicated
+   state first? If it does, would "stale" state yield incorrect decisions?
+3. If it doesn't (or can't) consult replicated state, are the external actions
+   performed as a result of the decision safe?
+
+We identified the set of potentially problematic external actions as those taken
+by the master during tablet reports.
+
+We ruled out ChangeConfig; it is safe due to the use of CAS on the last change
+config opid (protects against two leader masters both trying to add a server),
+and because if the master somehow added a redundant server, in the worst case
+the new replica will be deleted the next time it heartbeats.
+
+That left DeleteReplica, which is called under the following circumstances:
+
+1. When the master can't find any record of the replica's tablet or its table,
+   it is deleted.
+2. When the persistent state says that the replica (or its table) has been
+   deleted, it is deleted.
+3. When the persistent state says that the replica is no longer part of the Raft
+   config, it is deleted.
+4. When the persistent state includes replicas that aren't in the latest Raft
+   config, they are deleted.
+
+Like ChangeConfig, cases 3 and 4 are protected with a CAS. Cases 1 and 2 are
+not, but 2 falls into category #2 from earlier: if persistent state is consulted
+and the decision is made to delete a replica, that decision is correct and
+cannot become incorrect (i.e. under no circumstance would a tablet become
+"undeleted").
+
+That leaves case 1 as the only instance that needs additional fencing. We could
+implement leader leases as described earlier or "current term" checking to
+protect against it.
+
+Or, we could 1) continue our current policy of retaining persistent state of
+deleted tables/tablets forever, and 2) change the master not to delete
+tablets for which it has no records. If we always have the persistent state
+for deleted tables, all instances of case 1 become case 2 unless there's
+some drastic problem (e.g. tservers are heartbeating to the wrong master),
+in which case not deleting the tablets is probably the right thing to do.
+
+## Appendix B: Master operation audit and analysis
+
+The following are detailed audits of how each master operation works today.
+The description of each operation is followed by a list of potential
+improvements, all of which were already incorporated into the above
+plan. These audits may be useful for understanding the plan.
+
+### Creating a new table
+
+#### CreateTable() RPC
+
+1. create table in state UNKNOWN, begin mutation to state PREPARING
+2. create tablets in state UNKNOWN, begin mutation to state PREPARING
+3. update in-memory maps (table by id/name, tablet by id)
+   - new table and tablets are now visible with UNKNOWN state and no
+     useful metadata (table schema, name, tablet partitions, etc.)
+4. replicate new tablets
+5. change table from state PREPARING to RUNNING
+6. replicate new table
+7. commit mutation for table
+   - new table and tablets are visible, table with full metadata in RUNNING
+     state but tablets still in UNKNOWN state without metadata
+8. commit mutations for tablets
+   - new table and tablets are visible with full metadata, table in
+     RUNNING state, tablets in PREPARING state (without consensus state)
+9. wake up bg_tasks thread
+
+Potential improvements:
+
+1. The two replications can be safely combined into one replication
+2. Consumers of table and tablet in-memory state must be prepared for
+   UNKNOWN intermediate state without metadata, as well as lack of consensus
+   state. This can be addressed by:
+   1. Adding a new global in-memory unordered_set to "lock" a table's name
+      while creating it. When the create is finished (regardless of
+      success), the name is unlocked. We could even reuse the tablet
+      LockManager for this, as it has no real tablet dependencies
+   2. Moving step #3 to after step #7
+
+#### Background task scanning
+
+1. for each tablet t in state PREPARING:
+   - change t from state PREPARING to CREATING
+2. for each tablet t_old in state CREATING:
+   - if t_old timed out:
+     1. create tablet t_new in state UNKNOWN, begin mutation to state PREPARING
+     2. add t_new to table's tablet_map
+        - t_new is now visible when operating on all of table's tablets, but
+          is in UNKNOWN state and has no useful metadata
+     3. update in-memory tablet_by_id map with t_new
+        - t_new is now visible to tablet_by_id lookups, still UNKNOWN and
+          no useful metadata
+     4. change t_old from state CREATING to REPLACED
+     5. change t_new from state PREPARING to CREATING
+3. for each tablet t in state CREATING:
+   - reset t committed_consensus_state:
+     1. reset term to min term
+     2. reset consensus type to local or distributed
+     3. reset index to invalid op index
+     4. select peers
+4. replicate new and existing tablets
+5. if error in replica selection or replication:
+   1. update each table's in-memory tablet_map to remove t_new tablets (step 2)
+      - t_new tablets still visible to tablet_by_id lookups
+   2. update tablet_by_id map to remove t_new tablets (step 2)
+      - t_new tablets no longer visible
+6. send DeleteTablet() RPCs for all t_old tablets (step 2)
+7. send CreateTablet() RPCs for all created tablets
+8. commit mutations for new and existing tablets
+   - replacement tablets from step #2 now have full metadata, all tablets now
+     have visible consensus state
+
+Potential improvements:
+
+1. All replication is already atomic; nothing to change here
+2. Steps 6 and 7 should probably be reordered after Step 8
+3. t_new can expose intermediate state to consumers, much like in
+   CreateTable(). Is this safe?
+   1. Remove step #5
+   2. After step #8 (but before RPCs), insert steps #2b and #2c
+
+### Deleting a table
+
+#### DeleteTable() RPC
+
+1. change table from state RUNNING (or ALTERING) to REMOVED
+2. replicate table
+3. commit mutation for table
+   - new state (REMOVED) is now visible
+4. for each tablet t:
+   1. Send DeleteTablet() RPC for t
+   2. change t from state RUNNING to DELETED
+   3. replicate t
+   4. commit mutation for t
+      - new state (DELETED) is now visible
+
+Potential improvements:
+
+1. Table and tablet replications can be safely combined, provided we invert
+   the commit order and make sure the rest of the master is OK with that
+2. DeleteTablet() RPC should come after tablet mutation is committed
+
+### Altering a table
+
+#### AlterTable() RPC
+
+1. if requested change to table name:
+   - change name
+2. if requested change to table schema:
+   1. reset fully_applied_schema with current schema
+   2. reset current schema
+3. increment schema version
+4. increment next column id
+5. change table from state RUNNING to ALTERING
+6. replicate table
+7. commit mutation to table
+   - new state (ALTERING), schema, etc. are now visible
+8. Send AlterTablet() RPCs for each tablet
+
+No potential improvements
+
+### Heartbeating
+
+#### Heartbeat for tablet t
+
+1. if t fails by_id lookup:
+   - send DeleteTablet() RPC for t
+2. if t's table does not exist:
+   - send DeleteTablet() RPC for t
+3. if t is REMOVED or REPLACED, or t's table is DELETED:
+   - send DeleteTablet() RPC for t
+4. if t is no longer in the list of peers:
+   - send DeleteTablet() RPC for t
+5. if t CREATING and has leader:
+   - change t from CREATING to RUNNING
+6. change consensus state
+   1. reset committed_consensus_state (if it exists in report)
+   2. update tablet replica locations
+      - new replica locations now immediately visible
+   3. if t is no longer in the list of peers:
+      - send DeleteTablet() RPC for t
+   4. if tablet is under-replicated:
+      - send ConfigChange() RPC for t
+7. update t
+8. commit mutation for t
+   - new state and committed consensus state are now visible
+9. if t's reported schema version isn't the latest:
+   - send AlterTablet() RPC for t
+10. else:
+   1. update tablet schema version
+      - new schema version is immediately visible
+   2. if table is in state ALTERING and all tablets have latest schema version:
+      1. clear fully_applied_schema
+      2. change table from state ALTERING to RUNNING
+      3. replicate table
+      4. commit mutation for table
+         - new state (RUNNING) and empty fully_applied_schema now visible
+
+No potential improvements. One replication per tablet (as written) is OK
+because:
+
+1. If replication fails but master still alive: tserver will retry same
+   kind of heartbeat, giving master a chance to replicate again
+2. If replication fails and master dies: with proposed "send full heartbeat
+   on new leader master" change, failed replications will be retried by new
+   leader master