Posted to commits@mesos.apache.org by ji...@apache.org on 2016/02/18 20:15:09 UTC

mesos git commit: Documented how the replicated log works.

Repository: mesos
Updated Branches:
  refs/heads/master 9a2c10d7d -> 454cdf42d


Documented how the replicated log works.

This is closely based on an (unpublished) blog post by Jie Yu.

Review: https://reviews.apache.org/r/43712


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/454cdf42
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/454cdf42
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/454cdf42

Branch: refs/heads/master
Commit: 454cdf42d6c9d3387391665e1f72594c48838911
Parents: 9a2c10d
Author: Neil Conway <ne...@gmail.com>
Authored: Fri Feb 5 14:14:37 2016 -0800
Committer: Jie Yu <yu...@gmail.com>
Committed: Thu Feb 18 11:14:37 2016 -0800

----------------------------------------------------------------------
 docs/configuration.md                     |   4 +-
 docs/high-availability-framework-guide.md |  10 +--
 docs/home.md                              |   1 +
 docs/images/log-architecture.png          | Bin 0 -> 21105 bytes
 docs/images/log-cluster.png               | Bin 0 -> 14415 bytes
 docs/maintenance.md                       |   4 +-
 docs/operational-guide.md                 |   2 +-
 docs/replicated-log-internals.md          | 117 +++++++++++++++++++++++++
 8 files changed, 128 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/configuration.md b/docs/configuration.md
index 801472c..b04e873 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -547,8 +547,8 @@ Currently there is no support for multiple HTTP authenticators. (default: basic)
     --[no-]log_auto_initialize
   </td>
   <td>
-Whether to automatically initialize the replicated log used for the
-registry. If this is set to false, the log has to be manually
+Whether to automatically initialize the [replicated log](replicated-log-internals.md)
+used for the registry. If this is set to false, the log has to be manually
 initialized when used for the very first time. (default: true)
   </td>
 </tr>

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/high-availability-framework-guide.md
----------------------------------------------------------------------
diff --git a/docs/high-availability-framework-guide.md b/docs/high-availability-framework-guide.md
index f21f95f..0d9c483 100644
--- a/docs/high-availability-framework-guide.md
+++ b/docs/high-availability-framework-guide.md
@@ -57,9 +57,9 @@ availability:
     accordingly.
 
   * Mesos actually provides ordered (but unreliable) message delivery between
-    any two pair of processes: for example, if a framework sends messages M1 and
-    M2 to the master, the master might receive no messages, just M1, just M2, or
-    M1 followed by M2 -- it will _not_ receive M2 followed by M1.
+    any pair of processes: for example, if a framework sends messages M1 and M2
+    to the master, the master might receive no messages, just M1, just M2, or M1
+    followed by M2 -- it will _not_ receive M2 followed by M1.
 
   * As a convenience for framework authors, Mesos provides reliable delivery of
     task status updates. The agent persists task status updates to disk and then
@@ -136,7 +136,7 @@ Highly available framework designs typically follow a few common patterns:
    and pending tasks. In fact, the same coordination service that is used for
    leader election (such as ZooKeeper or etcd) can often be used for this
    purpose. Some Mesos frameworks (such as Apache Aurora) use the Mesos
-   replicated log for this purpose.
+   [replicated log](replicated-log-internals.md) for this purpose.
 
    * The data store should be used to record the actions that the scheduler
      _intends_ to take, before it takes them. For example, if a scheduler
@@ -262,7 +262,7 @@ it from the cluster. Specifically:
   task that was running on a removed agent.
 
     >NOTE: Neither the callback nor the updates are reliably delivered by the
-    master. For example if the master or scheduler fails over or there is a
+    master. For example, if the master or scheduler fails over or there is a
     network connectivity issue during the delivery of these messages, they will
     not be resent.
 

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/home.md
----------------------------------------------------------------------
diff --git a/docs/home.md b/docs/home.md
index 982ad28..07214b9 100644
--- a/docs/home.md
+++ b/docs/home.md
@@ -45,6 +45,7 @@ layout: documentation
 * [Multiple Disks](multiple-disk.md) for how to allow tasks to use multiple isolated disk resources.
 * [Quota](quota.md) for how to configure Mesos to provide guaranteed resource allocations for use by a role.
 * [Reservation](reservation.md) for how operators and frameworks can reserve resources on individual agents for use by a role.
+* [Replicated Log](replicated-log-internals.md) for information on the Mesos replicated log.
 
 ## Running Mesos Frameworks
 

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/images/log-architecture.png
----------------------------------------------------------------------
diff --git a/docs/images/log-architecture.png b/docs/images/log-architecture.png
new file mode 100644
index 0000000..34c57f1
Binary files /dev/null and b/docs/images/log-architecture.png differ

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/images/log-cluster.png
----------------------------------------------------------------------
diff --git a/docs/images/log-cluster.png b/docs/images/log-cluster.png
new file mode 100644
index 0000000..62042d2
Binary files /dev/null and b/docs/images/log-cluster.png differ

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/maintenance.md
----------------------------------------------------------------------
diff --git a/docs/maintenance.md b/docs/maintenance.md
index e6bfe0f..4d24ec6 100644
--- a/docs/maintenance.md
+++ b/docs/maintenance.md
@@ -155,8 +155,8 @@ To cancel a maintenance schedule, the operator should post an empty schedule.
 
 As soon as a schedule is posted to the Mesos master, the following things occur:
 
-* The schedule is stored in the replicated log.  This means
-  the schedule is persisted in case of master failover.
+* The schedule is stored in the [replicated log](replicated-log-internals.md).
+  This means the schedule is persisted in case of master failover.
 * All machines in the schedule are immediately transitioned into Draining
   mode.  The mode of each machine is also persisted in the replicated log.
 * All frameworks using resources on affected agents are immediately

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/operational-guide.md
----------------------------------------------------------------------
diff --git a/docs/operational-guide.md b/docs/operational-guide.md
index 4680ee3..a4d6710 100644
--- a/docs/operational-guide.md
+++ b/docs/operational-guide.md
@@ -6,7 +6,7 @@ Mesos uses a "[fail-fast](https://en.wikipedia.org/wiki/Fail-fast)" approach to
 To ensure that such failures are handled appropriately, production deployments of Mesos typically use a _process supervisor_ (such as systemd or supervisord) to detect when Mesos processes exit. The supervisor can be configured to restart the failed process automatically and/or to notify the cluster operator to investigate the situation.
 
 ## Changing the master quorum
-The master leverages a Paxos-based replicated log as its storage backend (`--registry=replicated_log` is the only storage backend currently supported). Each master participates in the ensemble as a log replica. The `--quorum` flag determines a majority of the masters.
+The master leverages a [Paxos-based replicated log](replicated-log-internals.md) as its storage backend (`--registry=replicated_log` is the only storage backend currently supported). Each master participates in the ensemble as a log replica. The `--quorum` flag determines a majority of the masters.
 
 The following table shows the tolerance to master failures for each quorum size:
 

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/replicated-log-internals.md
----------------------------------------------------------------------
diff --git a/docs/replicated-log-internals.md b/docs/replicated-log-internals.md
new file mode 100644
index 0000000..4f379a3
--- /dev/null
+++ b/docs/replicated-log-internals.md
@@ -0,0 +1,117 @@
+---
+layout: documentation
+---
+
+# The Mesos Replicated Log
+
+Mesos provides a library that lets you create replicated fault-tolerant append-only logs; this library is known as the _replicated log_. The Mesos master uses this library to store cluster state in a replicated, durable way; the library is also available for use by frameworks to store replicated framework state or to implement the common "[replicated state machine](https://en.wikipedia.org/wiki/State_machine_replication)" pattern.
+
+## What is the replicated log?
+
+![Aurora and the Replicated Log](images/log-cluster.png)
+
+The replicated log provides _append-only_ storage of _log entries_; each log entry can contain arbitrary data. The log is _replicated_, which means that each log entry has multiple copies in the system. Replication provides both fault tolerance and high availability. In the following example, we use [Apache Aurora](https://aurora.apache.org/), a fault-tolerant scheduler (i.e., framework) running on top of Mesos, to show a typical replicated log setup.
+
+As shown above, there are multiple Aurora instances running simultaneously (for high availability), with one elected as the leader. There is a log replica on each host running Aurora. Aurora can access the replicated log through a thin library containing the log API.
+
+Typically, the leader is the only one that appends data to the log. Each log entry is replicated and sent to all replicas in the system. Replicas are strongly consistent; in other words, all replicas agree on the value of each log entry. Because the log is replicated, when Aurora decides to fail over, it does not need to copy the log from a remote host.
+
+
+## Use Cases
+
+The replicated log can be used to build a wide variety of distributed applications. For example, Aurora uses the replicated log to store all task states and job configurations. The Mesos master's _registry_ also leverages the replicated log to store information about all slaves in the cluster.
+
+The replicated log is often used to allow applications to manage replicated state in a strongly consistent way. One way to do this is to store a state-mutating operation in each log entry and have all instances of the distributed application agree on the same initial state (e.g., empty state). The replicated log ensures that each application instance will observe the same sequence of log entries in the same order; as long as applying a state-mutating operation is deterministic, this ensures that all application instances will remain consistent with one another. If any instance of the application crashes, it can reconstruct the current version of the replicated state by starting at the initial state and re-applying all the logged mutations in order.
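+
+To make the pattern concrete, here is a minimal C++ sketch (illustrative only; `Mutation` and `KVStore` are hypothetical types, not part of the Mesos API): every application instance starts from the same empty state and applies logged mutations deterministically, in log order.
+
+```cpp
+#include <map>
+#include <string>
+#include <vector>
+
+// Illustrative only: a deterministic key/value store whose state is
+// reconstructed by replaying logged mutations in order.
+struct Mutation {
+  enum class Op { SET, ERASE } op;
+  std::string key;
+  std::string value;  // Ignored for ERASE.
+};
+
+class KVStore {
+public:
+  // Applying a mutation is deterministic, so every instance that
+  // replays the same log prefix ends up in the same state.
+  void apply(const Mutation& m) {
+    if (m.op == Mutation::Op::SET) {
+      data_[m.key] = m.value;
+    } else {
+      data_.erase(m.key);
+    }
+  }
+
+  // Recover after a crash: start from the empty state and re-apply
+  // every learned log entry in log order.
+  void recover(const std::vector<Mutation>& log) {
+    data_.clear();
+    for (const Mutation& m : log) {
+      apply(m);
+    }
+  }
+
+private:
+  std::map<std::string, std::string> data_;
+};
+```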
+
+If the log grows too large, an application can write out a snapshot and then delete all the log entries that occurred before the snapshot. Using this approach, we will be exposing a [distributed state](https://github.com/apache/mesos/blob/master/src/state/state.hpp) abstraction in Mesos with the replicated log as a backend.
+
+Similarly, the replicated log can be used to build [replicated state machines](https://en.wikipedia.org/wiki/State_machine_replication). In this scenario, each log entry contains a state machine command. Since replicas are strongly consistent, all servers will execute the same commands in the same order.
+
+## Implementation
+
+![Replicated Log Architecture](images/log-architecture.png)
+
+The replicated log uses the [Paxos consensus algorithm](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29) to ensure that all replicas agree on every log entry’s value. It is similar to what’s described in [these slides](https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf). Readers who are familiar with Paxos can skip this section.
+
+The above figure is an implementation overview. When a user wants to append data to the log, the system creates a log writer. The log writer internally creates a coordinator. The coordinator contacts all replicas and executes the Paxos algorithm to make sure all replicas agree on the appended data. The coordinator is sometimes referred to as the [_proposer_](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29).
+
+Each replica keeps an array of log entries. The array index is the log position. Each log entry is composed of three components: the value written by the user, the associated Paxos state, and a _learned_ bit that is true once the log entry’s value has been agreed upon. Therefore, a replica in our implementation is both an [_acceptor_](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29) and a [_learner_](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29).
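+
+As a rough illustration (the field names are assumptions, not the actual Mesos representation), the per-entry state a replica keeps might look like this:
+
+```cpp
+#include <cstdint>
+#include <optional>
+#include <string>
+#include <vector>
+
+// Illustrative per-entry Paxos state kept by each replica. A replica
+// is both an acceptor (the promised/accepted fields) and a learner
+// (the learned bit).
+struct LogEntry {
+  uint64_t promised = 0;             // Highest proposal number promised.
+  uint64_t accepted = 0;             // Proposal number of the accepted write.
+  std::optional<std::string> value;  // Value written by the user, if any.
+  bool learned = false;              // True once the value is known agreed.
+};
+
+// A replica's log: the array index is the log position.
+struct Replica {
+  std::vector<LogEntry> entries;
+};
+```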
+
+### Reaching consensus for a single log entry
+
+A Paxos round can help all replicas reach consensus on a single log entry’s value. It has two phases: a promise phase and a write phase. Note that we are using slightly different terminology from the [original Paxos paper](https://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf). In our implementation, the _prepare_ and _accept_ phases in the original paper are referred to as the _promise_ and _write_ phases, respectively. Consequently, a prepare request (response) is referred to as a promise request (response), and an accept request (response) is referred to as a write request (response).
+
+To append value _X_ to the log at position _p_, the coordinator first broadcasts a promise request to all replicas with proposal number _n_, asking replicas to promise that they will not respond to any request (promise/write request) with a proposal number lower than _n_. We assume that _n_ is higher than any other previously used proposal number, and will explain how we do this later.
+
+When receiving a promise request, each replica checks its Paxos state to decide if it can safely respond to the request, depending on the promises it has previously given out. If the replica is able to give the promise (i.e., the request passes the proposal number check), it will first persist its promise (the proposal number _n_) on disk and then reply with a promise response. If the replica has previously been written to (i.e., it has accepted a write request), it needs to include the previously written value, along with the proposal number used in that write request, in the promise response it is about to send out.
+
+Upon receiving promise responses from a [quorum](https://en.wikipedia.org/wiki/Quorum_%28distributed_computing%29) of replicas, the coordinator first checks whether any of those responses contain a previously written value. The append operation cannot continue if a previously written value is found, because it is likely that a value has already been agreed on for that log entry. This is one of the key ideas in Paxos: restricting the value that can be written ensures consistency.
+
+If no previously written value is found, the coordinator broadcasts a write request to all replicas with value _X_ and proposal number _n_. On receiving the write request, each replica again checks the promise it has given, and replies with a write response if the write request’s proposal number is equal to or larger than the proposal number it has promised. Once the coordinator receives write responses from a quorum of replicas, the append operation succeeds.
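+
+The following sketch summarizes the coordinator side of a single round, assuming hypothetical `broadcastPromise` and `broadcastWrite` helpers that collect responses from a quorum (message transport and retries are elided):
+
+```cpp
+#include <cstdint>
+#include <optional>
+#include <string>
+#include <vector>
+
+// Illustrative promise response: if the replica had already accepted a
+// write for this position, it reports that value and its proposal number.
+struct PromiseResponse {
+  std::optional<std::string> previousValue;
+  uint64_t previousProposal = 0;
+};
+
+// Assumed helpers (not real Mesos APIs): broadcast to all replicas and
+// collect responses from a quorum, or nothing on timeout/rejection.
+std::optional<std::vector<PromiseResponse>> broadcastPromise(
+    uint64_t position, uint64_t n);
+bool broadcastWrite(uint64_t position, uint64_t n, const std::string& value);
+
+// Try to write `x` at `position` with proposal number `n`. Returns the
+// value actually agreed at that position on success.
+std::optional<std::string> paxosRound(
+    uint64_t position, uint64_t n, const std::string& x)
+{
+  // Promise phase: ask a quorum not to honor proposals below `n`.
+  auto responses = broadcastPromise(position, n);
+  if (!responses) {
+    return std::nullopt;  // No quorum; caller retries with a higher `n`.
+  }
+
+  // If any replica already accepted a value, Paxos requires proposing
+  // that value (the one with the highest proposal number) instead of
+  // `x`; the caller's append of `x` must then move to another position.
+  std::string value = x;
+  uint64_t highest = 0;
+  for (const PromiseResponse& r : *responses) {
+    if (r.previousValue && r.previousProposal >= highest) {
+      highest = r.previousProposal;
+      value = *r.previousValue;
+    }
+  }
+
+  // Write phase: the operation succeeds once a quorum accepts the write.
+  if (broadcastWrite(position, n, value)) {
+    return value;
+  }
+  return std::nullopt;
+}
+```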
+
+### Optimizing append latency using Multi-Paxos
+
+One naive solution to implement a replicated log is to run a full Paxos round (promise phase and write phase) for each log entry. As discussed in the [original Paxos paper](https://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf), if the leader is relatively stable, _Multi-Paxos_ can be used to eliminate the need for the promise phase for most of the append operations, resulting in improved performance.
+
+To do that, we introduce a new type of promise request called an _implicit_ promise request. An implicit promise request can be viewed as a _batched_ promise request for a (potentially infinite) set of log entries. Broadcasting an implicit promise request is conceptually equivalent to broadcasting a promise request for every log entry whose value has not yet been agreed. If the implicit promise request broadcasted by a coordinator gets accepted by a quorum of replicas, this coordinator is no longer required to run the promise phase if it wants to append to a log entry whose value has not yet been agreed because the promise phase has already been done in _batch_. The coordinator in this case is therefore called _elected_ (a.k.a., the leader), and has _exclusive_ access to the replicated log. An elected coordinator may be _demoted_ (or lose exclusive access) if another coordinator broadcasts an implicit promise request with a higher proposal number.
+
+One remaining question is how we can find out which log entries’ values have not yet been agreed upon. We have a very simple solution: if a replica accepts an implicit promise request, it will include its largest known log position in the response. An elected coordinator will only append log entries at positions larger than _p_, where _p_ is greater than any log position seen in these responses.
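+
+For example (illustrative only), the coordinator can derive its first safe append position from the largest positions reported in the implicit promise responses:
+
+```cpp
+#include <algorithm>
+#include <cstdint>
+#include <vector>
+
+// Each replica that accepts an implicit promise reports its largest
+// known log position. The elected coordinator then appends starting
+// strictly past every position seen in the quorum's responses.
+uint64_t nextAppendPosition(const std::vector<uint64_t>& largestKnownPositions)
+{
+  uint64_t p = 0;
+  for (uint64_t position : largestKnownPositions) {
+    p = std::max(p, position);
+  }
+  return p + 1;  // Strictly larger than any position seen.
+}
+```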
+
+Multi-Paxos has better performance if the leader is stable. The replicated log itself does not perform leader election. Instead, we rely on the user of the replicated log to choose a stable leader. For example, Aurora uses [ZooKeeper](https://zookeeper.apache.org/) to elect the leader.
+
+### Enabling local reads
+
+As discussed above, in our implementation, each replica is both an acceptor and a learner. Treating each replica as a learner allows us to do local reads without involving other replicas. When a log entry’s value has been agreed upon, the coordinator will broadcast a _learned_ message to all replicas. Once a replica receives the learned message, it will set the learned bit in the corresponding log entry, indicating that the value of that log entry has been agreed upon. We say a log entry is "learned" if its learned bit is set. The coordinator does not have to wait for the replicas’ acknowledgments.
+
+To perform a read, the log reader will directly look up the underlying local replica. If the corresponding log entry is learned, the reader can just return the value to the user. Otherwise, a full Paxos round is needed to discover the agreed value. We make sure that the replica co-located with the elected coordinator always has all log entries learned; we achieve that by running full Paxos rounds for any unlearned log entries after the coordinator is elected.
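+
+A hedged sketch of the read path, assuming hypothetical `lookupLocal` and `runFullPaxosRound` helpers:
+
+```cpp
+#include <cstdint>
+#include <optional>
+#include <string>
+
+// Illustrative: look up the local replica first; fall back to a full
+// Paxos round only when the entry has not been learned locally.
+struct LocalEntry {
+  std::optional<std::string> value;
+  bool learned = false;
+};
+
+// Assumed helpers (not real Mesos APIs).
+LocalEntry lookupLocal(uint64_t position);
+std::string runFullPaxosRound(uint64_t position);
+
+std::string read(uint64_t position)
+{
+  LocalEntry entry = lookupLocal(position);
+  if (entry.learned) {
+    // Fast path: a learned entry always carries its agreed value, so
+    // no network round trip is needed.
+    return *entry.value;
+  }
+  // Slow path: discover the agreed value via a full Paxos round.
+  return runFullPaxosRound(position);
+}
+```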
+
+### Reducing log size using garbage collection
+
+If the log grows too large, the application can choose to truncate it. To perform a truncation, we append a special log entry whose value is the log position to which the user wants to truncate the log. A replica can actually truncate the log once this special log entry has been learned.
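+
+A minimal sketch of the replica-side handling (illustrative types; the actual Mesos log entry format differs):
+
+```cpp
+#include <cstdint>
+#include <deque>
+#include <string>
+
+// Illustrative: a truncation is an ordinary appended log entry; once it
+// is learned, entries before the target position can be dropped locally
+// without any further coordination.
+struct Entry {
+  uint64_t position;
+  std::string value;
+};
+
+void onTruncateLearned(std::deque<Entry>& log, uint64_t truncateTo)
+{
+  while (!log.empty() && log.front().position < truncateTo) {
+    log.pop_front();  // Safe: the truncation itself has been agreed.
+  }
+}
+```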
+
+### Unique proposal number
+
+Many of the [Paxos research papers](https://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf) assume that each proposal number is globally unique, and a coordinator can always come up with a proposal number that is larger than any other proposal numbers in the system. However, implementing this is not trivial, especially in a distributed environment. [Some researchers suggest](https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf) concatenating a globally unique server id to each proposal number. But it is still not clear how to generate a globally unique id for each server.
+
+Our solution does not make the above assumptions. A coordinator can use an arbitrary proposal number initially. During the promise phase, if a replica knows a proposal number higher than the proposal number used by the coordinator, it will send the largest known proposal number back to the coordinator. The coordinator will retry the promise phase with a higher proposal number.
+
+To avoid livelock (e.g., when two coordinators are competing), we inject a random delay between T and 2T before each retry. T has to be chosen carefully: on one hand, we want T >> broadcast time so that one coordinator usually times out and wins before the others wake up; on the other hand, we want T to be as small as possible to reduce the wait time. Currently, we use T = 100ms. This idea is actually borrowed from [Raft](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
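+
+A minimal sketch of the retry backoff (the constant and helper name are illustrative):
+
+```cpp
+#include <chrono>
+#include <random>
+#include <thread>
+
+// Illustrative: before retrying the promise phase with a higher
+// proposal number, sleep for a random delay in [T, 2T] so competing
+// coordinators are unlikely to collide again (an idea borrowed from
+// Raft's randomized election timeouts).
+void backoffBeforeRetry()
+{
+  static thread_local std::mt19937 rng{std::random_device{}()};
+  constexpr int T = 100;  // Milliseconds; >> broadcast time, yet small.
+  std::uniform_int_distribution<int> delay(T, 2 * T);
+  std::this_thread::sleep_for(std::chrono::milliseconds(delay(rng)));
+}
+```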
+
+## Automatic replica recovery
+
+The algorithm described above has a critical vulnerability: if a replica loses its durable state (i.e., log files) due to either disk failure or operational error, that replica may cause inconsistency in the log if it is simply restarted and re-added to the group. The operator needs to stop the application on all hosts, copy the log files from the leader’s host, and then restart the application. Note that the operator cannot copy the log files from an arbitrary replica because copying an unlearned log entry may falsely assemble a quorum for an incorrect value, leading to inconsistency.
+
+To avoid the need for operator intervention in this situation, the Mesos replicated log includes support for _auto recovery_. As long as a quorum of replicas is working properly, the users of the application won’t notice any difference.
+
+### Non-voting replicas
+
+To enable auto recovery, a key insight is that a replica that loses its durable state should not be allowed to respond to requests from coordinators after restart. Otherwise, it may introduce inconsistency in the log as it could have accepted a promise/write request which it would not have accepted if its previous Paxos state had not been lost.
+
+To solve that, we introduce a new status variable for each replica. A normal replica is said to be in VOTING status, meaning that it is allowed to respond to requests from coordinators. A replica with no persisted state is put in EMPTY status by default. A replica in EMPTY status is not allowed to respond to any request from coordinators.
+
+A replica in EMPTY status will be promoted to VOTING status if the following two conditions are met:
+
+1. a sufficient number of missing log entries have been recovered such that, even if other replicas fail, the remaining replicas can recover all the learned log entries; and
+2. its future responses to a coordinator will not break any of the promises (potentially lost) it has given out.
+
+In the following, we discuss how we achieve these two conditions.
+
+### Catch-up
+
+To satisfy the above two conditions, a replica needs to perform _catch-up_ to recover its lost state. In other words, it will run Paxos rounds to find out those log entries whose values have already been agreed upon. The question is how many log entries the local replica should catch up on before the above two conditions can be satisfied.
+
+We found that it is sufficient to catch up on the log entries from position _begin_ to position _end_, where _begin_ is the smallest position seen in a quorum of VOTING replicas and _end_ is the largest position seen in a quorum of VOTING replicas.
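+
+A small sketch of the range computation, assuming each VOTING replica in a contacted quorum reports its smallest and largest known positions:
+
+```cpp
+#include <algorithm>
+#include <cstdint>
+#include <vector>
+
+// Illustrative: a recovering replica asks a quorum of VOTING replicas
+// for their log ranges, then catches up on [begin, end], where `begin`
+// is the smallest and `end` the largest position seen in the quorum.
+struct LogRange {
+  uint64_t begin;
+  uint64_t end;
+};
+
+LogRange catchUpRange(const std::vector<LogRange>& quorumResponses)
+{
+  LogRange range{UINT64_MAX, 0};
+  for (const LogRange& r : quorumResponses) {
+    range.begin = std::min(range.begin, r.begin);
+    range.end = std::max(range.end, r.end);
+  }
+  return range;
+}
+```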
+
+Here is our correctness argument. For a log entry at position _e_ where _e_ is larger than _end_, obviously no value has been agreed upon; otherwise, at least one VOTING replica in the quorum we contacted would have reported an end position larger than _end_. For the same reason, no coordinator could have collected enough promises for the log entry at position _e_. Therefore, it is safe for the recovering replica to respond to requests for that log entry. For a log entry at position _b_ where _b_ is smaller than _begin_, it must already have been truncated, and the truncation must already have been agreed upon. Therefore, allowing the recovering replica to respond to requests for that position is also safe.
+
+### Auto initialization
+
+Since we don’t allow an empty replica (a replica in EMPTY status) to respond to requests from coordinators, this raises a question for bootstrapping because, initially, every replica is empty. The replicated log provides two choices here. One choice is to use a tool (`mesos-log`) to explicitly initialize the log on each replica by setting the replica’s status to VOTING, but that requires an extra step when setting up an application.
+
+The other choice is to do automatic initialization. Our idea is: we allow a replica in EMPTY status to become VOTING immediately if it finds all replicas are in EMPTY status. This is based on the assumption that the only time _all_ replicas are in EMPTY status is during start-up. This may not be true if a catastrophic failure causes all replicas to lose their durable state, and that's exactly the reason we allow conservative users to disable auto-initialization.
+
+To do auto-initialization, if we used a single-phase protocol and allowed a replica to transition directly from EMPTY status to VOTING status, we might run into a state where we cannot make progress even if all replicas are in EMPTY status initially. For example, say the quorum size is 2 and all replicas are in EMPTY status initially. One replica will first set its status to VOTING because it finds all replicas are in EMPTY status. After that, neither the VOTING replica nor the EMPTY replicas can make progress. To solve this problem, we use a two-phase protocol and introduce an intermediate transient status (STARTING) between EMPTY and VOTING status. A replica in EMPTY status can transition to STARTING status if it finds all replicas are in either EMPTY or STARTING status. A replica in STARTING status can transition to VOTING status if it finds all replicas are in either STARTING or VOTING status. In this way, in our previous example, all replicas will be in STARTING status before any of them can transition to VOTING status.
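+
+A minimal sketch of this two-phase transition (status polling and races are elided; the function names are illustrative):
+
+```cpp
+#include <vector>
+
+// Illustrative two-phase bootstrap: a replica only advances when every
+// replica it can see is at its own status or one step further along,
+// so all replicas pass through STARTING before any reaches VOTING.
+enum class Status { EMPTY, STARTING, VOTING };
+
+bool allIn(const std::vector<Status>& replicas, Status a, Status b)
+{
+  for (Status s : replicas) {
+    if (s != a && s != b) {
+      return false;
+    }
+  }
+  return true;
+}
+
+Status step(Status self, const std::vector<Status>& replicas)
+{
+  if (self == Status::EMPTY &&
+      allIn(replicas, Status::EMPTY, Status::STARTING)) {
+    return Status::STARTING;
+  }
+  if (self == Status::STARTING &&
+      allIn(replicas, Status::STARTING, Status::VOTING)) {
+    return Status::VOTING;
+  }
+  return self;  // Otherwise, stay put and re-poll later.
+}
+```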
+
+## Future work
+
+Currently, the replicated log does not support dynamically changing the quorum size, also known as _reconfiguration_. Supporting reconfiguration would make it easier to add, move, or swap replica hosts. We plan to support reconfiguration in the future.