Posted to commits@mesos.apache.org by ya...@apache.org on 2014/04/07 19:56:46 UTC

git commit: Updated the "High Availability" doc to reflect the master detector refactoring in Mesos 0.16.0.

Repository: mesos
Updated Branches:
  refs/heads/master ca84d5f7c -> 7a3075a4b


Updated the "High Availability" doc to reflect the master detector refactoring in Mesos 0.16.0.

Review: https://reviews.apache.org/r/18155


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/7a3075a4
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/7a3075a4
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/7a3075a4

Branch: refs/heads/master
Commit: 7a3075a4b9dc203a55f98c275d86095588195098
Parents: ca84d5f
Author: Tom Galloway <tg...@twitter.com>
Authored: Fri Apr 4 12:00:21 2014 -0700
Committer: Jiang Yan Xu <ya...@jxu.me>
Committed: Mon Apr 7 10:55:50 2014 -0700

----------------------------------------------------------------------
 docs/high-availability.md | 67 ++++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 25 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/7a3075a4/docs/high-availability.md
----------------------------------------------------------------------
diff --git a/docs/high-availability.md b/docs/high-availability.md
index 77af5d5..d8a4dd3 100644
--- a/docs/high-availability.md
+++ b/docs/high-availability.md
@@ -2,49 +2,66 @@
 layout: documentation
 ---
 
-# High Availability
+# Mesos High Availability Mode
 
-Mesos can run in high-availability mode, in which multiple Mesos masters run simultaneously. In this mode there is only one active master, and the others masters act as stand-by replacements. These non-active masters will take over if the active master fails.
+Mesos has a high-availability mode that uses multiple Mesos masters: one active master (called the leader or leading master) and several backups in case it fails. The masters elect the leader, with [Apache ZooKeeper](http://zookeeper.apache.org/) both coordinating the election and handling leader detection by masters, slaves, and scheduler drivers. More information regarding [how leader election works](http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection) is available on the Apache ZooKeeper website.
 
-Mesos uses Apache ZooKeeper in to coordinate the leader election of masters.
+**Note**: This document assumes you know how to start, run, and work with ZooKeeper, whose client library is included in the standard Mesos build.
 
-Running mesos in this mode requires ZooKeeper. Mesos is automatically built with an included ZooKeeper client library.
+## Usage
+To put Mesos into high-availability mode:
 
-Leader election and detection in Mesos is done via ZooKeeper. See: http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection for more information how ZooKeeper leader election works.
+1. Ensure that the ZooKeeper cluster is up and running.
 
-## Use
-First, a ZooKeeper cluster must be running, and a znode should be created for exclusive use by Mesos:
+2. Provide the znode path to all masters, slaves, and framework schedulers as follows:
 
-* Ensure the ZooKeeper cluster is up and running.
-* Create a znode for exclusive use by mesos. The znode path will need to be provided to all slaves / scheduler drivers.
+    * Start the mesos-master binaries using the `--zk` flag, e.g. `--zk=zk://host1:port1/path,host2:port2/path,...`
 
-In order to spin up a Mesos cluster using multiple masters for fault-tolerance:
+    * Start the mesos-slave binaries with `--master=zk://host1:port1/path,host2:port2/path,...`
 
-* Start the mesos-master binaries using the `--zk` flag, e.g. `--zk=zk://host1:port1/path,host2:port2/path,...`
-* Start the mesos-slave binaries with `--master=zk://host1:port1/path,host2:port2/path,...`
-* Start any framework schedulers using the same zk path, the SchedulerDriver must be constructed with this path.
+    * Start any framework schedulers using the same `zk` path as in the last two steps. The SchedulerDriver must be constructed with this path, as shown in the [Framework Development Guide](http://mesos.apache.org/documentation/latest/app-framework-development-guide/); a minimal construction sketch follows this list.
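+
+For illustration, here is a minimal sketch of constructing a C++ driver against such a path. This is not taken from the Mesos source: `MyScheduler` merely stubs out the `mesos::Scheduler` callbacks, and the hosts and path in the `zk://` string are placeholders.
+
+```cpp
+#include <string>
+#include <vector>
+
+#include <mesos/scheduler.hpp>
+
+// A do-nothing scheduler: every mesos::Scheduler callback is stubbed out
+// so that the example stays short.
+class MyScheduler : public mesos::Scheduler
+{
+public:
+  virtual void registered(mesos::SchedulerDriver*,
+                          const mesos::FrameworkID&,
+                          const mesos::MasterInfo&) {}
+  virtual void reregistered(mesos::SchedulerDriver*,
+                            const mesos::MasterInfo&) {}
+  virtual void disconnected(mesos::SchedulerDriver*) {}
+  virtual void resourceOffers(mesos::SchedulerDriver*,
+                              const std::vector<mesos::Offer>&) {}
+  virtual void offerRescinded(mesos::SchedulerDriver*, const mesos::OfferID&) {}
+  virtual void statusUpdate(mesos::SchedulerDriver*, const mesos::TaskStatus&) {}
+  virtual void frameworkMessage(mesos::SchedulerDriver*,
+                                const mesos::ExecutorID&,
+                                const mesos::SlaveID&,
+                                const std::string&) {}
+  virtual void slaveLost(mesos::SchedulerDriver*, const mesos::SlaveID&) {}
+  virtual void executorLost(mesos::SchedulerDriver*,
+                            const mesos::ExecutorID&,
+                            const mesos::SlaveID&,
+                            int) {}
+  virtual void error(mesos::SchedulerDriver*, const std::string&) {}
+};
+
+int main()
+{
+  MyScheduler scheduler;
+
+  mesos::FrameworkInfo framework;
+  framework.set_user("");  // Have Mesos fill in the current user.
+  framework.set_name("example-ha-framework");
+
+  // Pass the same zk:// path the masters and slaves were started with so
+  // that the driver can detect the leading master via ZooKeeper.
+  mesos::MesosSchedulerDriver driver(
+      &scheduler, framework, "zk://host1:port1/path,host2:port2/path");
+
+  return driver.run() == mesos::DRIVER_STOPPED ? 0 : 1;
+}
+```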
 
-Refer to the Scheduler API for how to deal with leadership changes.
+From now on, the Mesos masters and slaves all communicate with ZooKeeper to learn which master is the current leader. This is in addition to the usual communication between the leading master and the slaves.
 
-## Semantics
-The detector is implemented in the `src/detector` folder. In particular, we watch for several ZooKeeper session events:
+Refer to the [Scheduler API](http://mesos.apache.org/documentation/latest/app-framework-development-guide/) for how to deal with leadership changes.
+
+## Implementation Details
+Mesos implements two levels of ZooKeeper leader election abstractions, one in `src/zookeeper` and the other in `src/master` (look for `contender|detector.hpp|cpp`).
+
+* The lower level `LeaderContender` and `LeaderDetector` implement a generic ZooKeeper election algorithm, loosely modeled after this
+[recipe](http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection) but without herd-effect handling, since the master group is small (often 3). A sketch of the algorithm follows this list.
+
+* The higher level `MasterContender` and `MasterDetector` wrap around the lower-level ZooKeeper contender and detector abstractions as adapters that provide and interpret the master-specific ZooKeeper data.
+
+* Each Mesos master simultaneously uses both a contender and a detector: the contender to try to get itself elected and the detector to learn which master currently leads. A separate detector is necessary because a non-leading master's WebUI redirects browser traffic to the current leader. Other Mesos components (i.e. slaves and scheduler drivers) use the detector to find the current leader and connect to it.
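+
+The sketch below illustrates the essence of that election algorithm against the bare ZooKeeper C API. It is an illustration only, not Mesos source code (Mesos wraps these calls in the abstractions above); it assumes an established handle `zh` from `zookeeper_init()` and a pre-existing group znode, here a placeholder `/mesos`.
+
+```cpp
+#include <stdio.h>
+#include <string.h>
+
+#include <zookeeper/zookeeper.h>
+
+// Contend for leadership: create an ephemeral sequential znode under the
+// group znode. ZooKeeper appends a zero-padded, strictly increasing
+// sequence number to the name and deletes the znode if our session dies.
+int contend(zhandle_t* zh, char* created, int length)
+{
+  return zoo_create(zh, "/mesos/member_", "", 0,
+                    &ZOO_OPEN_ACL_UNSAFE,
+                    ZOO_EPHEMERAL | ZOO_SEQUENCE,
+                    created, length);
+}
+
+// Detect the leader: the member whose znode carries the smallest sequence
+// number leads. Without herd-effect handling, every member watches the
+// group znode and simply re-lists the children on each change.
+void detect(zhandle_t* zh)
+{
+  struct String_vector children;
+  if (zoo_get_children(zh, "/mesos", 1 /* (re)set watch */, &children) != ZOK) {
+    return;
+  }
+
+  const char* leader = NULL;
+  for (int i = 0; i < children.count; i++) {
+    // The zero-padded sequence suffixes make strcmp() order correctly.
+    if (leader == NULL || strcmp(children.data[i], leader) < 0) {
+      leader = children.data[i];
+    }
+  }
+
+  if (leader != NULL) {
+    printf("Current leader znode: /mesos/%s\n", leader);
+  }
+
+  deallocate_String_vector(&children);
+}
+```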
+
+The notion of the group of leader candidates is implemented in `Group`. This abstraction reliably handles ZooKeeper group membership registration, cancellation, and monitoring (internally queuing operations and retrying retryable errors). It watches for several ZooKeeper session events, illustrated in the sketch after this list:
 
 * Connection
 * Reconnection
 * Session Expiration
 * ZNode creation, deletion, updates
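+
+For illustration, this is roughly how those events surface through the ZooKeeper C client's single watcher callback (again a sketch, not Mesos code):
+
+```cpp
+#include <stdio.h>
+
+#include <zookeeper/zookeeper.h>
+
+// The ZooKeeper C client funnels all of the events above through one
+// watcher callback, distinguished by `type` and `state`.
+void watcher(zhandle_t* zh, int type, int state, const char* path, void* ctx)
+{
+  if (type == ZOO_SESSION_EVENT) {
+    if (state == ZOO_CONNECTED_STATE) {
+      printf("Connected (or reconnected) to ZooKeeper\n");
+    } else if (state == ZOO_EXPIRED_SESSION_STATE) {
+      // Only delivered after reconnecting, which is why Mesos also
+      // applies its own explicit timeouts, described below.
+      printf("Session expired\n");
+    }
+  } else if (type == ZOO_CREATED_EVENT || type == ZOO_DELETED_EVENT ||
+             type == ZOO_CHANGED_EVENT || type == ZOO_CHILD_EVENT) {
+    printf("ZNode event on %s\n", path);
+  }
+}
+```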
 
-We also explicitly timeout our sessions, when disconnected from ZooKeeper for an amount of time, see: `ZOOKEEPER_SESSION_TIMEOUT`. This is because the ZooKeeper client libraries only notify of session expiration upon reconnection. These timeouts are of particular interest in the case of network partitions.
+We also explicitly time out our sessions when disconnected from ZooKeeper for a specified amount of time; see `MASTER_CONTENDER_ZK_SESSION_TIMEOUT` and `MASTER_DETECTOR_ZK_SESSION_TIMEOUT`. This is necessary because the ZooKeeper client libraries only notify of session expiration upon reconnection. These timeouts are of particular interest for network partitions.
+
+## Component Disconnection Handling
+
+When a network partition disconnects a component (master, slave, scheduler driver) from ZooKeeper, the component's Master Detector induces a timeout event. This notifies the component that it has no leading master. Depending on the component, the following happens. (Note that while a component is disconnected from ZooKeeper, a master may still be in communication with slaves or schedulers and vice versa.)
+
+* Slaves disconnected from ZooKeeper no longer know which master is the leader. They ignore messages from masters to ensure they don't act on a non-leader's decisions. When a slave reconnects to ZooKeeper, ZooKeeper informs it of the current leader and the slave stops ignoring messages from the leader.
+
+* Masters enter a leaderless state, irrespective of whether they were the leader before the disconnection.
+
+    * If the leading master is disconnected from ZooKeeper, it aborts its process. The user/developer/administrator can then start a new master instance that connects to ZooKeeper and starts as a backup.
+
+    * Otherwise, the disconnected backup waits to reconnect with ZooKeeper and possibly get elected as the new leading master.
 
-## Network Partitions
-When a network partition occurs, if a particular component is disconnected from ZooKeeper, the Master Detector of the partitioned component will induce a timeout event. This causes the component to be notified that there is no leading master.
+* Scheduler drivers that are disconnected from the leading master notify the scheduler of the disconnection; the callback sketch below shows how a framework might react.
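+
+The fragment below builds on the hypothetical `MyScheduler` stub from the usage sketch above (it is not Mesos source) and overrides the two driver callbacks relevant to leader changes.
+
+```cpp
+#include <iostream>
+
+#include <mesos/scheduler.hpp>
+
+class LeaderAwareScheduler : public MyScheduler
+{
+public:
+  // Called when the driver loses connection with the leading master,
+  // e.g. due to a master failover or a network partition.
+  virtual void disconnected(mesos::SchedulerDriver* driver)
+  {
+    std::cout << "Disconnected from the leading master" << std::endl;
+  }
+
+  // Called when the driver re-registers with a newly elected leader.
+  virtual void reregistered(mesos::SchedulerDriver* driver,
+                            const mesos::MasterInfo& masterInfo)
+  {
+    std::cout << "Re-registered with a new leading master" << std::endl;
+  }
+};
+```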
 
-When slaves are master-less, they ignore incoming messages from masters to ensure that we don't act on a non-leading master's decision.
+When a network partition disconnects a slave from the leader:
 
-When masters enter a leader-less state, they commit suicide.
+* The slave fails health checks from the leader.
 
-The following semantics are enforced:
+* The leader marks the slave as deactivated and sends its tasks to the LOST state. The [Framework Development Guide](http://mesos.apache.org/documentation/latest/app-framework-development-guide/) describes these task states; a fragment showing how a framework might handle LOST updates follows this list.
 
-* If a slave is partitioned from the master, it will fail health-checks. The master will mark the slave as deactivated and send its tasks to LOST.
-* Deactivated slaves may not re-register with the master, and are instructed to shut down upon any further communication after deactivation.
-* When a slave is partitioned from ZooKeeper, it will enter a master-less state. It will ignore any master messages until a new master is elected.
\ No newline at end of file
+* Deactivated slaves may not re-register with the leader, and are told to shut down upon any post-deactivation communication.
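+
+For illustration, a framework might react to such LOST tasks in its status update callback; this is a hypothetical fragment, not Mesos source:
+
+```cpp
+#include <iostream>
+
+#include <mesos/scheduler.hpp>
+
+// Invoked from a scheduler's statusUpdate() callback: decide what to do
+// with a task the leader declared LOST after its slave was deactivated.
+void handleUpdate(const mesos::TaskStatus& status)
+{
+  if (status.state() == mesos::TASK_LOST) {
+    // The framework decides whether and where to relaunch the task.
+    std::cout << "Task " << status.task_id().value() << " was lost"
+              << std::endl;
+  }
+}
+```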