Posted to commits@mesos.apache.org by vi...@apache.org on 2016/03/08 01:48:18 UTC

[1/4] mesos git commit: Revised comments about `link` semantics in libprocess.

Repository: mesos
Updated Branches:
  refs/heads/master b770eecfc -> 3da0a2c13


Revised comments about `link` semantics in libprocess.

Review: https://reviews.apache.org/r/44476/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/ff8dafe1
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/ff8dafe1
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/ff8dafe1

Branch: refs/heads/master
Commit: ff8dafe1ac479a0b9454b72ec126824d0b97e5fa
Parents: b770eec
Author: Neil Conway <ne...@gmail.com>
Authored: Mon Mar 7 19:47:46 2016 -0500
Committer: Vinod Kone <vi...@gmail.com>
Committed: Mon Mar 7 19:47:46 2016 -0500

----------------------------------------------------------------------
 3rdparty/libprocess/include/process/process.hpp | 24 ++++++++++++++++----
 1 file changed, 19 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/ff8dafe1/3rdparty/libprocess/include/process/process.hpp
----------------------------------------------------------------------
diff --git a/3rdparty/libprocess/include/process/process.hpp b/3rdparty/libprocess/include/process/process.hpp
index c9ef4e8..45ccf9d 100644
--- a/3rdparty/libprocess/include/process/process.hpp
+++ b/3rdparty/libprocess/include/process/process.hpp
@@ -105,6 +105,15 @@ protected:
   /**
    * Invoked when a linked process has exited.
    *
+   * For local linked processes (i.e., when the linker and linkee are
+   * part of the same OS process), this can be used to reliably detect
+   * when the linked process has exited.
+   *
+   * For remote linked processes, this indicates that the persistent
+   * TCP connection between the linker and the linkee has failed
+   * (e.g., the linkee process died or a network error occurred). In this
+   * situation, the remote linkee process might still be running.
+   *
    * @see process::ProcessBase::link
    */
   virtual void exited(const UPID&) {}
@@ -112,6 +121,8 @@ protected:
   /**
    * Invoked when a linked process can no longer be monitored.
    *
+   * TODO(neilc): This is not implemented.
+   *
    * @see process::ProcessBase::link
    */
   virtual void lost(const UPID&) {}
@@ -141,11 +152,14 @@ protected:
   /**
    * Links with the specified `UPID`.
    *
-   * Linking with a process from within the same "operating system
-   * process" is guaranteed to give you perfect monitoring of that
-   * process. However, linking with a process on another machine might
-   * result in receiving lost callbacks due to the nature of a distributed
-   * environment.
+   * Linking with a process from within the same OS process is
+   * guaranteed to give you perfect monitoring of that process.
+   *
+   * Linking to a remote process establishes a persistent TCP
+   * connection to the remote libprocess instance that hosts that
+   * process. If the TCP connection fails, the true state of the
+   * remote linked process cannot be determined; we handle this
+   * situation by generating an ExitedEvent.
    */
   UPID link(const UPID& pid);
 

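A minimal sketch of how the semantics documented above look from user code (not part of the commit; the `RemoteWatcher` process and its log message are hypothetical, while `link()` and `exited()` are the libprocess hooks described in the diff):

```cpp
#include <iostream>

#include <process/pid.hpp>
#include <process/process.hpp>

using process::Process;
using process::UPID;

// Hypothetical process that links to a (possibly remote) UPID and reacts
// when libprocess generates an ExitedEvent for it.
class RemoteWatcher : public Process<RemoteWatcher>
{
public:
  explicit RemoteWatcher(const UPID& target) : target_(target) {}

protected:
  void initialize() override
  {
    // For a remote target this establishes (or reuses) the persistent TCP
    // connection to the libprocess instance hosting it.
    link(target_);
  }

  void exited(const UPID& pid) override
  {
    // Local linkee: the process has definitely terminated.
    // Remote linkee: only the TCP connection is known to have failed;
    // the remote process might still be running.
    std::cout << "Linked process " << pid << " exited or became unreachable"
              << std::endl;
  }

private:
  const UPID target_;
};
```

Spawning the watcher with `process::spawn(new RemoteWatcher(pid), true)` is enough to start monitoring `pid`; the difference between local and remote targets only shows up in how much `exited()` actually tells you.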

[3/4] mesos git commit: Revised slave recovery documentation.

Posted by vi...@apache.org.
Revised slave recovery documentation.

Review: https://reviews.apache.org/r/44478/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/380e5708
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/380e5708
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/380e5708

Branch: refs/heads/master
Commit: 380e5708a8bfdfa6c640067d714a233482d2d75e
Parents: 94d2ee3
Author: Neil Conway <ne...@gmail.com>
Authored: Mon Mar 7 19:47:57 2016 -0500
Committer: Vinod Kone <vi...@gmail.com>
Committed: Mon Mar 7 19:47:57 2016 -0500

----------------------------------------------------------------------
 docs/home.md           |  2 +-
 docs/slave-recovery.md | 74 +++++++++++++++++++++------------------------
 2 files changed, 35 insertions(+), 41 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/380e5708/docs/home.md
----------------------------------------------------------------------
diff --git a/docs/home.md b/docs/home.md
index 821026a..0b2d2f9 100644
--- a/docs/home.md
+++ b/docs/home.md
@@ -30,7 +30,7 @@ layout: documentation
 * [Operational Guide](operational-guide.md)
 * [Monitoring](monitoring.md)
 * [Network Monitoring and Isolation](network-monitoring.md)
-* [Slave Recovery](slave-recovery.md) for doing seamless upgrades.
+* [Slave Recovery](slave-recovery.md) for doing seamless slave upgrades and allowing executors to survive `mesos-slave` crashes.
 * [Maintenance](maintenance.md) for performing maintenance on a Mesos cluster.
 * [Tools](tools.md) for setting up and running a Mesos cluster.
 * [SSL](ssl.md) for enabling and enforcing SSL communication.

http://git-wip-us.apache.org/repos/asf/mesos/blob/380e5708/docs/slave-recovery.md
----------------------------------------------------------------------
diff --git a/docs/slave-recovery.md b/docs/slave-recovery.md
index 5c148e5..ff584f0 100644
--- a/docs/slave-recovery.md
+++ b/docs/slave-recovery.md
@@ -5,61 +5,60 @@ layout: documentation
 
 # Slave Recovery
 
-Slave recovery is a feature of Mesos that allows:
+If the `mesos-slave` process on a host exits (perhaps due to a Mesos bug or
+because the operator kills the process while [upgrading Mesos](upgrades.md)),
+any executors/tasks that were being managed by the `mesos-slave` process will
+continue to run. When `mesos-slave` is restarted, the operator can control how
+those old executors/tasks are handled:
 
- 1. Executors/tasks to keep running when the slave process is down and
- 2. Allows a restarted slave process to reconnect with running executors/tasks on the slave.
+ 1. By default, all the executors/tasks that were being managed by the old
+    `mesos-slave` process are killed.
+ 2. If a framework enabled _checkpointing_ when it registered with the master,
+    any executors belonging to that framework can reconnect to the new
+    `mesos-slave` process and continue running uninterrupted.
 
-Mesos slave could be restarted for an upgrade or due to a crash. This feature is introduced in ***0.14.0*** release.
+Hence, enabling framework checkpointing enables tasks to tolerate Mesos slave
+upgrades and unexpected `mesos-slave` crashes without experiencing any
+downtime.
 
-## How does it work?
+Slave recovery works by having the slave _checkpoint_ information (e.g., Task
+Info, Executor Info, Status Updates) about the tasks and executors it is
+managing to local disk. If a framework enables checkpointing, any subsequent
+slave restarts will recover the checkpointed information and reconnect with any
+executors that are still running.
 
-Slave recovery works by having the slave checkpoint enough information (e.g., Task Info, Executor Info, Status Updates) about the running tasks and executors to local disk. Once a framework enables checkpointing, any subsequent slave restarts would recover the checkpointed information and reconnect with the executors. Note that if the host running the slave process is rebooted all the executors/tasks are killed.
+Note that if the operating system on the slave is rebooted, all executors and
+tasks running on the host are killed and are not automatically restarted when
+the host comes back up.
 
-> NOTE: To enable recovery the framework should explicitly request checkpointing.
-> Alternatively, a framework that doesn't want the disk i/o overhead of checkpointing can opt out of checkpointing.
+## Framework Configuration
 
+A framework can control whether its executors will be recovered by setting the `checkpoint` flag in its `FrameworkInfo` when registering with the master. Enabling this feature results in increased I/O overhead at each slave that runs tasks launched by the framework. By default, frameworks do **not** checkpoint their state.
 
-## Enabling slave checkpointing
-> NOTE: From Mesos 0.22.0 slave checkpointing will be automatically enabled for all slaves.
+## Slave Configuration
 
-As part of this feature, 4 new flags were added to the slave.
+Three [configuration flags](configuration.md) control the recovery behavior of a Mesos slave:
 
-* `checkpoint` :  Whether to checkpoint slave and frameworks information
-                  to disk [Default: true].
-    - This enables a restarted slave to recover status updates and reconnect
-      with (--recover=reconnect) or kill (--recover=cleanup) old executors.
-    > NOTE: From Mesos 0.22.0 this flag will be removed as it will be enabled for all slaves.
-
-* `strict` : Whether to do recovery in strict mode [Default: true].
-    - If strict=true, any and all recovery errors are considered fatal.
+* `strict`: Whether to do slave recovery in strict mode [Default: true].
+    - If strict=true, all recovery errors are considered fatal.
     - If strict=false, any errors (e.g., corruption in checkpointed data) during recovery are
       ignored and as much state as possible is recovered.
 
-* `recover` : Whether to recover status updates and reconnect with old executors [Default: reconnect].
-    - If recover=reconnect, Reconnect with any old live executors.
-    - If recover=cleanup, Kill any old live executors and exit.
-      Use this option when doing an incompatible slave or executor upgrade!).
+* `recover`: Whether to recover status updates and reconnect with old executors [Default: reconnect].
+    - If recover=reconnect, reconnect with any old live executors, provided the executor's framework enabled checkpointing.
+    - If recover=cleanup, kill any old live executors and exit. Use this option when doing an incompatible slave or executor upgrade!
     > NOTE: If no checkpointing information exists, no recovery is performed
     > and the slave registers with the master as a new slave.
 
-* `recovery_timeout` : Amount of time allotted for the slave to recover [Default: 15 mins].
+* `recovery_timeout`: Amount of time allotted for the slave to recover [Default: 15 mins].
     - If the slave takes longer than `recovery_timeout` to recover, any executors that are waiting to
       reconnect to the slave will self-terminate.
-    > NOTE: This flag is only applicable when `--checkpoint` is enabled.
 
 > NOTE: If none of the frameworks have enabled checkpointing,
-> executors/tasks of frameworks die when the slave dies and are not recovered.
-
-A restarted slave should re-register with master within a timeout (currently, 75s). If the slave takes longer
-than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks.
-Therefore, it is highly recommended to automate the process of restarting a slave (e.g, using [monit](http://mmonit.com/monit/)).
-
-**For the complete list of slave options: ./mesos-slave.sh --help**
-
-## Enabling framework checkpointing
+> the executors and tasks running at a slave die when the slave dies
+> and are not recovered.
 
-As part of this feature, `FrameworkInfo` has been updated to include an optional `checkpoint` field. A framework that would like to opt in to checkpointing should set `FrameworkInfo.checkpoint=True` before registering with the master.
+A restarted slave should re-register with the master within a timeout (75 seconds by default: see the `--max_slave_ping_timeouts` and `--slave_ping_timeout` [configuration flags](configuration.md)). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn will shut down any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting a slave (e.g., using a process supervisor such as [monit](http://mmonit.com/monit/) or `systemd`).
 
 ## Known issues with `systemd` and POSIX isolation
 
@@ -74,8 +73,3 @@ KillMode=process
 ```
 
 > NOTE: There are also known issues with using `systemd` and raw `cgroups` based isolation, for now the suggested non-Posix isolation mechanism is to use Docker containerization.
-
-
-## Upgrading to 0.14.0
-
-If you want to upgrade a running Mesos cluster to 0.14.0 to take advantage of slave recovery please follow the [upgrade instructions](upgrades.md).

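A minimal sketch of the framework side of the `checkpoint` flag described in the revised doc above (not from the commit; the framework name is made up). The resulting `FrameworkInfo` would normally be passed to the `MesosSchedulerDriver` constructor when registering with the master:

```cpp
#include <iostream>

#include <mesos/mesos.hpp>

int main()
{
  mesos::FrameworkInfo framework;
  framework.set_user("");                    // Let Mesos fill in the current user.
  framework.set_name("example-framework");   // Hypothetical framework name.

  // Opt in to slave recovery: executors launched by this framework can
  // reconnect to a restarted `mesos-slave`, at the cost of extra disk I/O
  // on every slave that runs its tasks. The default is false.
  framework.set_checkpoint(true);

  std::cout << "checkpoint enabled: " << std::boolalpha
            << framework.checkpoint() << std::endl;
  return 0;
}
```

On the slave side, the `--recover` and `--strict` flags described above then determine what happens to this framework's executors when `mesos-slave` is restarted.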

[4/4] mesos git commit: Revised HA framework guide documentation.

Posted by vi...@apache.org.
Revised HA framework guide documentation.

Review: https://reviews.apache.org/r/44479/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/3da0a2c1
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/3da0a2c1
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/3da0a2c1

Branch: refs/heads/master
Commit: 3da0a2c130d9bb59045020f19d878e5f0c5aba8f
Parents: 380e570
Author: Neil Conway <ne...@gmail.com>
Authored: Mon Mar 7 19:48:03 2016 -0500
Committer: Vinod Kone <vi...@gmail.com>
Committed: Mon Mar 7 19:48:03 2016 -0500

----------------------------------------------------------------------
 docs/high-availability-framework-guide.md | 87 ++++++++++++++++----------
 1 file changed, 53 insertions(+), 34 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/3da0a2c1/docs/high-availability-framework-guide.md
----------------------------------------------------------------------
diff --git a/docs/high-availability-framework-guide.md b/docs/high-availability-framework-guide.md
index 111d7b9..62f4c56 100644
--- a/docs/high-availability-framework-guide.md
+++ b/docs/high-availability-framework-guide.md
@@ -224,9 +224,12 @@ using two different mechanisms:
 
 1. The state of a persistent TCP connection between the master and the agent.
 
-2. Health checks via periodic ping messages to the agent which are expected to
-   be responded with pongs (this behavior is controlled by the
-   `--slave_ping_timeout` and `--max_slave_ping_timeouts` master flags).
+2. _Health checks_ using periodic ping messages to the agent. The master sends
+   "ping" messages to the agent and expects a "pong" response message within a
+   configurable timeout. The agent is considered to have failed if it does not
+   respond promptly to a certain number of ping messages in a row. This behavior
+   is controlled by the `--slave_ping_timeout` and `--max_slave_ping_timeouts`
+   master flags.
 
 If the persistent TCP connection to the agent breaks or the agent fails health
 checks, the master decides that the agent has failed and takes steps to remove
@@ -244,35 +247,51 @@ it from the cluster. Specifically:
     executors are considered lost. The master immediately sends `TASK_LOST`
     status updates for the tasks. These updates are not delivered reliably to
     the scheduler (see NOTE below). The agent is given a chance to reconnect
-    until health checks timeout.
-
-* If the agent fails health checks it is scheduled for removal. The removals can
+    until health checks time out. If the agent does reconnect, any tasks for
+    which `TASK_LOST` updates were previously sent will be killed.
+
+    * The rationale for this behavior is that, using typical TCP settings, an
+      error in the persistent TCP connection between the master and the agent is
+      more likely to correspond to an agent error (e.g., the `mesos-slave`
+      process terminating unexpectedly) than a network partition, because the
+      Mesos health-check timeouts are much smaller than the typical values of
+      the corresponding TCP-level timeouts. Since non-checkpointing frameworks
+      will not survive a restart of the `mesos-slave` process, the master sends
+      `TASK_LOST` status updates so that these tasks can be rescheduled
+      promptly.  Of course, the heuristic that TCP errors do not correspond to
+      network partitions may not be true in some environments.
+
+* If the agent fails health checks, it is scheduled for removal. The removals can
   be rate limited by the master (see `---slave_removal_rate_limit` master flag)
   to avoid removing a slew of slaves at once (e.g., during a network partition).
 
-* Once it is time to remove an agent, the master marks it as "removed" in the
-  master's durable state (this will survive master failover). If an agent marked
-  as "removed" attempts to reconnect to the master (e.g., after a network
-  partition is healed), the connection attempt will be refused and the agent
-  will be asked to shutdown. The agent will then shutdown all running tasks and
-  executors, but any persistent volumes and dynamic reservations will be
-  preserved.
-
-  * To allow the removed agent node to rejoin the cluster, a new `mesos-slave`
-    process can be started. This will ensure the agent will receive a new agent
-    ID. The agent can then register with the master, and can also retain any
-    previously created persistent volumes and dynamic reservations. In effect,
-    the agent will be treated as a newly joined agent.
-
-* For each agent that is marked "removed", the scheduler receives a `slaveLost`
-  callback. The scheduler will also receive `TASK_LOST` status updates for each
-  task that was running on a removed agent.
-
-    >NOTE: Neither the callback nor the updates are reliably delivered by the
-    master. For example, if the master or scheduler fails over or there is a
+* When it is time to remove an agent, the master marks the agent as "removed" in
+  the master's [durable state](replicated-log-internals.md) (this will survive
+  master failover). The master sends a `slaveLost` callback to every registered
+  scheduler driver; it also sends `TASK_LOST` status updates for every task that
+  was running on the removed agent.
+
+    >NOTE: Neither the callback nor the status updates are delivered reliably by
+    the master. For example, if the master or scheduler fails over or there is a
     network connectivity issue during the delivery of these messages, they will
     not be resent.
 
+* Meanwhile, any tasks at the removed agent will continue to run and the agent
+  will repeatedly attempt to reconnect to the master. Once a removed agent is
+  able to reconnect to the master (e.g., because the network partition has
+  healed), the reregistration attempt will be refused and the agent will be
+  asked to shutdown. The agent will then shutdown all running tasks and
+  executors.  Persistent volumes and dynamic reservations on the removed agent
+  will be preserved.
+
+  * A removed agent can rejoin the cluster by starting a new copy of the
+    `mesos-slave` process. When a removed agent is shutdown by the master, Mesos
+    ensures that the next time `mesos-slave` is started (using the same work
+    directory at the same host), the agent will receive a new agent ID; in
+    effect, the agent will be treated as a newly joined agent. The agent will
+    retain any previously created persistent volumes and dynamic reservations,
+    although the agent ID associated with these resources will have changed.
+
 Typically, frameworks respond to failed or partitioned agents by scheduling new
 copies of the tasks that were running on the lost agent. This should be done
 with caution, however: it is possible that the lost agent is still alive, but is
@@ -286,19 +305,19 @@ framework authors.
 ## Dealing with Partitioned or Failed Masters
 
 The behavior described above does not apply during the period immediately after
-a new Mesos master is elected. As noted above, most Mesos master state is kept
-in-memory; hence, when the leading master fails and a new master is elected, the
-new master will have little knowledge of the current state of the cluster.
-Instead, it rebuilds this information as the frameworks and agents notice that a
-new master has been elected and then _reregister_ with it.
+a new Mesos master is elected. As noted above, most Mesos master state is only
+kept in memory; hence, when the leading master fails and a new master is
+elected, the new master will have little knowledge of the current state of the
+cluster.  Instead, it rebuilds this information as the frameworks and agents
+notice that a new master has been elected and then _reregister_ with it.
 
 ### Framework Reregistration
 When master failover occurs, frameworks that were connected to the previous
-leading master should reconnect to the new leading master. The
-`MesosSchedulerDriver` handles most of the details of detecting when the
+leading master should reconnect to the new leading
+master. `MesosSchedulerDriver` handles most of the details of detecting when the
 previous leading master has failed and connecting to the new leader; when the
 framework has successfully reregistered with the new leading master, the
-`reregistered` scheduler callback will be invoked.
+`reregistered` scheduler driver callback will be invoked.
 
 ### Agent Reregistration
 During the period after a new master has been elected but before a given agent

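A minimal sketch of the scheduler-side view of the agent-failure behavior described above (not from the commit; the class and the logging are hypothetical, while the callback signatures are those of the Mesos C++ scheduler API). `slaveLost` and `TASK_LOST` are exactly the unreliable notifications the doc warns about, and under default settings an agent fails health checks after roughly `--slave_ping_timeout` × `--max_slave_ping_timeouts` ≈ 75 seconds:

```cpp
#include <iostream>
#include <string>
#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

using namespace mesos;

// Hypothetical scheduler that only reacts to the callbacks discussed above;
// the remaining pure-virtual callbacks are stubbed out.
class PartitionAwareScheduler : public Scheduler
{
public:
  void registered(
      SchedulerDriver*, const FrameworkID&, const MasterInfo&) override {}

  // Invoked once the driver has reregistered this framework with a newly
  // elected leading master.
  void reregistered(SchedulerDriver*, const MasterInfo& master) override
  {
    std::cout << "Reregistered with new master " << master.id() << std::endl;
  }

  void disconnected(SchedulerDriver*) override {}
  void resourceOffers(SchedulerDriver*, const std::vector<Offer>&) override {}
  void offerRescinded(SchedulerDriver*, const OfferID&) override {}

  void statusUpdate(SchedulerDriver*, const TaskStatus& status) override
  {
    // TASK_LOST can mean the agent really failed, or merely that the
    // master<->agent connection broke; the task might still be running on a
    // partitioned agent, so reschedule with care.
    if (status.state() == TASK_LOST) {
      std::cout << "Task " << status.task_id().value()
                << " reported lost; considering rescheduling" << std::endl;
    }
  }

  void frameworkMessage(
      SchedulerDriver*,
      const ExecutorID&,
      const SlaveID&,
      const std::string&) override {}

  // Invoked when the master removes an agent. Like TASK_LOST above, this is
  // not delivered reliably across master or scheduler failover.
  void slaveLost(SchedulerDriver*, const SlaveID& slaveId) override
  {
    std::cout << "Agent " << slaveId.value() << " was removed" << std::endl;
  }

  void executorLost(
      SchedulerDriver*, const ExecutorID&, const SlaveID&, int) override {}

  void error(SchedulerDriver*, const std::string& message) override
  {
    std::cerr << "Scheduler error: " << message << std::endl;
  }
};
```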

[2/4] mesos git commit: Fixed typo in slave's `--help` output.

Posted by vi...@apache.org.
Fixed typo in slave's `--help` output.

Review: https://reviews.apache.org/r/44477/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/94d2ee3c
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/94d2ee3c
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/94d2ee3c

Branch: refs/heads/master
Commit: 94d2ee3c472d9ccf0891205f85927b2ba2515199
Parents: ff8dafe
Author: Neil Conway <ne...@gmail.com>
Authored: Mon Mar 7 19:47:52 2016 -0500
Committer: Vinod Kone <vi...@gmail.com>
Committed: Mon Mar 7 19:47:52 2016 -0500

----------------------------------------------------------------------
 src/slave/flags.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/94d2ee3c/src/slave/flags.cpp
----------------------------------------------------------------------
diff --git a/src/slave/flags.cpp b/src/slave/flags.cpp
index 6e3fd69..eb47015 100644
--- a/src/slave/flags.cpp
+++ b/src/slave/flags.cpp
@@ -210,7 +210,7 @@ mesos::internal::slave::Flags::Flags()
 
   add(&Flags::registration_backoff_factor,
       "registration_backoff_factor",
-      "slave initially picks a random amount of time between `[0, b]`, where\n"
+      "Slave initially picks a random amount of time between `[0, b]`, where\n"
       "`b = registration_backoff_factor`, to (re-)register with a new master.\n"
       "Subsequent retries are exponentially backed off based on this\n"
       "interval (e.g., 1st retry uses a random value between `[0, b * 2^1]`,\n"