Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/09/16 17:29:51 UTC

[jira] [Commented] (MESOS-695) Introduce automated self-healing and coordinated repair to Mesos

    [ https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768409#comment-13768409 ] 

Benjamin Mahler commented on MESOS-695:
---------------------------------------

I've forwarded this to some SREs who run Mesos at Twitter; I would love to see some SRE feedback on this ticket!

1) The master currently shuts down slaves that it deems to be in a bad state (for now this is just a simple ping/pong health check). At Twitter, we use monit to take care of restarting, but what could others use to do this? Should Mesos provide monit-like functionality for masters / slaves? Is this something we should try to standardize? In most cases when slaves are "lost" like this, a generic callback is invoked on schedulers ({{Scheduler::slaveLost}}).
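
To make the ping/pong idea concrete, here is a minimal Python sketch of that kind of health check. It is only an illustration, not the master's actual code: the class name, the method name, and the missed-ping threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class SlaveHealth:
    """Tracks consecutive unanswered pings for one slave (hypothetical helper)."""

    max_missed_pings: int = 5  # assumed threshold, not a documented Mesos default
    missed_pings: int = 0
    shut_down: bool = False

    def record_ping_result(self, pong_received: bool) -> bool:
        """Record one ping round; return True once the slave should be shut down."""
        if self.shut_down:
            # Once a slave is deemed bad, it stays that way in this sketch.
            return True
        if pong_received:
            # Any pong resets the count of consecutive misses.
            self.missed_pings = 0
        else:
            self.missed_pings += 1
            if self.missed_pings >= self.max_missed_pings:
                self.shut_down = True
        return self.shut_down


health = SlaveHealth(max_missed_pings=3)
pongs = (True, False, False, True, False, False, False)
results = [health.record_ping_result(pong) for pong in pongs]
```

In this sketch only consecutive misses count, so a single pong in the middle resets the counter and the slave is flagged only after the final run of three missed pings.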

I'd love to hear more elaboration on 2) and 3). For what classes of issues could reboots / re-imaging be automatically deemed necessary to repair a machine?

For 4), we currently keep track of slaves that have been deactivated due to failed health checks, which acts as a simple blacklist. The offers for these slaves are invalidated, which prevents any additional work from being scheduled on them. Schedulers are currently notified of this through {{Scheduler::slaveLost}}.
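
The bookkeeping described for 4) can be sketched roughly as follows. This is a toy Python model under stated assumptions, not the real master: {{ToyMaster}}, its methods, and the {{on_slave_lost}} callback (standing in for {{Scheduler::slaveLost}}) are hypothetical names used only for illustration.

```python
class ToyMaster:
    """Toy model of deactivated-slave tracking (all names are hypothetical)."""

    def __init__(self, on_slave_lost):
        # Callback standing in for the scheduler's slaveLost notification.
        self.on_slave_lost = on_slave_lost
        self.deactivated = set()       # acts as the simple blacklist
        self.outstanding_offers = {}   # offer_id -> slave_id

    def offer(self, offer_id, slave_id):
        """Record an offer; refuse it if the slave has been deactivated."""
        if slave_id in self.deactivated:
            return False
        self.outstanding_offers[offer_id] = slave_id
        return True

    def deactivate(self, slave_id):
        """Deactivate a slave, invalidate its offers, and notify the scheduler."""
        if slave_id in self.deactivated:
            return []
        self.deactivated.add(slave_id)
        invalidated = sorted(
            offer_id
            for offer_id, owner in self.outstanding_offers.items()
            if owner == slave_id
        )
        for offer_id in invalidated:
            del self.outstanding_offers[offer_id]
        self.on_slave_lost(slave_id)
        return invalidated


lost = []
master = ToyMaster(on_slave_lost=lost.append)
master.offer("o1", "s1")
master.offer("o2", "s2")
master.offer("o3", "s1")
invalidated = master.deactivate("s1")
accepted_after = master.offer("o4", "s1")
```

Here deactivating "s1" invalidates both of its outstanding offers, the callback fires once, and a later offer for "s1" is refused because the slave is in the deactivated set.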
                
> Introduce automated self-healing and coordinated repair to Mesos
> ----------------------------------------------------------------
>
>                 Key: MESOS-695
>                 URL: https://issues.apache.org/jira/browse/MESOS-695
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Jeff Currier
>
> One capability that is presently missing within the Mesos framework is the ability for the system to self-heal.  Specifically, the ability for a master to detect something is amiss with a particular host and then to attempt to heal that host through a set of automated corrective actions such as:
> 1) restarting process on the suspect node
> 2) rebooting the node
> 3) reimaging the node
> 4) blacklisting node from future scheduled work
> By adding this capability and informing schedulers of the behavior of the hosts within the system, it's believed that we can get Mesos to function in more of a 'lights out' mode, thereby reducing the OpEx costs of running the system today.
> It should be noted that a certain amount of coordination will be required to ensure that we don't 'repair' too many nodes at the same time.  This logic will need to be centralized, such that a central authority is elected to make these decisions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira