
Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2014/12/11 23:10:13 UTC

[jira] [Updated] (MESOS-1503) Improve slave health checking to prevent rapid widespread slave removals.

     [ https://issues.apache.org/jira/browse/MESOS-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1503:
-----------------------------------
    Priority: Critical  (was: Major)

> Improve slave health checking to prevent rapid widespread slave removals.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-1503
>                 URL: https://issues.apache.org/jira/browse/MESOS-1503
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Benjamin Mahler
>            Assignee: Timothy Chen
>            Priority: Critical
>              Labels: reliability
>
> This proposal comes out of discussions with [~tweingartner] and [~vinodkone].
> Currently the master uses a SlaveObserver for each registered slave. Each SlaveObserver operates independently and makes decisions about whether the slave is healthy.
> Because these observers are independent, in some very rare events (e.g. the masters are partitioned from 75% of the slaves) the master can very rapidly remove a large portion of the slaves in the cluster. Ideally such an event would be deemed dangerous and the removals throttled accordingly, based on a more intelligent notion of overall cluster health.
> It may be nice to have a single observer that is responsible for health checking all the slaves. This would allow us to make safer decisions about when to declare slaves unhealthy.
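
To make the idea above concrete, here is a minimal sketch of a single observer that health checks every slave and caps how many it may remove within a time window. It is only an illustration under stated assumptions, not the Mesos implementation: the class ClusterHealthObserver, its methods, and the parameters timeout, max_removals and window are hypothetical names, and Python is used for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterHealthObserver:
    """Hypothetical cluster-wide observer (illustrative, not a Mesos API)."""

    timeout: float       # seconds of silence before a slave counts as unhealthy
    max_removals: int    # removals allowed per throttling window
    window: float        # length of the throttling window, in seconds
    last_seen: dict = field(default_factory=dict)      # slave id -> last response time
    removal_times: list = field(default_factory=list)  # timestamps of recent removals

    def heartbeat(self, slave_id: str, now: float) -> None:
        """Record that a slave responded at time `now`."""
        self.last_seen[slave_id] = now

    def check(self, now: float) -> list:
        """Return the slaves removed at time `now`, respecting the throttle."""
        # Forget removals that have aged out of the throttling window.
        self.removal_times = [t for t in self.removal_times if now - t < self.window]

        # Because one observer sees every slave, it can rank all unhealthy
        # slaves together: longest-silent first, ties broken by id.
        unhealthy = sorted(
            (s for s, t in self.last_seen.items() if now - t > self.timeout),
            key=lambda s: (self.last_seen[s], s),
        )

        # Remove only as many slaves as the remaining budget allows.
        budget = max(0, self.max_removals - len(self.removal_times))
        removed = unhealthy[:budget]
        for slave_id in removed:
            del self.last_seen[slave_id]
            self.removal_times.append(now)
        return removed


# Usage: 8 slaves register, then 6 of them (75%) go silent at once.
observer = ClusterHealthObserver(timeout=75.0, max_removals=2, window=60.0)
for i in range(8):
    observer.heartbeat(f"slave-{i}", now=0.0)
observer.heartbeat("slave-6", now=100.0)
observer.heartbeat("slave-7", now=100.0)
print(observer.check(now=100.0))  # at most 2 of the 6 silent slaves are removed
```

With independent per-slave observers all six silent slaves would be removed at the same moment. Here the remaining four stay registered until the window frees up budget, which gives a partition time to heal. The numeric values are arbitrary and chosen only for the example.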



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)