
Posted to issues@mesos.apache.org by "Anand Mazumdar (JIRA)" <ji...@apache.org> on 2017/04/26 00:23:04 UTC

[jira] [Assigned] (MESOS-7426) Support for agent lifecycle management.

     [ https://issues.apache.org/jira/browse/MESOS-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anand Mazumdar reassigned MESOS-7426:
-------------------------------------

    Assignee: Anand Mazumdar

> Support for agent lifecycle management.
> ---------------------------------------
>
>                 Key: MESOS-7426
>                 URL: https://issues.apache.org/jira/browse/MESOS-7426
>             Project: Mesos
>          Issue Type: Epic
>          Components: agent
>            Reporter: Anand Mazumdar
>            Assignee: Anand Mazumdar
>              Labels: agent-lifecycle, mesosphere
>
> This epic coordinates the work to introduce agent lifecycle management in Mesos, which allows a framework to be notified when an agent node fails. The existing {{Event::Failure}} is not enough for frameworks to know that a given agent node is never coming back.
> The primary motivations for introducing such a feature would be:
> - Currently, when an agent running a task fails, an operator must intervene manually to remove the node via a configuration API exposed by the framework, e.g., {{dcos cassandra node replace}} for the Cassandra framework. This needs to be done once for every stateful framework running on the cluster.
> - When an agent is marked as unhealthy, the removal rate is bounded if the `--agent_removal_rate_limit` option is set. This is particularly problematic for operators relying on EC2 autoscaling groups or for workload bursting to another cloud.
> - When the fault domain associated with an agent changes (e.g., it is moved from an unallocated rack to an allocated rack), there is no feedback mechanism for the framework.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)