Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2016/10/25 19:39:58 UTC

[jira] [Commented] (MESOS-6078) Add a agent teardown endpoint

    [ https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606261#comment-15606261 ] 

Neil Conway commented on MESOS-6078:
------------------------------------

FYI, we will likely address this as part of the in-progress work on supporting {{TASK_GONE}} and {{TASK_GONE_BY_OPERATOR}}. Workflow:

* the framework opts in to the {{PARTITION_AWARE}} capability.
* if Mesos can _prove_ that the agent ID is gone (e.g., because the agent reboots, changes its boot ID, and then an agent using the same {{work_dir}} registers and receives a new agent ID), the framework will get {{TASK_GONE}} status updates for all tasks on the agent.
* if the operator has some out-of-band knowledge that the agent will never attempt to re-register and all of its tasks are no longer running, we'll provide an operator HTTP endpoint (e.g., /agent/gone) that the operator can hit. When this happens, the framework will receive {{TASK_GONE_BY_OPERATOR}} status updates for all tasks on the agent.
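
The workflow above can be sketched as a small decision function. This is only an illustration of the intended behavior, not Mesos code: {{GoneReason}} and {{status_for_tasks}} are hypothetical names, and the {{TASK_LOST}} fallback for frameworks that have not opted in is an assumption not stated in this comment.

```python
from enum import Enum


class GoneReason(Enum):
    """Why the master considers the agent ID permanently gone (hypothetical names)."""
    # e.g. the host rebooted and an agent with the same work_dir got a new agent ID
    PROVEN_BY_MESOS = "proven_by_mesos"
    # the operator hit the proposed endpoint (e.g. /agent/gone)
    DECLARED_BY_OPERATOR = "declared_by_operator"


def status_for_tasks(partition_aware: bool, reason: GoneReason) -> str:
    """Return the task state a framework would see for tasks on the gone agent."""
    if not partition_aware:
        # Assumption: frameworks without PARTITION_AWARE keep the older behavior.
        return "TASK_LOST"
    if reason is GoneReason.PROVEN_BY_MESOS:
        return "TASK_GONE"
    return "TASK_GONE_BY_OPERATOR"


if __name__ == "__main__":
    for aware in (True, False):
        for reason in GoneReason:
            print(aware, reason.name, status_for_tasks(aware, reason))
```

In both {{TASK_GONE}} cases the framework can treat the task as definitely not running and reschedule it right away if appropriate.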

In the meantime, the {{/machine/down}} endpoint might help here -- it shouldn't be subject to the agent removal rate limit.
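
For reference, a rough sketch of what calling {{/machine/down}} might look like. It assumes the endpoint accepts a JSON array of machine IDs ({{hostname}} and {{ip}}) and that the machine has to be part of a maintenance schedule first; please double-check both against the maintenance documentation. The master URL, hostname, IP, and the helper name {{build_machine_down_request}} are placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_machine_down_request(master_url, machines):
    """Build (but do not send) a POST request to the master's /machine/down endpoint.

    `machines` is an iterable of (hostname, ip) pairs.
    """
    payload = [{"hostname": hostname, "ip": ip} for hostname, ip in machines]
    return urllib.request.Request(
        url=master_url.rstrip("/") + "/machine/down",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_machine_down_request(
        "http://master.example.com:5050",        # placeholder master address
        [("agent-1.example.com", "10.0.0.1")],   # placeholder machine
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```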

> Add a agent teardown endpoint
> -----------------------------
>
>                 Key: MESOS-6078
>                 URL: https://issues.apache.org/jira/browse/MESOS-6078
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Cody Maloney
>            Assignee: Michael Park
>              Labels: mesosphere
>
> Currently, when a whole agent machine is unexpectedly terminated for good (e.g., AWS terminates the instance without warning), it goes through the Mesos slave removal rate limit before it's gone.
> If a couple of agents or a whole rack go away in a cluster of thousands of agents, this can become a problem.
> If the agent can be shut down "cleanly", everything would get rescheduled, but once the agent is gone, there is currently no good way for an administrator to indicate that the node is gone for good and that its tasks are lost and should be rescheduled, if appropriate, as soon as possible.


