Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2017/01/03 20:54:58 UTC

[jira] [Commented] (MESOS-6286) Master does not remove an agent if it is responsive but not registered

    [ https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796150#comment-15796150 ] 

Neil Conway commented on MESOS-6286:
------------------------------------

This problem has also been observed when the agent is stuck in recovery for an extended period of time, e.g., because a container has gotten into an unexpected state.

The simplest fix here might be to change the agent to not respond to {{PingSlaveMessage}} if the agent is not in the {{RUNNING}} state. With that change, an agent that is stuck in recovery indefinitely will eventually fail health checks; the framework will then receive {{TASK_LOST}} / {{TASK_UNREACHABLE}} status updates for any tasks on the agent, and can decide if/when to relaunch that work elsewhere. If the agent later finishes recovery, it will be allowed to re-register -- as usual, non-partition-aware tasks on the agent will be terminated and partition-aware tasks will be allowed to keep running. Any failed containers will be reported as terminal to the framework.
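
To make the proposal concrete, here is a rough, self-contained sketch of the idea. It is not the actual Mesos code: {{Agent}}, {{HealthChecker}}, {{ping()}}, {{check()}}, and the simplified message structs are hypothetical stand-ins, and the real agent and master are asynchronous libprocess actors rather than plain objects. The point is only that a non-{{RUNNING}} agent stays silent, so the master's existing missed-ping accounting eventually marks it unreachable.

```cpp
#include <optional>

// Hypothetical, simplified stand-ins for illustration only; these are not
// the real Mesos types or signatures.
enum class AgentState { RECOVERING, DISCONNECTED, RUNNING, TERMINATING };

struct PingSlaveMessage {};
struct PongSlaveMessage {};

class Agent {
public:
  explicit Agent(AgentState state) : state_(state) {}

  void setState(AgentState state) { state_ = state; }

  // Proposed behavior: only answer pings while RUNNING. An agent that is
  // stuck in recovery returns nothing, which the master sees as a timeout.
  std::optional<PongSlaveMessage> ping(const PingSlaveMessage&) const {
    if (state_ != AgentState::RUNNING) {
      return std::nullopt;
    }
    return PongSlaveMessage{};
  }

private:
  AgentState state_;
};

// Master-side view (assumed): count consecutive missed pings and report the
// agent as unreachable once the limit is hit.
class HealthChecker {
public:
  explicit HealthChecker(int maxMissedPings)
    : maxMissedPings_(maxMissedPings) {}

  // Sends one ping. Returns true if the agent should now be marked
  // unreachable.
  bool check(const Agent& agent) {
    if (agent.ping(PingSlaveMessage{}).has_value()) {
      missedPings_ = 0;  // Any pong resets the counter.
      return false;
    }
    return ++missedPings_ >= maxMissedPings_;
  }

private:
  int maxMissedPings_;
  int missedPings_ = 0;
};
```

With a limit of 3, an agent held in {{RECOVERING}} is reported unreachable on the third consecutive check, and the counter resets once the agent reaches {{RUNNING}} and answers a ping again.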

There should probably also be a mechanism to detect situations in which the agent fails to start up for an extended period, so that the operator can investigate the state of the agent. But that seems orthogonal to this issue.

> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6286
>                 URL: https://issues.apache.org/jira/browse/MESOS-6286
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Joseph Wu
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet.  The agent needs to recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes.  Repeat (1-4).
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks on this agent.
> * Executors on the agent can connect to the agent, but will not be able to register.
> We should consider adding some timeout/intervention in the master for responsive, but non-recoverable agents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)