Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2017/11/13 20:19:00 UTC

[jira] [Commented] (MESOS-8111) Mesos sees task as running, but cannot kill it because the agent is offline

    [ https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250166#comment-16250166 ] 

Vinod Kone commented on MESOS-8111:
-----------------------------------

What framework are you using? I'm assuming Marathon, since you are on DC/OS.
 
In DC/OS, the master by default marks disconnected agents as unreachable at a rate limit of 1 agent per 20 minutes. If more than one agent disconnected or was scaled down at the same time, it can take quite a while for the master to mark all of them as unreachable.
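
To make the arithmetic concrete, here is a rough sketch of how long the rate limit alone could delay the N-th agent from being marked unreachable. It assumes a simplified model that is not taken from the Mesos source: the limiter is idle when the first agent is lost, so that agent is not delayed, and each later agent waits one more interval. The function name `unreachable_delay` and its parameters are made up for illustration, and the health-check timeout that precedes the rate limiter is not modelled. As far as I know, upstream Mesos sets this kind of limit with the master flag `--agent_removal_rate_limit`.

```python
from datetime import timedelta


def unreachable_delay(position: int,
                      permits: int = 1,
                      period: timedelta = timedelta(minutes=20)) -> timedelta:
    """Estimate the rate-limit wait for the position-th queued agent.

    Simplified model (an assumption): the first agent is handled right away
    and every following agent waits one more interval of period / permits.
    The defaults mirror the "1 in 20 min" limit mentioned in the comment.
    """
    if position < 1:
        raise ValueError("position is 1-based and must be >= 1")
    if permits < 1:
        raise ValueError("permits must be >= 1")
    return (position - 1) * period / permits


if __name__ == "__main__":
    # Hypothetical scale-down in which several agents disappear at once.
    for n in (1, 2, 5, 10):
        print(f"agent #{n}: about {unreachable_delay(n)} of rate-limit delay")
```

Under this model, the tenth agent lost in the same scale-down would wait about three hours before the master marks it unreachable, which would be consistent with a task still showing as running long after its agent is gone.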

Also, can you share the master, scheduler, and agent logs covering the specific task and the time of the disconnection? That would help us diagnose this better.


> Mesos sees task as running, but cannot kill it because the agent is offline
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-8111
>                 URL: https://issues.apache.org/jira/browse/MESOS-8111
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.2.3
>         Environment: DC/OS 1.9.4
>            Reporter: Cosmin Lehene
>
> After scaling down a cluster, the master reports a task as running although the slave has been long gone.
> At the same time, it reports that it can't kill the task because the agent is offline:
> {noformat}
> I1018 16:55:22.000000  6976 master.cpp:4913] Processing KILL call for task 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
> W1018 16:55:22.000000  6976 master.cpp:5000] Cannot kill task spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline, the task is not running either. Also, I'm not sure that waiting indefinitely for an agent to recover is a good strategy.


