Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/07/16 02:16:49 UTC

[jira] [Resolved] (MESOS-525) Slave should kill tasks when disconnected from the master for longer than the health check timeout.

     [ https://issues.apache.org/jira/browse/MESOS-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler resolved MESOS-525.
-----------------------------------

    Resolution: Won't Fix
    
> Slave should kill tasks when disconnected from the master for longer than the health check timeout.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-525
>                 URL: https://issues.apache.org/jira/browse/MESOS-525
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>
> The following scenario was observed in production at Twitter:
> 1. Task T begins running on a slave:
> I0618 02:54:38.069694 15362 slave.cpp:830] Status update: task T of framework F is now in state TASK_RUNNING
> 2. Due to a network partition, the slave is removed from the master for failing health checks:
> W0618 23:56:18.063217 28745 master.cpp:1172] Removing slave 201304011727-2230002186-5050-28738-3217 at S:5051 because it has been deactivated
> I0618 23:56:18.068821 28745 master.cpp:1181] Master now considering a slave at S:5051 as inactive
> 3. The task stayed running on the partitioned slave for 6 days, until a user manually killed the process and the executor marked it as finished:
> I0624 20:20:57.565053 15380 slave.cpp:830] Status update: task 1371524058397-ads-adshard-production-153-a4504eb0-384b-4600-b6fe-e080c87bd84e of framework 201104070004-0000002563-0000 is now in state TASK_FINISHED
> There are a few ways to fix this in the slave. They rely on the fact that the master will have marked the tasks as LOST when it removed the slave, after which point we don't want the tasks to continue running.
>   1. Have the slave commit suicide after (<health_check_failure_timeout> + buffer) of disconnection from the master. This only works well when cgroups are in use, to ensure that the next run of the slave cleans up properly, and it gets messier with slave recovery.
>   2. A cleaner approach would be to have the slave kill all executors running under it. We most likely want to send TASK_LOST updates for the tasks, although this will mean duplicate updates unless the master handles these correctly. Alternatively, we can avoid sending any updates, but we'll need to guarantee that the updates were sent by the master.
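
For illustration only, the second option could be sketched roughly as below. This is not Mesos code (the slave is written in C++, and this issue was resolved as Won't Fix); the class, its method names, and the kill_executor callback are all hypothetical, and the timeout and buffer values are assumed to be configured to match the master's health check settings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DisconnectWatchdog:
    """Hypothetical helper (not part of Mesos): kills all executors once the
    slave has been disconnected for longer than timeout + buffer."""

    health_check_timeout: float            # seconds; assumed to mirror the master's setting
    buffer: float                          # extra slack before acting, in seconds
    kill_executor: Callable[[str], None]   # hypothetical callback that kills one executor
    executors: List[str] = field(default_factory=list)
    disconnected_since: Optional[float] = None

    def on_disconnected(self, now: float) -> None:
        # Record only the first moment of disconnection; repeated
        # notifications must not push the deadline further out.
        if self.disconnected_since is None:
            self.disconnected_since = now

    def on_reconnected(self) -> None:
        # Reconnecting before the deadline cancels the pending kill.
        self.disconnected_since = None

    def tick(self, now: float) -> List[str]:
        """Call periodically; returns the executors killed on this tick."""
        if self.disconnected_since is None:
            return []
        deadline = self.disconnected_since + self.health_check_timeout + self.buffer
        if now < deadline:
            return []
        killed = list(self.executors)
        for executor_id in killed:
            self.kill_executor(executor_id)
        # Nothing is left to kill on later ticks.
        self.executors.clear()
        return killed
```

The caller passes in the current time rather than having the helper read a clock, which keeps the deadline logic easy to exercise in isolation. Whether TASK_LOST updates are also sent after the kill is deliberately left out, since the description above leaves that question open.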
