Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2015/01/29 19:28:34 UTC

[jira] [Commented] (MESOS-2301) Slave does not cleanly unregister

    [ https://issues.apache.org/jira/browse/MESOS-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297292#comment-14297292 ] 

Vinod Kone commented on MESOS-2301:
-----------------------------------

How does a reboot cause the slave to shut down cleanly? Does it get some kind of signal from the system?

{quote}
This leads to the master waiting for the slave to come back for the configured amount of time and not marking the tasks as lost or killed
{quote}

Do you mean the master never marks the tasks as LOST, or just that it takes a long time? If it's the old master, it should mark them LOST after the health check timeout. If it's a new master, it should mark them LOST after the recovery timeout.

Am I missing something?

> Slave does not cleanly unregister
> ---------------------------------
>
>                 Key: MESOS-2301
>                 URL: https://issues.apache.org/jira/browse/MESOS-2301
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>            Reporter: Dario Rexin
>
> If a machine running the mesos slave is being rebooted, the mesos slave does a clean shutdown. It stops all its executors, unregisters from the master and removes the symlink to the latest state.
> However, if the master is not reachable during the reboot, the slave will still remove the symlink to the latest state and will register with a new ID when restarted. This leads to the master waiting for the slave to come back for the configured amount of time and not marking the tasks as lost or killed. This also means that these tasks will not be restarted by the framework (in this case Marathon), because it assumes they are still alive.
> This problem could be solved by introducing a new message `SlaveUnregisteredMessage` that is sent by the master when a slave has successfully unregistered. The slave only has to wait for this message, and if it doesn't receive it, it should not remove the symlink to `latest`.
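
For illustration only, the proposed handshake could look roughly like the sketch below. It is not Mesos code: `clean_shutdown` and the `unregister` callable are hypothetical names, and the sketch assumes `unregister` sends the unregister request and returns True only if a `SlaveUnregisteredMessage` arrived from the master within the timeout.

```python
import os


def clean_shutdown(latest_symlink, unregister, ack_timeout_secs=5.0):
    """Remove the `latest` symlink only after the master confirms unregistration.

    `unregister` is a hypothetical callable (an assumption, not a Mesos API):
    it takes a timeout in seconds and returns True if the master's
    SlaveUnregisteredMessage was received in time, False otherwise.
    """
    acked = unregister(ack_timeout_secs)

    # No acknowledgement: keep the symlink so the slave can recover its old
    # state and re-register with the same ID after the reboot.
    if not acked:
        return False

    # Acknowledged: the master knows the slave is gone, so it is safe to
    # drop the pointer to the latest state.
    if os.path.islink(latest_symlink):
        os.unlink(latest_symlink)
    return True
```

With an unreachable master, `unregister` would return False and the symlink would stay in place, so the restarted slave would come back under its previous ID instead of a new one.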



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)