Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2014/05/28 23:53:02 UTC

[jira] [Commented] (MESOS-1388) Inconsistent terminal task state between master and re-registering slave

    [ https://issues.apache.org/jira/browse/MESOS-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011659#comment-14011659 ] 

Benjamin Mahler commented on MESOS-1388:
----------------------------------------

https://reviews.apache.org/r/21991/

> Inconsistent terminal task state between master and re-registering slave
> ------------------------------------------------------------------------
>
>                 Key: MESOS-1388
>                 URL: https://issues.apache.org/jira/browse/MESOS-1388
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Vinod Kone
>            Assignee: Benjamin Mahler
>
> The following sequence of events could result in the master sending TASK_LOST and then TASK_FINISHED for the same task to a framework:
> --> Master failed over
> --> Slave tries to re-register with Master w/ a running task (T)
> --> Master starts re-admission into the registry
> --> Task finishes and slave removes it from its map
> --> The TASK_FINISHED status update is dropped by master as re-admission is in progress
> --> The executor terminates on the slave.
> --> Slave retries re-registration (w/o task T) as master is still busy re-admitting it and hasn't ACKed the re-registration yet
> --> Master finally finishes re-admission and re-adds slave with task T
> --> Master gets a duplicate/enqueued re-registration request (w/o task T), which results in the master sending TASK_LOST during reconciliation.
> --> Master now gets the retried TASK_FINISHED update from the slave, which it forwards to the scheduler.
> Normally, the slave re-registers and includes terminal unacknowledged tasks in the message to the master. However, when the executor has terminated, the slave does not send any of the executor's tasks. This is problematic when there are unacknowledged updates for tasks run by the executor.
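
For illustration only, the sequence quoted above can be replayed with a small toy model. This is a sketch, not Mesos code: the ToyMaster class, its methods, and the run_scenario helper are all hypothetical names, and the model assumes that a re-registration arriving during re-admission is simply queued and reconciled afterwards. The include_unacked_terminal_tasks flag only shows what happens if the retried re-registration still lists T; it is not a description of the change under review.

```python
# Toy replay of the event sequence in the issue description.
# Hypothetical model: none of these names exist in Mesos.


class ToyMaster:
    def __init__(self):
        self.pending = {}    # slave -> tasks from the request being re-admitted
        self.admitted = {}   # slave -> tasks the master believes the slave has
        self.queued = []     # re-registrations received during re-admission
        self.updates = []    # (task, state) pairs sent to the framework

    def reregister(self, slave, tasks):
        tasks = set(tasks)
        if slave in self.pending:
            # Re-admission in progress: handle this request later.
            self.queued.append((slave, tasks))
        elif slave in self.admitted:
            # Reconciliation: tasks the master knows about but the slave
            # did not report are declared lost.
            for task in sorted(self.admitted[slave] - tasks):
                self.updates.append((task, "TASK_LOST"))
            self.admitted[slave] &= tasks
        else:
            # First request after failover: start re-admission.
            self.pending[slave] = tasks

    def finish_readmission(self, slave):
        self.admitted[slave] = self.pending.pop(slave)
        queued, self.queued = self.queued, []
        for queued_slave, queued_tasks in queued:
            self.reregister(queued_slave, queued_tasks)

    def status_update(self, slave, task, state):
        if slave not in self.admitted:
            return False  # dropped; the slave is expected to retry
        self.updates.append((task, state))
        self.admitted[slave].discard(task)
        return True


def run_scenario(include_unacked_terminal_tasks=False):
    master = ToyMaster()
    master.reregister("S", {"T"})                    # slave re-registers w/ T
    master.status_update("S", "T", "TASK_FINISHED")  # dropped: re-admitting
    # Executor has terminated; the retry omits T unless the flag is set.
    retry_tasks = {"T"} if include_unacked_terminal_tasks else set()
    master.reregister("S", retry_tasks)              # queued retry
    master.finish_readmission("S")                   # re-adds S, then reconciles
    master.status_update("S", "T", "TASK_FINISHED")  # retried update, forwarded
    return master.updates


if __name__ == "__main__":
    print(run_scenario())
    print(run_scenario(include_unacked_terminal_tasks=True))
```

In this model, the default run sends the framework TASK_LOST followed by TASK_FINISHED for T, while the run in which the retry still reports T sends only TASK_FINISHED.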



--
This message was sent by Atlassian JIRA
(v6.2#6252)