Posted to dev@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2014/05/06 02:50:15 UTC
[jira] [Commented] (MESOS-783) Master::killTask must not answer with TASK_LOST when the task is unknown.
[ https://issues.apache.org/jira/browse/MESOS-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990150#comment-13990150 ]
Vinod Kone commented on MESOS-783:
----------------------------------
Can this be closed now?
> Master::killTask must not answer with TASK_LOST when the task is unknown.
> -------------------------------------------------------------------------
>
> Key: MESOS-783
> URL: https://issues.apache.org/jira/browse/MESOS-783
> Project: Mesos
> Issue Type: Sub-task
> Affects Versions: 0.14.0, 0.14.1, 0.14.2, 0.15.0
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Critical
> Labels: twitter
> Fix For: 0.19.0
>
>
> When the Master is asked to kill a task and it knows of the framework but it cannot locate the TaskID, the Master replies with TASK_LOST.
> This is normally OK; however, consider a failed-over Master:
> --> Master fails over.
> --> Framework F re-registers.
> --> Slave with Task T in TASK_RUNNING has not yet re-registered.
> --> Master::killTask(F, T) cannot find T and replies with TASK_LOST.
> --> Slave re-registers with Task T in TASK_RUNNING.
> --> Now we've told the framework the task was LOST but it is left RUNNING.
> The simple fix is to not reply in such cases and to rely on a later reconciliation request.
> With a stateful master (MESOS-764), we can reliably reply with TASK_LOST if the slave is not in the Registrar; otherwise we must remain silent, because the slave may still re-register with the correct state of the task. Ideally we would hold the kill task message and send it once the slave re-registers, but this is somewhat complicated to implement, and reconciliation can cover this case.
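
The decision rule described in the quoted issue can be sketched roughly as follows. This is a toy model in Python, not Mesos code: the `Master` class, the `Reply` enum, the `registrar` set and the `slave_id` argument to `kill_task` are all hypothetical names introduced for illustration, and the sketch assumes the master can learn which slave the task was supposed to be on.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Reply(Enum):
    KILL_FORWARDED = "kill forwarded to slave"
    TASK_LOST = "TASK_LOST sent to framework"
    SILENT = "no reply; wait for reconciliation"


class Master:
    """Toy model of the master's state (hypothetical, not the Mesos API)."""

    def __init__(self, registrar: Optional[Set[str]] = None) -> None:
        # task_id -> slave_id, filled in only as slaves (re-)register.
        self.tasks: Dict[str, str] = {}
        # Slave IDs recorded in a persistent Registrar (assumption).
        # None models a stateless master with no such record.
        self.registrar = registrar

    def reregister_slave(self, slave_id: str, running_tasks: Set[str]) -> None:
        # A re-registering slave reports the tasks it is still running.
        for task_id in running_tasks:
            self.tasks[task_id] = slave_id

    def kill_task(self, task_id: str, slave_id: Optional[str] = None) -> Reply:
        if task_id in self.tasks:
            return Reply.KILL_FORWARDED
        # Unknown task: TASK_LOST is only safe when the Registrar shows
        # that the slave cannot come back with the task still running.
        if (self.registrar is not None
                and slave_id is not None
                and slave_id not in self.registrar):
            return Reply.TASK_LOST
        # Otherwise the slave may still re-register, so stay silent.
        return Reply.SILENT


# The failover scenario from the issue: slave S1 is in the Registrar
# but has not yet re-registered with the new master.
master = Master(registrar={"S1"})
before = master.kill_task("T", slave_id="S1")   # slave may still return
master.reregister_slave("S1", {"T"})
after = master.kill_task("T", slave_id="S1")    # task is now known
```

In this sketch the master stays silent while the slave's fate is unknown, and a stateless master (`registrar=None`) never answers TASK_LOST for an unknown task. That matches the "simple fix" in the issue and leaves the framework to learn the true state through reconciliation.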
--
This message was sent by Atlassian JIRA
(v6.2#6252)