Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2014/05/21 22:00:38 UTC

[jira] [Comment Edited] (MESOS-1389) Reconciliation can send TASK_LOST before a terminal update reaches the framework.

    [ https://issues.apache.org/jira/browse/MESOS-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005133#comment-14005133 ] 

Benjamin Mahler edited comment on MESOS-1389 at 5/21/14 7:58 PM:
-----------------------------------------------------------------

Because the master does not currently handle status update acknowledgements, routing all acknowledgements through the master unfortunately requires releasing the change in 2 phases (with a required deployment order) or 3 phases (no required deployment order). This means that we can only fix the bug fully in 0.20.0 (two phases) or 0.21.0 (three phases).

h3. Two phases (required upgrade order, no full N and N+1 compatibility)
*Phase 1: 0.19.0*:
* Implement status update acknowledgement handling in the Master, but do not store terminated tasks.
* Ensure the Slave can handle acknowledgements coming from both the Master and the Scheduler Driver.
* Update the scheduler driver to send status update acknowledgements to the Master.

Upgrade order: (1) all Slaves and Masters, then (2) Frameworks. Note that a 0.18.0 slave would not work with a 0.19.0 master and framework. Likewise, a 0.19.0 master would not work with a 0.18.0 framework.

*Phase 2: 0.20.0*:
* Update the master to store terminal un-acked tasks.

h3. Three phases (no upgrade order required)

*Phase 1: 0.19.0*:
* Implement status update acknowledgement handling in the Master, but do not store terminated tasks.
* Ensure the Slave can handle acknowledgements coming from both the Master and the Scheduler Driver.

*Phase 2: 0.20.0*:
* Update the scheduler driver to send status update acknowledgements to the Master.

*Phase 3: 0.21.0*:
* Update the master to store terminal un-acked tasks.
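The effect of the final phase can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Master` class, its methods, and the `store_terminal_unacked` flag are hypothetical names invented for illustration and are not Mesos code. It shows why a master that drops terminal tasks answers reconciliation with TASK_LOST, while a master that keeps terminal un-acked tasks can report the real terminal state.

```python
from dataclasses import dataclass, field

# Task states treated as terminal in this sketch.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


@dataclass
class Master:
    """Toy model of the master's task bookkeeping (hypothetical, not Mesos code)."""

    store_terminal_unacked: bool
    tasks: dict = field(default_factory=dict)  # task_id -> latest known state

    def slave_reregistered(self, reported):
        """Record the tasks that a re-registering slave reports."""
        for task_id, state in reported.items():
            if state in TERMINAL_STATES and not self.store_terminal_unacked:
                continue  # behaviour before the last phase: terminal tasks are dropped
            self.tasks[task_id] = state

    def acknowledge(self, task_id):
        """Forget a terminal task once its update has been acknowledged."""
        if self.tasks.get(task_id) in TERMINAL_STATES:
            del self.tasks[task_id]

    def reconcile(self, task_id):
        """Answer a reconciliation request; unknown tasks are reported as lost."""
        return self.tasks.get(task_id, "TASK_LOST")


# The slave re-registers with a finished task whose update is still un-acked.
before_fix = Master(store_terminal_unacked=False)
before_fix.slave_reregistered({"T": "TASK_FINISHED"})

after_fix = Master(store_terminal_unacked=True)
after_fix.slave_reregistered({"T": "TASK_FINISHED"})
```

In this model, `before_fix.reconcile("T")` yields TASK_LOST even though the slave still holds a TASK_FINISHED update, whereas `after_fix.reconcile("T")` yields TASK_FINISHED until the acknowledgement passes through the master.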

h3. Safety
To ensure that this is done correctly, the Slave must reject acknowledgements from a non-leading master. This is required to prevent the following sequence:

1. The slave re-registers with the new leading master and reports a terminal task T, since the terminal status update has not been acknowledged.
2. The slave receives a stale acknowledgement message from the *old* master and stops retrying the update.
3. The leading master is stuck with the task T in its memory, since the slave will never retry the update.
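The safety rule can be sketched as follows. This is a minimal illustration, not the actual slave implementation: the `Slave` class, its method names, and the use of plain strings as master identities are all assumptions made for the example, and it only models acknowledgements relayed by a master (not those sent directly by a scheduler driver).

```python
class Slave:
    """Toy model of a slave's status update retry state (hypothetical, not Mesos code)."""

    def __init__(self, leading_master):
        # Identity of the master the slave currently considers the leader.
        self.leading_master = leading_master
        # task_id -> un-acked terminal state; these updates are retried until acknowledged.
        self.pending = {}

    def terminal_update(self, task_id, state):
        """Queue a terminal status update for retry until it is acknowledged."""
        self.pending[task_id] = state

    def new_master_detected(self, master):
        """Switch to a newly elected leading master."""
        self.leading_master = master

    def acknowledgement(self, sender, task_id):
        """Handle an acknowledgement; return True only if it was accepted."""
        if sender != self.leading_master:
            # Stale acknowledgement from a non-leading master: keep retrying.
            return False
        self.pending.pop(task_id, None)
        return True


slave = Slave(leading_master="master-1")
slave.terminal_update("T", "TASK_FINISHED")
slave.new_master_detected("master-2")

stale_accepted = slave.acknowledgement("master-1", "T")  # from the old master
still_pending = "T" in slave.pending
fresh_accepted = slave.acknowledgement("master-2", "T")  # from the leading master
```

Because the stale acknowledgement is rejected, the update for T stays in the retry queue until the leading master acknowledges it, so the leading master is not left holding T indefinitely.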


was (Author: bmahler):
Because the master does not handle status update acknowledgements, to have all acknowledgments going through the master we'll unfortunately need to release the change in 2 phases. This means 0.19.0 will be stuck with this reconciliation bug:

*0.19.0*: Implement status update acknowledgement handling in the Master. Ensure the Slave can handle acknowledgements coming from both the Master and the Scheduler Driver. It is *key* that the slave ignores any acknowledgements coming from the non-leading Master (see below).

*0.20.0*: Update the scheduler driver to send status updates to the Master always.

To ensure that this is done correctly, the Slave must reject acknowledgments from the non-leading master. This is required to prevent the following:

1. Slave re-registers with the master with a terminal task T, since the terminal status has not been acknowledged.
2. The slave receives a stale acknowledgment message from the *old* master, stops retrying the update.
3. The leading master is stuck with the task T in its memory, since the slave will never retry the update.

> Reconciliation can send TASK_LOST before a terminal update reaches the framework.
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-1389
>                 URL: https://issues.apache.org/jira/browse/MESOS-1389
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Benjamin Mahler
>             Fix For: 0.19.0
>
>
> There's an unfortunate case with reconciliation, where we end up sending TASK_LOST first and then the slave sends the valid terminal status update.
> This happens when the slave re-registers with terminal tasks that have un-acked updates: the master does not store these tasks. So while the slave still needs to send the terminal status updates, the master will reply with TASK_LOST for reconciliation.
> We may need to ensure that all status update acknowledgements go through the master to fix this.


