Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2016/12/01 15:55:58 UTC

[jira] [Commented] (MESOS-6250) Ensure valid task state before connecting with framework on master failover

    [ https://issues.apache.org/jira/browse/MESOS-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712324#comment-15712324 ] 

Neil Conway commented on MESOS-6250:
------------------------------------

{quote}
Now, on Mesos master failover, Mesos does not guarantee that it first re-registers with its slaves before it re-connects to a framework. So it can occur that the framework connects before Mesos has finished or started the re-registration with the slaves. When the framework then sends reconciliation requests directly after re-registration, Mesos will reply with status updates where the task state is wrong (TASK_LOST instead of TASK_RUNNING).
{quote}

This is not quite true; the actual scenario is:

* Framework F runs task T on agent A. Master M1 is the current leading master.
* M1 fails over and M2 is elected as the new leading master.
* F re-registers with M2 and does explicit reconciliation for task T; agent A has not yet re-registered.

At this point, explicit reconciliation for {{T}} will _not_ return {{TASK_LOST}} -- it returns nothing at all, because the master gives agents 10 minutes (by default) to re-register after a failover (see the {{agent_reregister_timeout}} master flag). Once that timeout has expired, the master returns {{TASK_LOST}} -- but the task _will_ be terminated if/when the agent re-registers. (Returning {{TASK_LOST}} here is not "wrong", because it is consistent with what {{TASK_LOST}} means in general: it does _not_ mean that the task is definitely not running, but rather that the master has lost contact with the agent running the task.) For more on this behavior, see https://mesos.apache.org/documentation/latest/high-availability-framework-guide/
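
For concreteness, here is a minimal sketch of what such an explicit reconciliation request looks like over the v1 scheduler HTTP API; the master address, framework ID, stream ID, and task/agent IDs below are placeholders that a real scheduler would already hold from its SUBSCRIBE connection:

{code:python}
# Minimal sketch (not from this ticket): explicit reconciliation of one task
# via the Mesos v1 scheduler HTTP API. All concrete values are placeholders.

import requests

MASTER = "http://127.0.0.1:5050"    # placeholder master address
FRAMEWORK_ID = "<framework-id>"     # from the SUBSCRIBED event
STREAM_ID = "<mesos-stream-id>"     # Mesos-Stream-Id header of the SUBSCRIBE response


def reconcile(task_id, agent_id):
    """Ask the master for the latest known state of a single task.

    The master acknowledges with 202 Accepted; the status update itself (if
    any) arrives asynchronously on the SUBSCRIBE event stream with reason
    REASON_RECONCILIATION. Before agent_reregister_timeout expires after a
    failover, the master stays silent for tasks on agents that have not yet
    re-registered, rather than answering TASK_LOST.
    """
    call = {
        "framework_id": {"value": FRAMEWORK_ID},
        "type": "RECONCILE",
        "reconcile": {
            "tasks": [
                {"task_id": {"value": task_id},
                 "agent_id": {"value": agent_id}},
            ]
        },
    }
    resp = requests.post(
        MASTER + "/api/v1/scheduler",
        json=call,
        headers={"Mesos-Stream-Id": STREAM_ID},
    )
    resp.raise_for_status()


reconcile("<task-id-of-T>", "<agent-id-of-A>")
{code}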

There are several projects in progress to improve this behavior:

* If the framework is partition-aware (see MESOS-5344 and MESOS-6394), it will see a new set of task states that more precisely describe the master's knowledge of the state of the task ({{TASK_UNREACHABLE}}, {{TASK_UNKNOWN}}, {{TASK_GONE}}, {{TASK_DROPPED}}, etc.). An experimental version of this feature is available in Mesos 1.1 and will be improved in 1.2; a sketch of opting in to this capability follows this list.
* To clarify reconciliation behavior after master failover but before the {{agent_reregister_timeout}} has expired, we might change the master to send a different task state (e.g., {{TASK_FAILOVER_UNKNOWN}}); see MESOS-4050.
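
For illustration only, here is a minimal sketch of how a scheduler could opt in to the partition-aware states over the v1 scheduler HTTP API; the master address, user, and framework name are placeholders, not values from this ticket:

{code:python}
# Minimal sketch (not from this ticket): subscribing with the PARTITION_AWARE
# framework capability (Mesos 1.1+) over the v1 scheduler HTTP API, so that
# the master reports the finer-grained task states instead of TASK_LOST.
# Master address, user, and framework name are placeholders.

import requests

MASTER = "http://127.0.0.1:5050"  # placeholder master address

subscribe = {
    "type": "SUBSCRIBE",
    "subscribe": {
        "framework_info": {
            "user": "nobody",              # placeholder
            "name": "example-framework",   # placeholder
            # Opt in to TASK_UNREACHABLE, TASK_UNKNOWN, TASK_GONE,
            # TASK_DROPPED, etc. instead of a blanket TASK_LOST.
            "capabilities": [{"type": "PARTITION_AWARE"}],
        }
    },
}

# The response is a long-lived RecordIO event stream; the Mesos-Stream-Id
# response header must be echoed on every subsequent call (e.g. RECONCILE).
resp = requests.post(MASTER + "/api/v1/scheduler", json=subscribe, stream=True)
stream_id = resp.headers.get("Mesos-Stream-Id")
{code}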

Therefore I'm going to close this ticket, because I think the current issues are covered by other existing JIRAs. Please let me know if you disagree.

> Ensure valid task state before connecting with framework on master failover
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-6250
>                 URL: https://issues.apache.org/jira/browse/MESOS-6250
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.28.0, 0.28.1, 1.0.1
>         Environment: OS X 10.11.6
>            Reporter: Markus Jura
>            Priority: Minor
>
> During a Mesos master failover, the master re-registers with its slaves to receive the current state of the running tasks. It also reconnects to the frameworks.
> The documentation recommends that a framework perform an explicit task reconciliation when the Mesos master re-registers: http://mesos.apache.org/documentation/latest/reconciliation/
> When allowing a framework to reconcile, the Mesos master should guarantee that its task state is valid, i.e. the same as on the slaves. Otherwise, Mesos can reply with status updates of state {{TASK_LOST}} even though the task is still running on the slave.
> Now, on Mesos master failover, Mesos does not guarantee that it first re-registers with its slaves before it re-connects to a framework. So it can occur that the framework connects before Mesos has finished or started the re-registration with the slaves. When the framework then sends reconciliation requests directly after re-registration, Mesos will reply with status updates where the task state is wrong ({{TASK_LOST}} instead of {{TASK_RUNNING}}).
> For a reconciliation request, Mesos should guarantee that the task state is consistent with the slaves before it replies with a status update.
> Another possibility would be for Mesos to send a message to the framework once it has re-registered with the slaves, so that the framework can then start the reconciliation. Currently, a framework can only delay the reconciliation for a certain amount of time, but it does not know how long the delay should be because Mesos does not notify the framework when the task state is consistent again.
> *Log: Mesos master - connecting with framework before re-registering with slaves*
> {code:bash}
> I0926 12:39:42.006933 4284416 detector.cpp:152] Detected a new leader: (id='92')
> I0926 12:39:42.007242 1064960 group.cpp:706] Trying to get '/mesos/json.info_0000000092' in ZooKeeper
> I0926 12:39:42.008129 4284416 zookeeper.cpp:259] A new leading master (UPID=master@127.0.0.1:5049) is detected
> I0926 12:39:42.008304 4284416 master.cpp:1847] The newly elected leader is master@127.0.0.1:5049 with id 96178e81-8371-48af-ba5e-c79d16c27fab
> I0926 12:39:42.008332 4284416 master.cpp:1860] Elected as the leading master!
> I0926 12:39:42.008349 4284416 master.cpp:1547] Recovering from registrar
> I0926 12:39:42.008488 3211264 registrar.cpp:332] Recovering registrar
> I0926 12:39:42.015935 4284416 registrar.cpp:365] Successfully fetched the registry (0B) in 7.426816ms
> I0926 12:39:42.015985 4284416 registrar.cpp:464] Applied 1 operations in 11us; attempting to update the 'registry'
> I0926 12:39:42.021425 4284416 registrar.cpp:509] Successfully updated the 'registry' in 5.426176ms
> I0926 12:39:42.021462 4284416 registrar.cpp:395] Successfully recovered registrar
> I0926 12:39:42.021581 528384 master.cpp:1655] Recovered 0 agents from the Registry (118B) ; allowing 10mins for agents to re-register
> I0926 12:39:42.299598 3747840 master.cpp:2424] Received SUBSCRIBE call for framework 'conductr' at scheduler-65610031-d679-49e5-b7bd-6068500d4674@192.168.2.106:65290
> I0926 12:39:42.299697 3747840 master.cpp:2500] Subscribing framework conductr with checkpointing disabled and capabilities [  ]
> I0926 12:39:42.300122 2674688 hierarchical.cpp:271] Added framework conductr
> I0926 12:39:42.954983 1601536 master.cpp:4787] Re-registering agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 12:39:42.955189 1064960 registrar.cpp:464] Applied 1 operations in 60us; attempting to update the 'registry'
> I0926 12:39:42.955893 1064960 registrar.cpp:509] Successfully updated the 'registry' in 649984ns
> I0926 12:39:42.956224 4284416 master.cpp:7447] Adding task c69df81e-35f4-4c2e-863b-4e9d5ae2e850 with resources mem(*):0 on agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1)
> I0926 12:39:42.956704 4284416 master.cpp:4872] Re-registered agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000]
> I0926 12:39:42.956751 1601536 hierarchical.cpp:478] Added agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: cpus(*):0.9; mem(*):402.653; disk(*):1000; ports(*):[31000-31000, 31001-31500])
> I0926 12:39:42.956830 4284416 master.cpp:4940] Sending updated checkpointed resources  to agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 12:39:42.957373 2138112 master.cpp:5725] Sending 1 offers to framework conductr (conductr) at scheduler-65610031-d679-49e5-b7bd-6068500d4674@192.168.2.106:65290
> I0926 12:39:42.957715 2138112 master.cpp:5002] Received update of agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1) with total oversubscribed resources
> I0926 12:39:42.957948 4284416 hierarchical.cpp:542] Agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) updated with oversubscribed resources  (total: cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000], allocated: cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000])
> I0926 12:39:42.959281 2138112 master.cpp:3951] Processing DECLINE call for offers: [ 96178e81-8371-48af-ba5e-c79d16c27fab-O0 ] for framework conductr (conductr) at scheduler-65610031-d679-49e5-b7bd-6068500d4674@192.168.2.106:65290
> {code}
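
A side note on the delay-based workaround suggested in the description above: since the master does not announce when agent re-registration has settled, the usual pattern is to retry explicit reconciliation with capped exponential backoff until every task of interest has reported. A minimal sketch, where {{send_reconcile}} and {{get_pending}} are hypothetical placeholders for the scheduler's own call plumbing and bookkeeping:

{code:python}
# Minimal sketch (not from this ticket): retry explicit reconciliation with
# capped exponential backoff instead of a single fixed delay. Both arguments
# are hypothetical hooks into the scheduler:
#   send_reconcile(task_ids) -- issues a RECONCILE call for the given tasks
#   get_pending()            -- returns the task IDs still awaiting an update
#                               (drained by the scheduler's update handler)

import time


def reconcile_with_backoff(send_reconcile, get_pending,
                           initial_delay=5.0, max_delay=600.0):
    delay = initial_delay
    while True:
        pending = get_pending()
        if not pending:
            break                          # every task has been accounted for
        send_reconcile(sorted(pending))    # re-ask only for the stragglers
        time.sleep(delay)                  # give master/agents time to settle
        delay = min(delay * 2, max_delay)  # capped exponential backoff
{code}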


