Posted to issues@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2016/10/26 17:37:58 UTC

[jira] [Updated] (MESOS-6483) Check failure when a 1.1 master marking a 0.28 agent as unreachable

     [ https://issues.apache.org/jira/browse/MESOS-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Xu updated MESOS-6483:
--------------------------
    Summary: Check failure when a 1.1 master marking a 0.28 agent as unreachable  (was: Potential issue with upgrading from mesos 0.28 to mesos > 1.0)

> Check failure when a 1.1 master marking a 0.28 agent as unreachable
> -------------------------------------------------------------------
>
>                 Key: MESOS-6483
>                 URL: https://issues.apache.org/jira/browse/MESOS-6483
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Megha
>
> When upgrading directly from Mesos 0.28 to a version > 1.0, the CHECK(frameworks.recovered.contains(frameworkId)) in Master::_markUnreachable(..) can fail. The following sequence of events triggers it:
> 1) The master is upgraded to the new version first, while an agent, say X, is still at Mesos 0.28.
> 2) Agent X (at Mesos 0.28) attempts to re-register with the master (at, say, 1.1). It doesn't send the frameworks (frameworkInfos) in the ReRegisterSlave message, since that field wasn't available in the older Mesos version.
> 3) Among the frameworks on agent X is a framework Y that didn't re-register after the master's failover. Since the master builds frameworks.recovered from the frameworkInfos that agents provide, framework Y is in neither the recovered nor the registered frameworks.
> 4) After re-registering, agent X fails the master's health check and is marked unreachable by the master. The CHECK(frameworks.recovered.contains(frameworkId)) then fails for framework Y, since it is neither recovered nor registered but has tasks running on agent X.
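
The sequence above can be sketched as a toy model in C++. This is not Mesos code: ToyMaster, reregisterAgent, and invariantHolds are hypothetical names introduced only for illustration, and an empty frameworkInfos list is an assumed stand-in for a 0.28 agent that does not send that field.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for the master's framework bookkeeping (hypothetical,
// not the real Mesos data structures).
struct ToyMaster {
  // Frameworks that re-registered with the master after failover.
  std::set<std::string> registered;

  // Frameworks the master learned about only from agents' frameworkInfos.
  std::set<std::string> recovered;

  // An agent re-registers. A 0.28 agent is modeled by passing an empty
  // list, because it does not send frameworkInfos at all.
  void reregisterAgent(const std::vector<std::string>& frameworkInfos) {
    for (const std::string& id : frameworkInfos) {
      if (registered.count(id) == 0) {
        recovered.insert(id);
      }
    }
  }

  // Models the condition the CHECK relies on when an agent is marked
  // unreachable: every framework with tasks on that agent must be
  // either registered or recovered.
  bool invariantHolds(const std::vector<std::string>& taskFrameworkIds) const {
    for (const std::string& id : taskFrameworkIds) {
      if (registered.count(id) == 0 && recovered.count(id) == 0) {
        return false;  // this is where the CHECK would fail
      }
    }
    return true;
  }
};
```

In this model, calling reregisterAgent({}) for agent X and then invariantHolds({"Y"}) returns false, which corresponds to step 4. If the agent had sent frameworkInfos containing Y, the same call would return true.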



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)