Posted to issues@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2016/10/26 17:52:58 UTC

[jira] [Commented] (MESOS-6483) Check failure when a 1.1 master marking a 0.28 agent as unreachable

    [ https://issues.apache.org/jira/browse/MESOS-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609167#comment-15609167 ] 

Yan Xu commented on MESOS-6483:
-------------------------------

[~neilc] how should we handle an unreachable 0.28 agent with a 1.1 master? This can happen when you upgrade directly from 0.28 to 1.1, which in theory we should (try our best to) support. Upgrading agents first (so that 1.1 agents run against a 0.28 master) sounds workable, but if so we should probably add a section to upgrades.md?

> Check failure when a 1.1 master marking a 0.28 agent as unreachable
> -------------------------------------------------------------------
>
>                 Key: MESOS-6483
>                 URL: https://issues.apache.org/jira/browse/MESOS-6483
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Megha
>
> When upgrading directly from Mesos 0.28 to a version > 1.0, the CHECK(frameworks.recovered.contains(frameworkId)) in Master::_markUnreachable(..) can fail. The following sequence of events triggers it:
> 1) The master is upgraded to the new version first, while an agent, say X, is still at Mesos 0.28.
> 2) Agent X (at Mesos 0.28) attempts to re-register with the master (at, say, 1.1). It doesn't send the frameworks (frameworkInfos) in the ReRegisterSlave message, since that field wasn't available in the older Mesos version.
> 3) Among the frameworks on agent X is a framework Y that didn’t re-register after the master’s failover. Since the master builds frameworks.recovered from the frameworkInfos that agents provide, framework Y is in neither the recovered nor the registered frameworks.
> 4) After re-registering, agent X fails the master’s health check and is marked unreachable by the master. The CHECK(frameworks.recovered.contains(frameworkId)) then fails for framework Y, since it is neither recovered nor registered but has tasks running on agent X.
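
The sequence above can be sketched with a small, self-contained model. This is not the Mesos source: MasterModel, reregisterAgent and invariantHolds are hypothetical names made up for illustration, and the bookkeeping is reduced to two sets of framework IDs. It only assumes what the description states, namely that the master fills frameworks.recovered from the frameworkInfos agents send, and that a 0.28 agent sends none.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the master's framework bookkeeping.
// None of these names come from the Mesos code base.
struct MasterModel {
  std::set<std::string> registered;  // frameworks that re-registered after failover
  std::set<std::string> recovered;   // frameworks learned from agents' frameworkInfos

  // Models an agent re-registering. A 0.28 agent sends no frameworkInfos,
  // so the vector is empty and nothing is added to `recovered`.
  void reregisterAgent(const std::vector<std::string>& frameworkInfos) {
    for (const std::string& id : frameworkInfos) {
      if (registered.count(id) == 0) {
        recovered.insert(id);
      }
    }
  }

  // Models the invariant behind the CHECK: a framework with tasks on the
  // agent being marked unreachable is expected to be registered or recovered.
  bool invariantHolds(const std::string& frameworkId) const {
    return registered.count(frameworkId) > 0 ||
           recovered.count(frameworkId) > 0;
  }
};
```

In this model, calling reregisterAgent({}) (the 0.28 agent) leaves framework "Y" unknown, so invariantHolds("Y") is false, which corresponds to the CHECK failing in step 4. Calling reregisterAgent({"Y"}) (an agent that does send frameworkInfos) makes the invariant hold.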


