Posted to issues@mesos.apache.org by "Michael Park (JIRA)" <ji...@apache.org> on 2017/05/09 23:28:04 UTC

[jira] [Created] (MESOS-7487) A framework upgrading into PARTITION_AWARE capability will continue to receive {{TASK_LOST}} on old agents.

Michael Park created MESOS-7487:
-----------------------------------

             Summary: A framework upgrading into PARTITION_AWARE capability will continue to receive {{TASK_LOST}} on old agents.
                 Key: MESOS-7487
                 URL: https://issues.apache.org/jira/browse/MESOS-7487
             Project: Mesos
          Issue Type: Bug
          Components: agent
    Affects Versions: 1.2.0, 1.1.0
            Reporter: Michael Park


Before 1.3.0, the master did not send a {{FrameworkInfo}} in the {{UpdateFrameworkMessage}}. In general, this means that a pre-1.3.0 agent does not update its copy of the {{FrameworkInfo}} when a framework changes its {{FrameworkInfo}}. Specifically, if a framework upgrades to add the {{PARTITION_AWARE}} capability, 1.1.x and 1.2.x agents will not be aware of the update and will incorrectly report {{TASK_LOST}} in some cases.

Note that the run task path is okay since the master sends the new {{FrameworkInfo}} on that path. The incorrect code paths are the ones that perform the following check:

{code}
      if (!protobuf::frameworkHasCapability(
              framework->info,  // This is the one in agent memory!
              FrameworkInfo::Capability::PARTITION_AWARE))
{code}

One solution is to backport the changes to {{UpdateFrameworkMessage}} to 1.1.x and 1.2.x, but only update the capabilities portion of the {{FrameworkInfo}}.
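
To make the proposal concrete, here is a minimal sketch of a capabilities-only update. It is not the Mesos code: {{FrameworkInfo}} below is a simplified stand-in struct rather than the real protobuf message, and {{wouldReportTaskLost}} and {{applyCapabilitiesOnly}} are hypothetical helper names invented for this illustration.

```cpp
#include <set>
#include <string>

// Hypothetical stand-ins for the Mesos protobuf types. The real
// FrameworkInfo is a protobuf message with many more fields.
enum class Capability { PARTITION_AWARE, MULTI_ROLE };

struct FrameworkInfo {
  std::set<std::string> roles;
  std::set<Capability> capabilities;
};

bool frameworkHasCapability(const FrameworkInfo& info, Capability capability) {
  return info.capabilities.count(capability) > 0;
}

// Models the check quoted in this issue: the agent consults its own
// in-memory copy of the FrameworkInfo, so a stale copy yields TASK_LOST.
bool wouldReportTaskLost(const FrameworkInfo& agentCopy) {
  return !frameworkHasCapability(agentCopy, Capability::PARTITION_AWARE);
}

// Sketch of the proposed backport behaviour (an assumption, not the
// actual patch): copy only the capabilities from the updated
// FrameworkInfo and leave every other field, such as roles, untouched.
void applyCapabilitiesOnly(FrameworkInfo& agentCopy,
                           const FrameworkInfo& updated) {
  agentCopy.capabilities = updated.capabilities;
}
```

With this shape, an agent whose copy lacks {{PARTITION_AWARE}} stops reporting {{TASK_LOST}} once the update is applied, while its view of the framework's roles stays exactly as it was before the update.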

If we update the entire {{FrameworkInfo}}, a 1.1.x agent will run into an issue where it doesn't know how to deal with changes to {{FrameworkInfo.roles}}. Frameworks changing their roles is a 1.3.x feature. Note that a 1.2.x agent can handle the role changes correctly because of {{Resource.allocation_info}}, which was introduced with multi-role support in 1.2.x.

Refer to MESOS-7460 for the potential issue with backporting to 1.1.x.


