Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/10/01 01:06:23 UTC

[jira] [Created] (MESOS-711) Master::reconcile incorrectly recovers resources from reconciled tasks.

Benjamin Mahler created MESOS-711:
-------------------------------------

             Summary: Master::reconcile incorrectly recovers resources from reconciled tasks.
                 Key: MESOS-711
                 URL: https://issues.apache.org/jira/browse/MESOS-711
             Project: Mesos
          Issue Type: Bug
            Reporter: Benjamin Mahler
            Assignee: Benjamin Mahler
            Priority: Critical


The following sequence of events will over-subscribe a slave in the allocator:

--> Slave re-registers with the same master due to a slave restart. Tasks were running on the slave, but are lost in the process of the slave restarting.

--> As a result, the slave includes no task / executor information in its re-registration message.

--> The slave is added back to the allocator with its full resources, in Master::reregisterSlave():

      // If this is a disconnected slave, add it back to the allocator.
      if (slave->disconnected) {
        slave->disconnected = false; // Reset the flag.

        hashmap<FrameworkID, Resources> resources;
        foreach (const ExecutorInfo& executorInfo, executorInfos) {
          resources[executorInfo.framework_id()] += executorInfo.resources();
        }
        foreach (const Task& task, tasks) {
          // Ignore tasks that have reached terminal state.
          if (!protobuf::isTerminalState(task.state())) {
            resources[task.framework_id()] += task.resources();
          }
        }
        allocator->slaveAdded(slaveId, slaveInfo, resources);
      }

--> Now reconciliation occurs, and the master sends a TASK_LOST message for each task missing from the slave through Master::statusUpdate, which results in a call to Allocator::resourcesRecovered!

--> Reconciliation also calls Allocator::resourcesRecovered for the unknown executors.

--> Together, these two calls recover resources that were never counted as used when the slave was added back, so the allocator offers more resources than the slave contains.
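
The sequence above can be replayed with a toy model of the allocator's accounting. This is only a sketch: ToyAllocator, replayBug, and the single "cpus" number are hypothetical names invented for illustration and are not the Mesos Allocator API. It assumes the allocator tracks one available total per slave.

```cpp
#include <cassert>

// Hypothetical toy model of per-slave accounting; NOT the Mesos Allocator API.
struct ToyAllocator {
  double total = 0.0;      // CPUs the slave actually has.
  double available = 0.0;  // CPUs the allocator believes it may offer.

  // Loosely mirrors slaveAdded(): whatever is not reported as used
  // is treated as available.
  void slaveAdded(double totalCpus, double usedCpus) {
    assert(usedCpus <= totalCpus);
    total = totalCpus;
    available = totalCpus - usedCpus;
  }

  // Loosely mirrors resourcesRecovered(): returns resources to the pool.
  void resourcesRecovered(double cpus) { available += cpus; }

  bool overSubscribed() const { return available > total; }
};

// Replays the reported sequence: the slave re-registers with no tasks,
// then reconciliation recovers the resources of a task the master knew about.
inline ToyAllocator replayBug(double totalCpus, double lostTaskCpus) {
  ToyAllocator allocator;
  allocator.slaveAdded(totalCpus, 0.0);        // No tasks in the re-registration.
  allocator.resourcesRecovered(lostTaskCpus);  // TASK_LOST during reconciliation.
  return allocator;
}
```

For example, replayBug(8.0, 2.0) leaves the toy allocator believing 10 CPUs are available on a slave that has only 8.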

We can either change the re-registration code or change the reconciliation code. The easiest fix here is to add the slave back taking into account the used resources from both the slave's *and the master's* information.
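
One way to picture the suggested fix is to compute the used resources from the union of the tasks the slave reports and the tasks the master still knows about, counting each task once. The sketch below is an assumption about the shape of that computation, not the actual patch: TaskCpus and usedCpus are hypothetical names, and resources are reduced to a single CPU count per task ID.

```cpp
#include <map>
#include <string>

// Hypothetical representation: task ID -> CPUs held by that task.
using TaskCpus = std::map<std::string, double>;

// Sums CPUs over the union of tasks known to the slave and to the master.
// A task known to both is counted once (the master's entry is kept).
inline double usedCpus(const TaskCpus& slaveTasks,
                       const TaskCpus& masterTasks) {
  TaskCpus merged = masterTasks;
  // std::map::insert does not overwrite keys that are already present.
  merged.insert(slaveTasks.begin(), slaveTasks.end());

  double used = 0.0;
  for (const auto& entry : merged) {
    used += entry.second;
  }
  return used;
}
```

Under these assumptions, if the master still knows of a 2-CPU task that the restarted slave no longer reports, an 8-CPU slave is added back with 6 CPUs available, and the later recovery of the lost task brings it to exactly 8 rather than 10.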



--
This message was sent by Atlassian JIRA
(v6.1#6144)