Posted to issues@mesos.apache.org by "Jie Yu (JIRA)" <ji...@apache.org> on 2015/06/24 02:08:42 UTC

[jira] [Assigned] (MESOS-2919) Framework can overcommit oversubscribable resources during master failover.

     [ https://issues.apache.org/jira/browse/MESOS-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu reassigned MESOS-2919:
-----------------------------

    Assignee: Jie Yu

> Framework can overcommit oversubscribable resources during master failover.
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-2919
>                 URL: https://issues.apache.org/jira/browse/MESOS-2919
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jie Yu
>            Assignee: Jie Yu
>            Priority: Critical
>              Labels: twitter
>
> This is due to a bug in the hierarchical allocator. Here is the sequence of events:
> 1) The slave uses a fixed resource estimator that advertises 4 revocable cpus.
> 2) Framework A launches a task that uses all 4 revocable cpus.
> 3) The master fails over.
> 4) The slave re-registers with the new master and sends an UpdateSlaveMessage with 4 revocable cpus as oversubscribed resources.
> 5) Framework A hasn't registered with the new master yet, so its running task is not counted and the slave's available resources are computed as 4 revocable cpus.
> 6) Framework A registers and receives an additional 4 revocable cpus, so it can launch another task with 4 revocable cpus (8 in total, on a slave that only advertises 4).
> The problem is the way the allocator calculates the allocated resources in 'updateSlave'. If the framework is not registered when the slave is added, the 'allocation' computed below is not accurate, because 'addSlave' skips the resources used by that framework (see the 'if' block in 'addSlave').
> {code}
> template <class RoleSorter, class FrameworkSorter>
> void
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::updateSlave(
>     const SlaveID& slaveId,
>     const Resources& oversubscribed)
> {
>   CHECK(initialized);
>   CHECK(slaves.contains(slaveId));
>   // Check that all the oversubscribed resources are revocable.
>   CHECK_EQ(oversubscribed, oversubscribed.revocable());
>   // Update the total resources.
>   // First remove the old oversubscribed resources from the total.
>   slaves[slaveId].total -= slaves[slaveId].total.revocable();
>   // Now add the new estimate of oversubscribed resources.
>   slaves[slaveId].total += oversubscribed;
>   // Now, update the total resources in the role sorter.
>   roleSorter->update(
>       slaveId,
>       slaves[slaveId].total.unreserved());
>   // Calculate the current allocation of oversubscribed resources.
>   Resources allocation;
>   foreachkey (const std::string& role, roles) {
>     allocation += roleSorter->allocation(role, slaveId).revocable();
>   }
>   // Update the available resources.
>   // First remove the old oversubscribed resources from available.
>   slaves[slaveId].available -= slaves[slaveId].available.revocable();
>   // Now add the new estimate of available oversubscribed resources.
>   slaves[slaveId].available += oversubscribed - allocation;
>   LOG(INFO) << "Slave " << slaveId << " (" << slaves[slaveId].hostname
>             << ") updated with oversubscribed resources " << oversubscribed
>             << " (total: " << slaves[slaveId].total
>             << ", available: " << slaves[slaveId].available << ")";
>   allocate(slaveId);
> }
> template <class RoleSorter, class FrameworkSorter>
> void
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::addSlave(
>     const SlaveID& slaveId,
>     const SlaveInfo& slaveInfo,
>     const Resources& total,
>     const hashmap<FrameworkID, Resources>& used)
> {
>   CHECK(initialized);
>   CHECK(!slaves.contains(slaveId));
>   roleSorter->add(slaveId, total.unreserved());
>   foreachpair (const FrameworkID& frameworkId,
>                const Resources& allocated,
>                used) {
>     if (frameworks.contains(frameworkId)) {
>       const std::string& role = frameworks[frameworkId].role;
>       // TODO(bmahler): Validate that the reserved resources have the
>       // framework's role.
>       roleSorter->allocated(role, slaveId, allocated.unreserved());
>       frameworkSorters[role]->add(slaveId, allocated);
>       frameworkSorters[role]->allocated(
>           frameworkId.value(), slaveId, allocated);
>     }
>   }
>   ...
> }
> {code}
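
The sequence of events above can be replayed with a small toy model. This is only a sketch under simplifying assumptions: it tracks a single slave and plain cpu counts instead of Resources objects, and ToyAllocator and availableAfterFailover are hypothetical names made up for illustration, not Mesos APIs. It mirrors just the two pieces of logic quoted above: 'addSlave' skipping frameworks that are not registered, and 'updateSlave' computing available as oversubscribed minus the tracked allocation.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for the allocator state of one slave.
// Only revocable cpus are modeled.
struct ToyAllocator
{
  std::set<std::string> frameworks;          // Registered frameworks.
  std::map<std::string, double> allocation;  // Tracked revocable cpus.
  double total = 0;
  double available = 0;

  // Mirrors the 'if (frameworks.contains(frameworkId))' block in
  // 'addSlave': usage by unregistered frameworks is silently skipped.
  void addSlave(const std::map<std::string, double>& used)
  {
    for (const auto& entry : used) {
      if (frameworks.count(entry.first) > 0) {
        allocation[entry.first] += entry.second;
      }
    }
  }

  // Mirrors 'updateSlave': available = oversubscribed - tracked allocation.
  void updateSlave(double oversubscribed)
  {
    assert(oversubscribed >= 0);

    double allocated = 0;
    for (const auto& entry : allocation) {
      allocated += entry.second;
    }

    total = oversubscribed;
    available = oversubscribed - allocated;
  }
};

// Replays steps 3 to 5: returns the revocable cpus that the new master
// believes are available after the slave re-registers.
double availableAfterFailover(bool frameworkRegisteredFirst)
{
  ToyAllocator allocator;

  if (frameworkRegisteredFirst) {
    allocator.frameworks.insert("A");
  }

  allocator.addSlave({{"A", 4.0}});  // The task from step 2 is still running.
  allocator.updateSlave(4.0);        // Step 4: the estimator reports 4 cpus.

  return allocator.available;
}
```

In this model, availableAfterFailover(false) yields 4 (the running task is invisible, so the same 4 cpus can be offered again), while availableAfterFailover(true) yields 0, which is the accounting one would expect.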



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)