
Posted to issues@mesos.apache.org by "Andrei Sekretenko (Jira)" <ji...@apache.org> on 2019/10/21 18:09:00 UTC

[jira] [Commented] (MESOS-10015) HierarchicalAllocatorProcess::updateAvailable() can stall the allocator with a huge number of reservations on an agent.

    [ https://issues.apache.org/jira/browse/MESOS-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956346#comment-16956346 ] 

Andrei Sekretenko commented on MESOS-10015:
-------------------------------------------

https://issues.apache.org/jira/browse/MESOS-9942 and related work will fix the `total number of frameworks` part.

To fix the quadratic growth in the number of reservations, we can avoid using `Resources::operator+=`, `Resources::operator-=` and `Resources::contains()` when re-adding a slave to a framework sorter.
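
To illustrate where the quadratic factor can come from, here is a minimal toy model, not the actual Mesos code. `ToyResource`, `ToyResources` and `addableCallsForUniqueReservations()` are hypothetical names made up for this sketch. It assumes that adding one resource object scans the already accumulated objects for an addable one, so accumulating n objects with pairwise distinct reservations costs n * (n - 1) / 2 addable checks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a single resource object; NOT the Mesos `Resource` protobuf.
struct ToyResource {
  std::string name;         // e.g. "cpus"
  std::string reservation;  // hypothetical label identifying the reservation
  double amount;
};

// Toy container that merges on add (an assumption of this sketch).
class ToyResources {
public:
  // Adds one object: scans existing entries for an addable one, merges into
  // it if found, otherwise appends a new entry.
  ToyResources& operator+=(const ToyResource& r) {
    for (ToyResource& existing : items_) {
      ++addableCalls_;
      if (addable(existing, r)) {
        existing.amount += r.amount;
        return *this;
      }
    }
    items_.push_back(r);
    return *this;
  }

  std::size_t size() const { return items_.size(); }
  std::size_t addableCalls() const { return addableCalls_; }

private:
  // Two toy objects can be merged only if name and reservation both match.
  static bool addable(const ToyResource& a, const ToyResource& b) {
    return a.name == b.name && a.reservation == b.reservation;
  }

  std::vector<ToyResource> items_;
  std::size_t addableCalls_ = 0;
};

// Accumulates n objects with pairwise distinct reservations one by one and
// returns how many addable checks that took.
std::size_t addableCallsForUniqueReservations(std::size_t n) {
  ToyResources total;
  for (std::size_t i = 0; i < n; ++i) {
    total += ToyResource{"cpus", "reservation-" + std::to_string(i), 1.0};
  }
  return total.addableCalls();
}
```

In this model `addableCallsForUniqueReservations(100)` performs 4950 checks and `addableCallsForUniqueReservations(200)` performs 19900, so doubling the number of unique reservations roughly quadruples the work. Repeating this once per framework would give the `(total number of frameworks)` factor from the issue description.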


> HierarchicalAllocatorProcess::updateAvailable() can stall the allocator with a huge number of reservations on an agent.
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-10015
>                 URL: https://issues.apache.org/jira/browse/MESOS-10015
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.5.3, 1.6.2, 1.7.2, 1.8.1, 1.9.0
>            Reporter: Andrei Sekretenko
>            Assignee: Andrei Sekretenko
>            Priority: Critical
>              Labels: resource-management
>
> Currently, a call to updateAvailable() with a single-object Resources for a single framework on a single slave requires `(total number of frameworks) * (number of resource objects on this slave)^2` calls to `Resource::addable()`.
> In a cluster with a large number of frameworks, this severely degrades allocator performance when a burst of RESERVE/UNRESERVE operations occurs for an agent with hundreds of unique resources.
> On our testing cluster, we observed task scheduling delays of up to 30 minutes due to the allocator being occupied with processing UNRESERVE operations.


