Posted to yarn-issues@hadoop.apache.org by "Wangda Tan (Jira)" <ji...@apache.org> on 2020/12/11 18:14:00 UTC

[jira] [Commented] (YARN-10530) CapacityScheduler ResourceLimits doesn't handle node partition well

    [ https://issues.apache.org/jira/browse/YARN-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248084#comment-17248084 ] 

Wangda Tan commented on YARN-10530:
-----------------------------------

cc: [~sunilg], [~epayne]

> CapacityScheduler ResourceLimits doesn't handle node partition well
> -------------------------------------------------------------------
>
>                 Key: YARN-10530
>                 URL: https://issues.apache.org/jira/browse/YARN-10530
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>            Reporter: Wangda Tan
>            Priority: Blocker
>
> This is a serious bug that may impact all releases. I need to do further checks, but I want to log the JIRA so we will not forget:  
> ResourceLimits objects serve two purposes: 
> 1) When the cluster resource changes, for example when a new node is added or the scheduler config is reinitialized, we pass ResourceLimits to updateClusterResource on the queues. 
> 2) When allocating a container, we pass the parent's available resource down to the child to make sure the child's allocation won't violate the parent's max resource. For example: 
> {code}
> queue         used  max
> --------------------------------------
> root          10    20
> root.a        8     10
> root.a.a1     2     10
> root.a.a2     6     10
> {code}
> Even though a.a1 has 8 resources of headroom (a1.max - a1.used), we can allocate at most 2 resources to a1 because root.a's limit is hit first. This information is passed down from parent queue to child queue during the assignContainers call via ResourceLimits. 
> However, we only pass one ResourceLimits from the top. For queue initialization, we pass: 
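The parent-limit arithmetic above can be sketched as follows (a toy illustration with made-up names, not the actual CapacityScheduler code):

```java
// Toy sketch of the limit passed down to a child queue: the minimum of the
// child's own headroom and the parent's remaining headroom. Names are
// illustrative, not from the YARN code base.
public class HeadroomSketch {
    static int childAllocationLimit(int childMax, int childUsed,
                                    int parentMax, int parentUsed) {
        int childHeadroom = childMax - childUsed;    // a1: 10 - 2 = 8
        int parentHeadroom = parentMax - parentUsed; // root.a: 10 - 8 = 2
        return Math.min(childHeadroom, parentHeadroom);
    }

    public static void main(String[] args) {
        // root.a.a1 from the table above: only 2 can be allocated.
        System.out.println(childAllocationLimit(10, 2, 10, 8)); // prints 2
    }
}
```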
> {code}
>     root.updateClusterResource(clusterResource, new ResourceLimits(
>         clusterResource));
> {code}
> And when we update the cluster resource, we only consider the default partition: 
> {code}
>       // Update all children
>       for (CSQueue childQueue : childQueues) {
>         // Get ResourceLimits of child queue before assign containers
>         ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
>             clusterResource, resourceLimits,
>             RMNodeLabelsManager.NO_LABEL, false);
>         childQueue.updateClusterResource(clusterResource, childLimits);
>       }
> {code}
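One possible direction (an editor's sketch with made-up types, not a proposed patch) would be to derive one limit per partition rather than a single default-partition limit when updating children:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: compute a limit for every partition from that partition's
// own resource, instead of deriving a single limit from the default
// partition. Types and names are hypothetical, not YARN APIs.
public class PerPartitionLimits {
    static Map<String, Integer> childLimits(Map<String, Integer> partitionResource,
                                            double queueMaxPercent) {
        Map<String, Integer> limits = new HashMap<>();
        for (Map.Entry<String, Integer> e : partitionResource.entrySet()) {
            // One limit per partition, based on that partition's resource.
            limits.put(e.getKey(), (int) (e.getValue() * queueMaxPercent));
        }
        return limits;
    }

    public static void main(String[] args) {
        Map<String, Integer> partitions = new HashMap<>();
        partitions.put("", 20);     // default partition (NO_LABEL)
        partitions.put("gpu", 40);  // a labeled partition with more resource
        Map<String, Integer> limits = childLimits(partitions, 0.5);
        System.out.println(limits.get(""));    // prints 10
        System.out.println(limits.get("gpu")); // prints 20
    }
}
```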
> The allocation logic does the same; we pass in the following (I actually found a TODO item I added 5 years ago):
> {code}
>     // Try to use NON_EXCLUSIVE
>     assignment = getRootQueue().assignContainers(getClusterResource(),
>         candidates,
>         // TODO, now we only consider limits for parent for non-labeled
>         // resources, should consider labeled resources as well.
>         new ResourceLimits(labelManager
>             .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
>                 getClusterResource())),
>         SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
> {code} 
> The good thing is that in the assignContainers call, we calculate the child limit based on the partition:
> {code} 
> ResourceLimits childLimits =
>           getResourceLimitsOfChild(childQueue, cluster, limits,
>               candidates.getPartition(), true);
> {code} 
> So I think the problem is: when a named partition has more resources than the default partition, the effective min/max resources of each queue could be wrong.
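To make that concrete with assumed numbers (not from the issue): if a queue's max capacity is 50%, deriving its limit from the default partition's resource understates the limit on a larger labeled partition:

```java
// Assumed numbers for illustration: the default partition has 20 resources
// and a labeled partition has 40. A 50% max derived only from the default
// partition yields 10, understating the correct limit of 20 on the label.
public class PartitionLimitSketch {
    static int effectiveMax(int partitionResource, double maxPercent) {
        return (int) (partitionResource * maxPercent);
    }

    public static void main(String[] args) {
        System.out.println(effectiveMax(20, 0.5)); // prints 10 (default partition)
        System.out.println(effectiveMax(40, 0.5)); // prints 20 (labeled partition)
    }
}
```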



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org