Posted to yarn-issues@hadoop.apache.org by "Wangda Tan (Jira)" <ji...@apache.org> on 2020/12/11 18:14:00 UTC
[jira] [Commented] (YARN-10530) CapacityScheduler ResourceLimits doesn't handle node partition well
[ https://issues.apache.org/jira/browse/YARN-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248084#comment-17248084 ]
Wangda Tan commented on YARN-10530:
-----------------------------------
cc: [~sunilg], [~epayne]
> CapacityScheduler ResourceLimits doesn't handle node partition well
> -------------------------------------------------------------------
>
> Key: YARN-10530
> URL: https://issues.apache.org/jira/browse/YARN-10530
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler
> Reporter: Wangda Tan
> Priority: Blocker
>
> This is a serious bug that may impact all releases. I need to do a further check, but I want to log the JIRA so we will not forget:
> ResourceLimits objects are used to handle two purposes:
> 1) When the cluster resource changes, for example when a new node is added or the scheduler config is reinitialized, we pass ResourceLimits to the queues via updateClusterResource.
> 2) When allocating a container, we pass the parent's available resource down to the child to make sure the child's allocation won't violate the parent's max resource. For example:
> {code}
> queue       used   max
> --------------------------------------
> root         10     20
> root.a        8     10
> root.a.a1     2     10
> root.a.a2     6     10
> {code}
> Even though a.a1 has 8 resources of headroom (a1.max - a1.used), we can allocate at most 2 resources to a1 because root.a's limit is hit first. This information is passed down from parent queue to child queue during the assignContainers call via ResourceLimits.
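The cap described above can be sketched in plain Java (illustrative arithmetic only, not YARN code; the numbers come from the table above):

```java
// Minimal sketch of how a parent queue's remaining headroom caps
// what a child queue can actually receive.
public class HeadroomSketch {
    static long childAllocatable(long childUsed, long childMax,
                                 long parentUsed, long parentMax) {
        long childHeadroom = childMax - childUsed;    // a1: 10 - 2 = 8
        long parentHeadroom = parentMax - parentUsed; // a:  10 - 8 = 2
        // The child can never get more than the parent has left.
        return Math.min(childHeadroom, parentHeadroom);
    }

    public static void main(String[] args) {
        // root.a.a1 under root.a, per the table above
        System.out.println(childAllocatable(2, 10, 8, 10)); // prints 2
    }
}
```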
> However, we only pass one ResourceLimits from the top. For queue initialization, we pass in:
> {code}
> root.updateClusterResource(clusterResource,
>     new ResourceLimits(clusterResource));
> {code}
> And when we update the cluster resource, we only consider the default partition:
> {code}
> // Update all children
> for (CSQueue childQueue : childQueues) {
>   // Get ResourceLimits of child queue before assigning containers
>   ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
>       clusterResource, resourceLimits,
>       RMNodeLabelsManager.NO_LABEL, false);
>   childQueue.updateClusterResource(clusterResource, childLimits);
> }
> {code}
> The same applies to the allocation logic, where we pass in the following (I actually found a TODO item I added 5 years ago):
> {code}
> // Try to use NON_EXCLUSIVE
> assignment = getRootQueue().assignContainers(getClusterResource(),
>     candidates,
>     // TODO, now we only consider limits for parent for non-labeled
>     // resources, should consider labeled resources as well.
>     new ResourceLimits(labelManager.getResourceByLabel(
>         RMNodeLabelsManager.NO_LABEL, getClusterResource())),
>     SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
> {code}
> The good thing is that, in the assignContainers call, we calculate the child limit based on the partition:
> {code}
> ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
>     cluster, limits, candidates.getPartition(), true);
> {code}
> So I think the problem now is: when a named partition has more resource than the default partition, the effective min/max resource of each queue could be wrong.
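A rough illustration of why deriving limits from the default partition only goes wrong (plain Java with hypothetical numbers; effectiveMax is an illustrative helper, not a YARN API):

```java
// Illustrative sketch, not YARN code: a queue's effective max resource
// should be derived from each partition's own total, not only from the
// default (NO_LABEL) partition's total.
public class PartitionLimitSketch {
    // Hypothetical helper: effective max = partition total * queue max percentage.
    static long effectiveMax(long partitionTotal, double queueMaxPct) {
        return (long) (partitionTotal * queueMaxPct);
    }

    public static void main(String[] args) {
        long defaultTotal = 100;  // default (NO_LABEL) partition total
        long gpuTotal = 400;      // named partition with more resource
        double queueMaxPct = 0.5; // the queue's max capacity, 50%

        // Buggy path: every partition's limit computed from the default total.
        long buggyGpuMax = effectiveMax(defaultTotal, queueMaxPct);   // 50
        // Per-partition path: each partition uses its own total.
        long correctGpuMax = effectiveMax(gpuTotal, queueMaxPct);     // 200

        System.out.println(buggyGpuMax + " vs " + correctGpuMax);
    }
}
```

With these numbers, the gpu partition's effective max would be understated by a factor of four when only the default partition's total is used.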
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org