Posted to yarn-dev@hadoop.apache.org by "Wangda Tan (Jira)" <ji...@apache.org> on 2020/12/11 18:03:00 UTC

[jira] [Created] (YARN-10530) CapacityScheduler ResourceLimits doesn't handle node partition well

Wangda Tan created YARN-10530:
---------------------------------

             Summary: CapacityScheduler ResourceLimits doesn't handle node partition well
                 Key: YARN-10530
                 URL: https://issues.apache.org/jira/browse/YARN-10530
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler, capacityscheduler
            Reporter: Wangda Tan


This is a serious bug that may impact all releases. I need to do further checking, but I want to log the JIRA so we will not forget:

ResourceLimits objects serve two purposes:

1) When the cluster resource changes, for example when a new node is added or the scheduler config is reinitialized, we pass a ResourceLimits down to the queues via updateClusterResource.

2) When allocating a container, we pass the parent's available resource to the child to make sure the child's resource allocation won't violate the parent's max resource. For example:

{code}
queue         used  max
--------------------------------------
root          10    20
root.a        8     10
root.a.a1     2     10
root.a.a2     6     10
{code}

Even though a.a1 has 8 resources of headroom (a1.max - a1.used), we can allocate at most 2 resources to a1 because root.a's limit is hit first. This information is passed down from the parent queue to the child queue during the assignContainers call via ResourceLimits.
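To make the arithmetic above concrete, here is an illustrative sketch in plain Java (hypothetical class and method names, not the actual CapacityScheduler code): the limit a child effectively sees is the minimum of its own headroom and the parent's remaining headroom.

{code}
// Illustrative sketch only, not YARN code: the limit passed down to a
// child is capped by the parent's remaining room, which is why a1 can
// receive at most 2 despite having 8 local headroom.
public class HeadroomSketch {
  static int effectiveHeadroom(int childMax, int childUsed,
                               int parentMax, int parentUsed) {
    int childRoom = childMax - childUsed;     // a1:     10 - 2 = 8
    int parentRoom = parentMax - parentUsed;  // root.a: 10 - 8 = 2
    return Math.min(childRoom, parentRoom);   // parent's limit hits first
  }

  public static void main(String[] args) {
    // Numbers taken from the table above.
    System.out.println(effectiveHeadroom(10, 2, 10, 8)); // prints 2
  }
}
{code}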

However, we only pass one ResourceLimits from the top. For queue initialization, we pass in:

{code}
    root.updateClusterResource(clusterResource, new ResourceLimits(
        clusterResource));
{code}

And when we update the cluster resource, we only consider the default partition:

{code}
      // Update all children
      for (CSQueue childQueue : childQueues) {
        // Get ResourceLimits of child queue before assign containers
        ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
            clusterResource, resourceLimits,
            RMNodeLabelsManager.NO_LABEL, false);
        childQueue.updateClusterResource(clusterResource, childLimits);
      }
{code}
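A minimal sketch of why the fixed NO_LABEL argument matters (hypothetical types and numbers, not a proposed patch): since the label is hard-coded, limits for named partitions are never recomputed during updateClusterResource; a partition-aware update would have to recompute per label, roughly:

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not YARN code: recompute a child's limit for
// every partition label instead of only the default (empty) label.
public class PartitionUpdateSketch {
  static final String NO_LABEL = ""; // stands in for RMNodeLabelsManager.NO_LABEL

  // partitionResources: total resource per partition label.
  // Returns a purely illustrative 50%-cap limit per partition.
  static Map<String, Integer> childLimitsPerPartition(
      Map<String, Integer> partitionResources) {
    Map<String, Integer> limits = new HashMap<>();
    for (Map.Entry<String, Integer> e : partitionResources.entrySet()) {
      limits.put(e.getKey(), e.getValue() / 2);
    }
    return limits;
  }

  public static void main(String[] args) {
    Map<String, Integer> cluster = Map.of(NO_LABEL, 20, "gpu", 100);
    // Updating only NO_LABEL would leave the "gpu" entry stale.
    System.out.println(childLimitsPerPartition(cluster));
  }
}
{code}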

The same holds for the allocation logic, where we pass in the following (I actually found a TODO item I added 5 years ago):

{code}
    // Try to use NON_EXCLUSIVE
    assignment = getRootQueue().assignContainers(getClusterResource(),
        candidates,
        // TODO, now we only consider limits for parent for non-labeled
        // resources, should consider labeled resources as well.
        new ResourceLimits(labelManager
            .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
                getClusterResource())),
        SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
{code} 

The good thing is that in the assignContainers call, we calculate the child limit based on the partition:
{code} 
ResourceLimits childLimits =
          getResourceLimitsOfChild(childQueue, cluster, limits,
              candidates.getPartition(), true);
{code} 

So I think the problem is: when a named partition has more resources than the default partition, the effective min/max resource of each queue could be wrong.
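To make the failure mode concrete (hypothetical numbers, illustrative code only, not the scheduler's actual limit calculation): consider a queue capped at 50% of its partition, where the named partition is larger than the default one.

{code}
// Hypothetical numbers, illustrative only: computing a queue's cap from
// the default partition's total (20) instead of the named partition's
// total (100) under-reports the limit for the named partition.
public class PartitionLimitSketch {
  static int queueMax(int partitionTotal, double maxCapacityPct) {
    return (int) (partitionTotal * maxCapacityPct);
  }

  public static void main(String[] args) {
    int fromDefault = queueMax(20, 0.5);  // 10 -- what the bug pattern computes
    int fromNamed   = queueMax(100, 0.5); // 50 -- what it should be
    System.out.println(fromDefault + " vs " + fromNamed); // prints "10 vs 50"
  }
}
{code}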



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org