Posted to yarn-issues@hadoop.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2021/07/08 14:59:00 UTC

[jira] [Assigned] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator

     [ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko reassigned YARN-10848:
-----------------------------------

    Assignee: Minni Mittal

> Vcore allocation problem with DefaultResourceCalculator
> -------------------------------------------------------
>
>                 Key: YARN-10848
>                 URL: https://issues.apache.org/jira/browse/YARN-10848
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>            Reporter: Peter Bacsko
>            Assignee: Minni Mittal
>            Priority: Major
>         Attachments: TestTooManyContainers.java
>
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating containers even if we run out of vcores.
> CS checks the available resources in two places. The first check is in {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
>     if (calculator.computeAvailableContainers(Resources
>             .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>         minimumAllocation) <= 0) {
>       LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>           + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
>     if (!Resources.fitsIn(rc, capability, totalResource)) {
>       LOG.warn("Node : " + node.getNodeID()
>           + " does not have sufficient resource for ask : " + pendingAsk
>           + " node total capability : " + node.getTotalResource());
>       // Skip this locality request
>       ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>           activitiesManager, node, application, schedulerKey,
>           ActivityDiagnosticConstant.
>               NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>               + getResourceDiagnostics(capability, totalResource),
>           ActivityLevel.NODE);
>       return ContainerAllocation.LOCALITY_SKIPPED;
>     }
> {noformat}
> Here, {{rc}} is the resource calculator instance; the other two values are:
> {noformat}
>     Resource capability = pendingAsk.getPerAllocationResource();
>     Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case that demonstrates the problem. The root cause is that we pass the resource calculator to {{Resources.fitsIn()}}. Instead, we should use the overloaded version that takes no calculator, just like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
>     // Can we allocate a container on this node?
>     if (Resources.fitsIn(capability, available)) {
>       // Inform the application of the new container for this request
>       RMContainer allocatedContainer =
>           allocate(type, node, schedulerKey, pendingAsk,
>               reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use {{Resources.fitsIn()}} without the calculator in {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
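
To illustrate why the choice of {{fitsIn()}} variant matters, here is a minimal, self-contained sketch. It does not use the Hadoop classes: {{Res}}, {{memoryOnlyFitsIn()}}, {{allDimensionsFitIn()}} and {{countAllocations()}} are hypothetical stand-ins, and the sketch assumes that a memory-only calculator compares only memory while the calculator-free check compares every resource dimension. The node and ask sizes are made up for the example.

```java
public class FitsInSketch {
  // Hypothetical stand-in for YARN's Resource: memory in MB plus vcores.
  static final class Res {
    final long memoryMb;
    final int vcores;

    Res(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }
  }

  // Assumed behaviour of a memory-only calculator: vcores are ignored.
  static boolean memoryOnlyFitsIn(Res smaller, Res bigger) {
    return smaller.memoryMb <= bigger.memoryMb;
  }

  // Assumed behaviour of the calculator-free check: every dimension must fit.
  static boolean allDimensionsFitIn(Res smaller, Res bigger) {
    return smaller.memoryMb <= bigger.memoryMb
        && smaller.vcores <= bigger.vcores;
  }

  // Keeps placing identical containers on one node until the chosen check
  // says the ask no longer fits, and returns how many were placed.
  static int countAllocations(Res node, Res ask, boolean memoryOnly) {
    long freeMemory = node.memoryMb;
    int freeVcores = node.vcores;
    int allocated = 0;
    while (true) {
      Res available = new Res(freeMemory, freeVcores);
      boolean fits = memoryOnly
          ? memoryOnlyFitsIn(ask, available)
          : allDimensionsFitIn(ask, available);
      if (!fits) {
        break;
      }
      freeMemory -= ask.memoryMb;
      freeVcores -= ask.vcores; // may go negative with the memory-only check
      allocated++;
    }
    return allocated;
  }

  public static void main(String[] args) {
    Res node = new Res(16384, 4); // made-up node: 16 GB, 4 vcores
    Res ask = new Res(1024, 1);   // made-up ask: 1 GB, 1 vcore
    System.out.println("memory-only check:    "
        + countAllocations(node, ask, true));
    System.out.println("all-dimensions check: "
        + countAllocations(node, ask, false));
  }
}
```

With these made-up sizes, the memory-only check keeps placing containers until memory is exhausted, long after the 4 vcores are used up, while the all-dimensions check stops once either resource runs out.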


