Posted to yarn-dev@hadoop.apache.org by "Andrew Chung (Jira)" <ji...@apache.org> on 2021/11/11 16:15:00 UTC

[jira] [Created] (YARN-11002) Resource-aware Opportunistic Container Allocation

Andrew Chung created YARN-11002:
-----------------------------------

             Summary: Resource-aware Opportunistic Container Allocation
                 Key: YARN-11002
                 URL: https://issues.apache.org/jira/browse/YARN-11002
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: container-queuing, resourcemanager
            Reporter: Andrew Chung
            Assignee: Andrew Chung


Currently, centralized opportunistic container (OContainer) allocation considers only the queue length on each node.
However, based on production experience, we found that relying only on queue length for allocating OContainers can lead to very high queue times.

Namely, while we rely on queue length to sort the candidate nodes for OContainer allocation, the queue length used in the computation can be stale, depending on node heartbeat (HB) frequency. This can lead to an exceedingly high number of containers queued on nodes, and the excessive queueing can leave queued vertices on the critical path of a job, blocking the rest of the job's vertices. Here's an example illustrating what can happen:
Suppose an NM can run 3 containers, and has no allocated containers to start with.
AM -> RM: heartbeat, allocate 6 containers
RM.NodeQueueLoadMonitor: nothing allocated on NM now, mark 3 containers on NM as allocated to AM, 3 containers from AM outstanding.
RM -> AM: allocate 3 containers on NM
NM -> RM: heartbeat with no containers allocated
RM.NodeQueueLoadMonitor: mark 0 containers allocated on NM
AM -> NM: run 3 containers
NM -> AM: ACK, running 3 containers; no more capacity on NM
AM -> RM: heartbeat
RM.NodeQueueLoadMonitor: nothing allocated on NM now, mark 3 more containers on NM as allocated to AM (in reality, there should be 3 containers already allocated on NM, and the 3 marked in this heartbeat will be queued), no containers from AM outstanding.
RM -> AM: allocate the 3 remaining containers on NM
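The double-counting in the sequence above can be sketched as a toy simulation. All names below are illustrative stand-ins, not the actual RM/NodeQueueLoadMonitor classes; the point is only to show how a heartbeat that arrives before containers start resets the RM's view and causes over-allocation:

```java
// Toy model of the stale queue-length accounting described above.
// Class and field names are illustrative, not real YARN classes.
public class StaleQueueDemo {
    static final int NM_CAPACITY = 3;

    // Returns the number of containers that end up queued on the NM
    // after the heartbeat interleaving in the example.
    static int simulate() {
        int rmViewAllocated = 0;  // RM's (possibly stale) view of the NM
        int nmRunning = 0;        // containers actually running on the NM

        rmViewAllocated += 3;     // AM -> RM: RM sees 0 used, grants 3
        rmViewAllocated = 0;      // NM -> RM HB: NM reports 0 running, view resets
        nmRunning = 3;            // AM -> NM: 3 containers start; NM is now full
        rmViewAllocated += 3;     // AM -> RM: RM grants 3 more off the stale view

        // The second grant exceeds the NM's capacity and must be queued.
        return nmRunning + rmViewAllocated - NM_CAPACITY;
    }

    public static void main(String[] args) {
        System.out.println("queued=" + simulate()); // prints "queued=3"
    }
}
```

Under a pure queue-length policy, those 3 queued containers then sit behind whatever is already running on the NM, however long that takes.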

We've found that, if OContainers are unlucky, they can be co-located with vertices that run for an excessively long time, forcing them to wait a long time before they can begin to run. This caused certain jobs to experience a more than 3x increase in median run time. In these cases, it is better for the OContainer to remain unreleased at the RM until NMs report available resources.

To address this, we propose a new allocation policy, {{QUEUE_LENGTH_THEN_RESOURCES}}, alongside the original {{QUEUE_LENGTH}} policy, that also considers the available node resources when allocating OContainers.
It first sorts the candidate nodes by queue length in ascending order, then by available resources in descending order.
In the case where the maximum queue length across nodes is zero, it simply waits until a node reports available resources before allocating OContainers to it.
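The proposed ordering could be sketched as a comparator over a simplified per-node record. This is only a sketch of the sort described above; the record fields and class names are assumptions, not the actual ClusterNode state used by the RM:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OContainerNodeOrdering {
    // Simplified stand-in for the RM's per-node state; field names
    // are illustrative, not the real YARN classes.
    record Node(String id, int queueLength, long availableMemMB, int availableVcores) {}

    // Sort by queue length ascending, then by available resources
    // descending, so lightly loaded nodes with the most free
    // resources are tried first.
    static final Comparator<Node> QUEUE_LENGTH_THEN_RESOURCES =
        Comparator.comparingInt(Node::queueLength)
            .thenComparing(Comparator.comparingLong(Node::availableMemMB).reversed())
            .thenComparing(Comparator.comparingInt(Node::availableVcores).reversed());

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>(List.of(
            new Node("n1", 2, 4096, 4),
            new Node("n2", 0, 1024, 1),
            new Node("n3", 0, 8192, 8)));
        nodes.sort(QUEUE_LENGTH_THEN_RESOURCES);
        // n3 comes first: empty queue and the most free resources.
        System.out.println(nodes.get(0).id()); // prints "n3"
    }
}
```

With this ordering, when every candidate node has a non-empty queue and no free resources, allocation can simply be deferred until a heartbeat reports capacity, rather than queueing behind long-running work.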



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
