Posted to yarn-issues@hadoop.apache.org by "Benjamin Teke (Jira)" <ji...@apache.org> on 2023/01/02 16:17:00 UTC

[jira] [Commented] (YARN-11403) Decommission Node reduces the maximumAllocation and leads to Job Failure

    [ https://issues.apache.org/jira/browse/YARN-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653651#comment-17653651 ] 

Benjamin Teke commented on YARN-11403:
--------------------------------------

[~prabhujoseph], [~vinay._.devadiga] what is the expected end result here? The behaviour of the maximum allocation changing with the number of nodes is by design (just as, for example, queue capacities and every other limit derived from them change when NMs are removed). Should the app stay in the running state until a preset time, or until the NMs come back online?
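
For illustration, a minimal sketch of that recomputation, assuming the advertised maximum allocation is simply the largest capability among the currently registered NodeManagers, capped by the configured maximum. The class, record and method names here are illustrative only, not Hadoop's actual ClusterNodeTracker API:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class MaxAllocationSketch {

    /** Hypothetical per-node capability: memory in MB and vcores (not the real Hadoop Resource type). */
    record NodeCapability(long memoryMb, int vcores) {}

    /** Recompute the effective maximum allocation from whatever nodes are currently registered. */
    static NodeCapability maxAllowedAllocation(List<NodeCapability> registeredNodes,
                                               NodeCapability configuredMax) {
        long maxMem = 0;
        int maxVcores = 0;
        for (NodeCapability node : registeredNodes) {
            maxMem = Math.max(maxMem, node.memoryMb());
            maxVcores = Math.max(maxVcores, node.vcores());
        }
        // The advertised maximum never exceeds the configured maximum, but it shrinks
        // when the largest registered node goes away.
        return new NodeCapability(Math.min(maxMem, configuredMax.memoryMb()),
                                  Math.min(maxVcores, configuredMax.vcores()));
    }

    public static void main(String[] args) {
        NodeCapability configuredMax = new NodeCapability(122880, 128);
        List<NodeCapability> nodes = new ArrayList<>(List.of(
                new NodeCapability(10240, 8),    // node1
                new NodeCapability(10240, 8)));  // node2
        System.out.println(maxAllowedAllocation(nodes, configuredMax)); // 10240 MB / 8 vcores

        // Both nodes start decommissioning: only the resources still in use
        // (e.g. a 2GB AM container on node1) remain visible to the tracker.
        nodes.clear();
        nodes.add(new NodeCapability(2048, 1));
        System.out.println(maxAllowedAllocation(nodes, configuredMax)); // 2048 MB / 1 vcore
    }
}
{code}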

> Decommission Node reduces the maximumAllocation and leads to Job Failure
> ------------------------------------------------------------------------
>
>                 Key: YARN-11403
>                 URL: https://issues.apache.org/jira/browse/YARN-11403
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.3.4
>            Reporter: Prabhu Joseph
>            Assignee: Vinay Devadiga
>            Priority: Major
>
> When a node is put into Decommission, ClusterNodeTracker updates the maximumAllocation to the total resources in use on that node. This can lead to a Job Failure (with the error message below) when the Job requests a container larger than the new maximumAllocation.
> {code:java}
> 22/11/03 10:55:02 WARN ApplicationMaster: Reporter thread fails 4 time(s) in a row.
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[vcores], Requested resource=<memory:896, max memory:2147483647, vCores:2, max vCores:2147483647>, maximum allowed allocation=<memory:896, vCores:1>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:122880, vCores:128>
> {code}
> *Repro:*
> 1. Cluster with two worker nodes, node1 and node2, each with a YARN NodeManager resource memory of 10GB; the configured maxAllocation is 10GB.
> 2. Submit a SparkPi Job (ApplicationMaster size: 2GB, Executor size: 4GB). Say the ApplicationMaster (2GB) is launched on node1.
> 3. Put both nodes into Decommission. This makes maxAllocation come down to 2GB.
> 4. The SparkPi Job fails as it requests an Executor of size 4GB, whereas maxAllocation is only 2GB (see the sketch below).
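> A simplified sketch of the validation that produces the failure above: any request larger than the current maximum allowed allocation (derived from the registered NodeManagers, not from the configured maximum alone) is rejected. The class and method names are illustrative, not the actual scheduler code:
> {code:java}
> public class AllocationCheckSketch {
> 
>     // Hypothetical validation mirroring the scheduler's check: reject any request
>     // larger than the current maximum allowed allocation.
>     static void validateRequest(long requestedMemMb, int requestedVcores,
>                                 long maxMemMb, int maxVcores) {
>         if (requestedMemMb > maxMemMb || requestedVcores > maxVcores) {
>             throw new IllegalArgumentException(String.format(
>                 "Invalid resource request: requested <memory:%d, vCores:%d> exceeds "
>                 + "maximum allowed allocation <memory:%d, vCores:%d>",
>                 requestedMemMb, requestedVcores, maxMemMb, maxVcores));
>         }
>     }
> 
>     public static void main(String[] args) {
>         // Before decommission: two 10GB/8-vcore nodes, the 4GB executor request fits.
>         validateRequest(4096, 1, 10240, 8);
> 
>         // After both nodes start decommissioning, the maximum allocation shrinks to
>         // the resources still in use (the 2GB AM), so the same request is rejected.
>         validateRequest(4096, 1, 2048, 1); // throws IllegalArgumentException
>     }
> }
> {code}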



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org