Posted to yarn-dev@hadoop.apache.org by "YCozy (Jira)" <ji...@apache.org> on 2020/04/10 20:43:00 UTC

[jira] [Created] (YARN-10231) When a NM is partitioned away, YARN service will complain about "Queue's AM resource limit exceeded"

YCozy created YARN-10231:
----------------------------

             Summary: When a NM is partitioned away, YARN service will complain about "Queue's AM resource limit exceeded" 
                 Key: YARN-10231
                 URL: https://issues.apache.org/jira/browse/YARN-10231
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.3.0
            Reporter: YCozy


We were testing YARN's RM failover code under network partitions and observed the following failure. We believe it is a bug and would like to confirm this with you.

Basically, we were testing the following scenario:
 # Start a YARN cluster with two RMs (e.g., RM1 and RM2) and one NM.
 # Make RM1 active.
 # Start a YARN service, e.g., the built-in sleeper service. Name it sleeper1.
 # Fail over from RM1 to RM2.
 # Stop sleeper1 and start another YARN service (e.g., the built-in sleeper service again). Name it sleeper2.
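
For reference, steps 2 to 5 can be sketched as a sequence of YARN CLI calls. This is only a sketch under assumptions: the RM ids rm1 and rm2 are hypothetical (they depend on yarn.resourcemanager.ha.rm-ids), a manual transition stands in for whatever failover mechanism the test used, and the partition step is left as a comment because it is specific to the fault injection setup. Clusters with automatic failover enabled may need extra flags for the rmadmin calls.

```python
import subprocess

# Hypothetical RM ids; the real values come from yarn.resourcemanager.ha.rm-ids.
RM1, RM2 = "rm1", "rm2"


def repro_commands(rm1=RM1, rm2=RM2):
    """Return the CLI calls for steps 2-5 as argv lists (step 1 is cluster setup)."""
    return [
        # Step 2: make RM1 active.
        ["yarn", "rmadmin", "-transitionToActive", rm1],
        # Step 3: start the built-in sleeper example as sleeper1.
        ["yarn", "app", "-launch", "sleeper1", "sleeper"],
        # Step 4: fail over from RM1 to RM2 (manual transition, an assumption).
        ["yarn", "rmadmin", "-transitionToStandby", rm1],
        ["yarn", "rmadmin", "-transitionToActive", rm2],
        # In the failing run, the NM is partitioned away at this point.
        # That step is environment-specific and is not shown here.
        # Step 5: stop sleeper1, start sleeper2, then check its status.
        ["yarn", "app", "-stop", "sleeper1"],
        ["yarn", "app", "-launch", "sleeper2", "sleeper"],
        ["yarn", "app", "-status", "sleeper2"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(repro_commands())  # dry run: prints the commands without executing them
```

Calling run(repro_commands(), dry_run=False) on a client node would execute the commands in order and stop at the first failure.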

When no network partition happens, everything is fine (e.g., sleeper2 can start successfully).

However, if the NM is partitioned away after the RM failover, sleeper2 fails to start. After we poll sleeper2's status for 30 seconds, its application report still reads as follows:
{code:java}
Application Report :
    Application-Id : application_4_0001
    Application-Name : sleeper2
    Application-Type : yarn-service
    User : root
    Queue : default
    Application Priority : 0
    Start-Time : 1585525063950
    Finish-Time : 0
    Progress : 0%
    State : ACCEPTED 
    Final-State : UNDEFINED 
    Tracking-URL : N/A 
    RPC Port : -1 
    AM Host : N/A
    Aggregate Resource Allocation : 0 MB-seconds, 0 vcore-seconds
    Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds 
    Log Aggregation Status : DISABLED
    Diagnostics : [Sun Mar 29 23:37:44 +0000 2020] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded.  Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>;  
    Unmanaged Application : false 
    Application Node Label Expression : <Not set> 
    AM container Node Label Expression : <DEFAULT_PARTITION> 
    TimeoutType : LIFETIME ExpiryTime : UNLIMITED RemainingTime : -1seconds
{code}
Since the only fault injected is a network partition, the queue's AM resource limit shouldn't be exceeded.
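
The numbers in the Diagnostics line can be checked by hand. The sketch below parses the resource fields from that line and applies a simplified admission rule: the new AM is admitted only if current AM usage plus the request fits within the queue's AM limit in every dimension. This rule is an assumption made for illustration and is not the scheduler's actual code; the function names are hypothetical.

```python
import re

# Diagnostics text copied from the application report above.
DIAGNOSTICS = (
    "Queue's AM resource limit exceeded.  Details : "
    "AM Partition = <DEFAULT_PARTITION>; "
    "AM Resource Request = <memory:1024, vCores:1>; "
    "Queue Resource Limit for AM = <memory:1024, vCores:1>; "
    "User AM Resource Limit of the queue = <memory:1024, vCores:1>; "
    "Queue AM Resource Usage = <memory:1024, vCores:1>;"
)

# Matches fields of the form "Name = <memory:N, vCores:M>".
RESOURCE = re.compile(r"([A-Za-z ]+?) = <memory:(\d+), vCores:(\d+)>")


def parse_resources(diagnostics):
    """Map each resource field name to a (memory_mb, vcores) tuple."""
    return {
        name.strip(): (int(mem), int(vcores))
        for name, mem, vcores in RESOURCE.findall(diagnostics)
    }


def am_fits(diagnostics):
    """Simplified check (assumption): usage + request must fit within the limit."""
    fields = parse_resources(diagnostics)
    request = fields["AM Resource Request"]
    limit = fields["Queue Resource Limit for AM"]
    usage = fields["Queue AM Resource Usage"]
    needed = tuple(u + r for u, r in zip(usage, request))
    fits = all(n <= l for n, l in zip(needed, limit))
    return fits, needed, limit


if __name__ == "__main__":
    fits, needed, limit = am_fits(DIAGNOSTICS)
    print(f"needed={needed} limit={limit} fits={fits}")
```

Under this model, the reported Queue AM Resource Usage already equals the limit, so one more 1024 MB AM cannot be admitted. The report does not say which application still accounts for that usage.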

We can reliably reproduce this bug using our fault injection engine. Please let us know if you need any info for debugging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
