Posted to user@hadoop.apache.org by Why Cozy <do...@gmail.com> on 2020/03/30 20:56:26 UTC

When a network partition happens, a YARN application fails to activate with "Queue's AM resource limit exceeded"

Dear Hadoop Developers,

We encountered the following failure when a network partition occurred.
Here is the sequence of steps:
1. Start a YARN (3.3.0) cluster with two RMs (RM1 and RM2), and one NM.
2. Make RM1 active.
3. Start an example sleeper YARN service (named sleeper1).
4. Failover from RM1 to RM2.
5. Verify that sleeper1 restarts. *==> At this point, the NM's network starts to fail*
6. Stop sleeper1.
7. Start an example sleeper YARN service (named sleeper2).
8. sleeper2 fails to start. The diagnostic message says:

Application is added to the scheduler and is not yet activated. Queue's AM
resource limit exceeded.  Details : AM Partition = <DEFAULT_PARTITION>; AM
Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM =
<memory:1024, vCores:1>; User AM Resource Limit of the queue =
<memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>;


If the network partition does not happen, sleeper2 starts successfully.
Why, then, does the diagnostic message complain about the AM resource
limit when there is a network failure? Is this a bug in YARN, or am I
missing something?
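
To make the numbers in the diagnostic message concrete, here is a minimal
sketch of the check I assume the scheduler performs before activating an
application: the new AM request is added to the queue's current AM usage
and compared with the queue's AM limit. The names Resource and
can_activate are hypothetical and only for illustration; this is not
YARN's actual code, and comparing both memory and vCores is my assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a <memory, vCores> pair from the diagnostic."""
    memory_mb: int
    vcores: int

    def __add__(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb + other.memory_mb,
                        self.vcores + other.vcores)

    def fits_in(self, limit: "Resource") -> bool:
        # Assumption: every dimension must stay within the limit.
        return (self.memory_mb <= limit.memory_mb
                and self.vcores <= limit.vcores)


def can_activate(am_request: Resource, am_usage: Resource,
                 am_limit: Resource) -> bool:
    """Assumed rule: activate only if usage + request stays within the limit."""
    return (am_usage + am_request).fits_in(am_limit)


# Values copied from the diagnostic message.
am_limit = Resource(1024, 1)    # Queue Resource Limit for AM
am_request = Resource(1024, 1)  # AM Resource Request (sleeper2)
am_usage = Resource(1024, 1)    # Queue AM Resource Usage

# With the reported usage, sleeper2's AM does not fit.
print(can_activate(am_request, am_usage, am_limit))        # False

# If the queue's AM usage were zero, the same request would fit.
print(can_activate(am_request, Resource(0, 0), am_limit))  # True
```

Under this reading, sleeper2 is rejected only because the queue's AM usage
is still reported as <memory:1024, vCores:1> at the time it is submitted.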


Thanks!