Posted to yarn-issues@hadoop.apache.org by "Szilard Nemeth (JIRA)" <ji...@apache.org> on 2019/03/27 21:09:00 UTC

[jira] [Assigned] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold

     [ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szilard Nemeth reassigned YARN-9421:
------------------------------------

    Assignee:     (was: Szilard Nemeth)

> Implement SafeMode for ResourceManager by defining a resource threshold
> -----------------------------------------------------------------------
>
>                 Key: YARN-9421
>                 URL: https://issues.apache.org/jira/browse/YARN-9421
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Szilard Nemeth
>            Priority: Major
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
>  The test does the following: 
>  1. Sets up a resource named "gpu"
>  2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
>  3. It executes a sleep job with the resource requests
>  "-Dmapreduce.reduce.resource.gpu=7" and "-Dyarn.app.mapreduce.am.resource.gpu=11" (a sketch of such a command is shown after this list).
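> A minimal sketch of such a submission command, assuming the standard hadoop-mapreduce-client-jobclient tests jar (which contains the sleep job); the install path and version are placeholders taken from the pi example further below:
> {code:java}
> # Illustrative only: submit the MapReduce sleep job with the "gpu" requests above.
> # /opt/hadoop and MY_HADOOP_VERSION are placeholders for the local install.
> MY_HADOOP_VERSION=3.3.0-SNAPSHOT
> pushd /opt/hadoop
> bin/yarn jar "./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-$MY_HADOOP_VERSION-tests.jar" sleep \
>   -Dmapreduce.reduce.resource.gpu=7 \
>   -Dyarn.app.mapreduce.am.resource.gpu=11 \
>   -m 1 -r 1 -mt 1000 -rt 1000
> popd{code}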
> Sometimes the app submission fails with:
> {code:java}
> 2019-02-25 06:09:56,795 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission failed in validating AM resource request for application application_1551103768202_0001
>  org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[gpu], Requested resource=<memory:1024, vCores:1, gpu: 11>, maximum allowed allocation=<memory:8192, vCores:1>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:16003, vCores:4, gpu: 9223372036854775807>{code}
> It's clearly visible that the maximum allowed allocation does not have any "gpu" resources.
>  
> Looking into the logs further, I realized that sometimes the node that has the "gpu" resources is registered after the app is submitted.
>  In a real-world situation, and even with this very special test execution, we can't be sure in which order the NMs register with the RM.
>  With the advent of resource types, this issue is more likely to surface.
> If we have a cluster with some "rare" resources like GPUs available only on a few nodes out of 100, we can quickly run into a situation where the NMs with GPUs register later than the normal nodes. While the critical NMs are still registering, we will most likely hit the same InvalidResourceRequestException if we submit jobs requesting GPUs.
> There is a naive solution to this:
>  1. Give the RM some time to wait for NMs to register and put submitted applications on hold in the meantime. This could work in some situations, but it's not the most flexible solution, as different clusters can have different requirements. Of course, we can make it somewhat more flexible by making the timeout value configurable.
> *A more flexible alternative would be:*
>  2. We define a threshold of Resource capability: while the cluster has not yet reached this threshold, submitted jobs are put on hold; once the threshold is reached, jobs are allowed to pass through.
>  This is very similar to an already existing concept, the SafeMode in HDFS ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
>  Going back to my GPU example above, the threshold could be: 8 vcores, 16 GB, 3 GPUs.
>  By defining a threshold like this, we can ensure most of the submitted jobs won't be lost, just "parked" until the NMs have registered.
> The final solution could be the Resource threshold alone, or the combination of the threshold and a timeout value; a purely illustrative configuration sketch of both follows below. I'm open to any other suggestions as well.
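> To make the idea more concrete, here is a minimal sketch of what such a configuration could look like in yarn-site.xml. These property names are hypothetical (nothing like them exists yet); they only illustrate a per-resource-type threshold plus an optional timeout:
> {code:java}
> <!-- Hypothetical properties, for illustration only: not implemented anywhere yet. -->
> <!-- The RM would stay in SafeMode (parking submitted apps) until the registered -->
> <!-- NMs provide at least this aggregate capacity. -->
> <property>
>  <name>yarn.resourcemanager.safemode.threshold.memory-mb</name>
>  <value>16384</value>
> </property>
> <property>
>  <name>yarn.resourcemanager.safemode.threshold.vcores</name>
>  <value>8</value>
> </property>
> <property>
>  <name>yarn.resourcemanager.safemode.threshold.gpu</name>
>  <value>3</value>
> </property>
> <!-- Optional: leave SafeMode after this timeout even if the threshold is not met. -->
> <property>
>  <name>yarn.resourcemanager.safemode.timeout-ms</name>
>  <value>120000</value>
> </property>{code}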
> *Last but not least, a very easy way to reproduce the issue on a 3-node cluster:*
>  1. Configure a resource type, named 'testres'.
>  2. Node1 runs the RM, Node2 and Node3 run NMs
>  3. Node2 has 1 testres
>  4. Node3 has 0 testres
>  5. Stop all nodes
>  6. Start RM on Node1
>  7. Start NM on Node3 (the one without the resource)
>  8. Start a pi job, requesting 1 testres for the AM
> Here's the command to start the job:
> {code:java}
> MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" pi -Dyarn.app.mapreduce.am.resource.testres=1 1 1000;popd{code}
>  
> *Configurations*: 
>  node1: yarn-site.xml of ResourceManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>{code}
> node2: yarn-site.xml of NodeManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>
> <property>
>  <name>yarn.nodemanager.resource-type.testres</name>
>  <value>1</value>
> </property>{code}
> node3: yarn-site.xml of NodeManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>{code}
> Please see the attached full process logs from the RM, the NMs and the YARN client.



