Posted to yarn-issues@hadoop.apache.org by "Eric Yang (JIRA)" <ji...@apache.org> on 2019/03/27 22:22:00 UTC

[jira] [Commented] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold

    [ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803416#comment-16803416 ] 

Eric Yang commented on YARN-9421:
---------------------------------

There are a few corner cases to consider.  If the size of the YARN cluster changes frequently, the safe mode mechanism might kick in at random times.  If jobs are queued during safe mode, tracking the job queue also increases the memory usage of the ResourceManager.  At some point, the queue will be full, because the ResourceManager has a finite amount of memory for this tracking.

What happens if the job queue is full, and what happens if jobs take too long to start and miss their SLA?  If the job queue is full, we fall back to the same type of error message showing that resources are unavailable.  It might be better to let the client-side retry decision kick in sooner, rather than queuing the job and finding out later that the queue is full.  Option 2 is an option to mask transient problems, but the retry logic still depends on the client to make the right decision.  I think the default behavior does not need to change for production clusters, but option 2 is nice to have for improving the user experience on test clusters.
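
A rough sketch of the client-side retry behavior I have in mind is below; the Submission interface and ResourceUnavailableException are hypothetical stand-ins for illustration, not the YARN client API:
{code:java}
// Minimal sketch: retry the submission on the client with exponential backoff
// instead of parking the job in an RM-side queue. Hypothetical types, not YARN API.
public final class RetryingSubmitter {

  /** Hypothetical wrapper for "requested resource is greater than maximum allowed". */
  static final class ResourceUnavailableException extends Exception {
    ResourceUnavailableException(String msg) { super(msg); }
  }

  /** Hypothetical submission callback (e.g. wrapping a real job submission). */
  interface Submission {
    String submit() throws ResourceUnavailableException;
  }

  static String submitWithRetry(Submission job, int maxAttempts, long initialBackoffMs)
      throws Exception {
    long backoff = initialBackoffMs;
    for (int attempt = 1; ; attempt++) {
      try {
        return job.submit();
      } catch (ResourceUnavailableException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up and surface the same "resource unavailable" error
        }
        Thread.sleep(backoff);
        backoff = Math.min(backoff * 2, 60_000L); // exponential backoff, capped at 1 minute
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Toy submission that fails twice, as if the "gpu" node registers late.
    final int[] calls = {0};
    String appId = submitWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new ResourceUnavailableException("requested resource > maximum allowed allocation");
      }
      return "application_1551103768202_0001";
    }, 5, 1_000L);
    System.out.println("Submitted " + appId + " after " + calls[0] + " attempts");
  }
}
{code}
With exponential backoff the client finds out quickly that resources are not available, and the retry decision stays on the client side instead of filling an RM-side queue.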

> Implement SafeMode for ResourceManager by defining a resource threshold
> -----------------------------------------------------------------------
>
>                 Key: YARN-9421
>                 URL: https://issues.apache.org/jira/browse/YARN-9421
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Szilard Nemeth
>            Priority: Major
>         Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
>  The test does the following: 
>  1. Sets up a resource named "gpu"
>  2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
>  3. It executes a sleep job with resource requests: 
>  "-Dmapreduce.reduce.resource.gpu=7" and "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes, we encounter situations where the app submission fails with: 
> {code:java}
> 2019-02-25 06:09:56,795 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission failed in validating AM resource request for application application_1551103768202_0001
>  org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[gpu], Requested resource=<memory:1024, vCores:1, gpu: 11>, maximum allowed allocation=<memory:8192, vCores:1>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:16003, vCores:4, gpu: 9223372036854775807>{code}
> It's clearly visible that the maximum allowed allocation does not have any "gpu" resources.
>  
> Looking into the logs further, I realized that sometimes the node that has the "gpu" resources is registered after the app is submitted.
>  In a real-world situation, and even with this very specific test execution, we can't be sure in which order the NMs register with the RM.
>  With the advent of resource types, this issue is more likely to surface.
> If we have a cluster where some "rare" resources like GPUs are present only on a few nodes out of 100, we can quickly run into a situation where the NMs with GPUs register later than the normal nodes. While these critical NMs are still registering, we will most likely experience the same InvalidResourceRequestException if we submit jobs requesting GPUs.
> There is a naive solution to this: 
>  1. Give the RM some time to wait for the NMs to register, and put submitted applications on hold in the meantime. This could work in some situations, but it's not the most flexible solution, as different clusters can have different requirements. Of course, we can make this more flexible by making the timeout value configurable.
> *A more flexible alternative would be:*
>  2. We define a threshold of Resource capability: while the cluster has not yet reached this threshold, we put submitted jobs on hold. Once the threshold is reached, we let jobs pass through. 
>  This is very similar to an already existing concept, the SafeMode in HDFS ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
>  Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 GPUs. 
>  By defining a threshold like this, we can ensure that most of the submitted jobs won't be lost, just "parked" until the NMs are registered.
> The final solution could be the Resource threshold alone, or a combination of the threshold and a timeout value; a rough sketch of the threshold idea follows below. I'm open to any other suggestions as well.
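> A minimal, plain-Java sketch of the proposed threshold gate (the map-based resource model and the values below are only stand-ins for illustration, not the real YARN Resource class or any existing configuration):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> 
> // Sketch: hold ("park") submissions until the aggregate capability of the
> // registered NodeManagers reaches a configured threshold, then let jobs through.
> public final class RmSafeModeGate {
> 
>   private final Map<String, Long> threshold;            // e.g. vcores=8, memory-mb=16384, gpu=3
>   private final Map<String, Long> registered = new HashMap<>();
> 
>   RmSafeModeGate(Map<String, Long> threshold) {
>     this.threshold = threshold;
>   }
> 
>   // Called whenever a NodeManager registers with the RM.
>   synchronized void onNodeRegistered(Map<String, Long> nodeCapability) {
>     nodeCapability.forEach((name, value) -> registered.merge(name, value, Long::sum));
>   }
> 
>   // Submissions stay parked while any threshold dimension is unmet.
>   synchronized boolean inSafeMode() {
>     return threshold.entrySet().stream()
>         .anyMatch(e -> registered.getOrDefault(e.getKey(), 0L) < e.getValue());
>   }
> 
>   public static void main(String[] args) {
>     Map<String, Long> threshold = new HashMap<>();
>     threshold.put("vcores", 8L);
>     threshold.put("memory-mb", 16384L);
>     threshold.put("gpu", 3L);
>     RmSafeModeGate gate = new RmSafeModeGate(threshold);
> 
>     gate.onNodeRegistered(Map.of("vcores", 4L, "memory-mb", 8192L));            // normal node
>     System.out.println("safe mode: " + gate.inSafeMode());                      // true, no gpu yet
>     gate.onNodeRegistered(Map.of("vcores", 4L, "memory-mb", 8192L, "gpu", 4L)); // gpu node
>     System.out.println("safe mode: " + gate.inSafeMode());                      // false
>   }
> }
> {code}
> The same gate could additionally track a timeout, so that safe mode ends either when the threshold is reached or when the configured waiting time elapses, whichever comes first.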
> *Last but not least, a very easy way to reproduce the issue on a 3-node cluster:* 
>  1. Configure a resource type named 'testres'.
>  2. Node1 runs the RM, Nodes 2/3 run NMs
>  3. Node2 has 1 testres
>  4. Node3 has 0 testres
>  5. Stop all nodes
>  6. Start RM on Node1
>  7. Start NM on Node3 (the one without the resource)
>  8. Start a pi job, request 1 testres for the AM
> Here's the command to start the job:
> {code:java}
> MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" pi -Dyarn.app.mapreduce.am.resource.testres=1 1 1000;popd{code}
>  
> *Configurations*: 
>  node1: yarn-site.xml of ResourceManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>{code}
> node2: yarn-site.xml of NodeManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>
> <property>
>  <name>yarn.nodemanager.resource-type.testres</name>
>  <value>1</value>
> </property>{code}
> node3: yarn-site.xml of NodeManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>{code}
> Please see the full process logs from the RM, NM, and YARN client attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org