Posted to yarn-issues@hadoop.apache.org by "Szilard Nemeth (JIRA)" <ji...@apache.org> on 2018/05/13 10:19:00 UTC

[jira] [Comment Edited] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request

    [ https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473426#comment-16473426 ] 

Szilard Nemeth edited comment on YARN-8248 at 5/13/18 10:18 AM:
----------------------------------------------------------------

Hi [~haibochen]

Thanks for your comments!

1. I'm fine with removing the first null check:
{code:java}
if (rmApp == null || rmApp.getAMResourceRequests() == null) {
  LOG.debug("rmApp or rmApp.AMResourceRequests was null!");
}
{code}
but as {{RMAppManager.validateAndCreateResourceRequest()}} can return a null value for the AM requests, I would leave the second null check that is just before the loop on amRequests:
{code:java}
if (rmApp != null && rmApp.getAMResourceRequests() != null) {
{code}
Maybe it could be simplified to
{code:java}
if (rmApp.getAMResourceRequests() != null) {
{code}
since rmApp should be non-null at this point.
 Which do you prefer?

 

2. It is true that {{Resources.fitsIn(amResourceRequest.getCapability(), queueMaxShare)}} would always return false when the {{queueMaxShare}} is 0 for any resource, but the problem with using only {{Resources.fitsIn}} is that it would also return false when the requested resource is larger than the max resource and that max resource is not zero, e.g. requested vCores = 2, max vCores = 1.
 With this check, I only wanted to catch those cases where there is a resource request of any resource type but the queue has 0 of that resource in {{queueMaxShare}}.
 In this sense, in the if condition this check would be enough:
{code:java}
Resources.isAnyMajorResourceZero(DOMINANT_RESOURCE_CALCULATOR, queueMaxShare)
{code}
but this check alone is not sufficient, since it does not verify whether the resource is actually requested. For example, if an application does not request any vCores (maybe this cannot happen in reality) and the queue has 0 vCores as its maximum, then the request is perfectly reasonable and we don't need to reject the application. On the other hand, if an app requests 1 vCore and the queue has 0 vCores as its maximum, then rejection should happen.
 Does this explanation make it clearer?
 Do you think some comments need to be added to the code above the if condition?
 How would you update the diagnostic message?
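
 To illustrate the difference between the two checks discussed in this point, here is a minimal standalone sketch. It does not use the YARN {{Resource}} / {{Resources}} classes: the class name, the method names and the map-based resource representation are all hypothetical, and treating a resource that is missing from the queue config as 0 is an assumption made only for this example.

```java
import java.util.Map;

// Hypothetical stand-in for the discussion above; this is NOT the YARN Resources API.
final class ZeroMaxShareCheck {
  private ZeroMaxShareCheck() {}

  /** Plain "fits in" analogue: every requested amount must be <= the queue maximum. */
  static boolean fitsIn(Map<String, Long> requested, Map<String, Long> queueMax) {
    for (Map.Entry<String, Long> e : requested.entrySet()) {
      // Assumption: a resource missing from the queue config counts as 0.
      long max = queueMax.getOrDefault(e.getKey(), 0L);
      if (e.getValue() > max) {
        return false;
      }
    }
    return true;
  }

  /** Narrower check: true only if some resource is requested (> 0) while the queue max is exactly 0. */
  static boolean requestsResourceWithZeroMax(Map<String, Long> requested,
                                             Map<String, Long> queueMax) {
    for (Map.Entry<String, Long> e : requested.entrySet()) {
      long max = queueMax.getOrDefault(e.getKey(), 0L);
      if (e.getValue() > 0 && max == 0) {
        return true;
      }
    }
    return false;
  }
}
```

 With requested vCores = 2 and max vCores = 1, {{fitsIn}} is false but the narrower check does not fire. With requested vCores = 1 and max vCores = 0, the narrower check fires. With requested vCores = 0 and max vCores = 0, neither check rejects the request.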

 

3. The overall intention of my changes in {{FairScheduler}} was the following: 
 Essentially, in {{addApplication()}}, the AM resource requests are checked against the queue's max resources.
 In {{allocate()}}, I check whether any container allocation (e.g. map/reduce) resource request is made against a queue that has 0 of any resource configured as max resource.
 So in my understanding, it can happen that the app is not rejected in {{addApplication()}}, for example because the AM does not request vCores while the queue has 0 vCores configured as max resources, but 1 vCore is then requested for a map container.
 Please tell me whether this is clear.
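
 The scenario in this point can be modelled with a small standalone sketch. It is not the FairScheduler code: the class, the enum and the method names are hypothetical, resources are represented as plain maps, and a resource missing from the queue config is assumed to count as 0.

```java
import java.util.Map;

// Hypothetical model of the two check points described above; NOT the FairScheduler code.
final class TwoStageRejectionSketch {
  private TwoStageRejectionSketch() {}

  enum Outcome { ACCEPTED, REJECTED_AT_SUBMISSION, REJECTED_AT_ALLOCATION }

  /** True if any resource is requested (> 0) while the queue max for it is 0. */
  static boolean hitsZeroMax(Map<String, Long> request, Map<String, Long> queueMax) {
    for (Map.Entry<String, Long> e : request.entrySet()) {
      // Assumption: a resource missing from the queue config counts as 0.
      if (e.getValue() > 0 && queueMax.getOrDefault(e.getKey(), 0L) == 0) {
        return true;
      }
    }
    return false;
  }

  /**
   * First stage mirrors the check at submission time (AM request),
   * second stage mirrors the check at allocation time (container request).
   */
  static Outcome evaluate(Map<String, Long> amRequest,
                          Map<String, Long> containerRequest,
                          Map<String, Long> queueMax) {
    if (hitsZeroMax(amRequest, queueMax)) {
      return Outcome.REJECTED_AT_SUBMISSION;
    }
    if (hitsZeroMax(containerRequest, queueMax)) {
      return Outcome.REJECTED_AT_ALLOCATION;
    }
    return Outcome.ACCEPTED;
  }
}
```

 For a queue with 0 vCores as max, an AM request with 0 vCores passes the first stage, and a following map container request with 1 vCore is rejected only in the second stage.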

 

4. 
 {{testAppRejectedToQueueZeroCapacityOfResource()}}: Tests that an application is rejected when the AM resource request exceeds the queue's maximum resources (tests the code added to {{FairScheduler.addApplication}}).

{{testSchedulingRejectedToQueueZeroCapacityOfResource()}}: Tests that an application is rejected when a map/reduce container request exceeds the queue's maximum resources (tests the code added to {{FairScheduler.allocate}}).
 Please see my comment for 3., where I explained a case in which an application is not rejected immediately upon submission but only when a map/reduce container request happens.

About the uncovered unit test: Good point. I was wondering whether we should reject an application only if the AM request for a resource is greater than 0 and we have 0 configured as the max for that resource, or simply in any case where the requested resource is greater than the max resource, regardless of whether the max is 0 or not.

If the latter is the case, then I agree that the unit tests and the if-conditions in the production code need to be changed accordingly (using just {{Resources.fitsIn}} will work, I guess).

I'm fine with either way, and since you are more familiar with FairScheduler, please advise which way I should go.

5.
 - Removed the unused import.
 - Renamed the methods as you suggested.
 - Thanks for the log change suggestions, you were right about those; the messages are much easier to understand that way.

 

Thanks!


> Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8248
>                 URL: https://issues.apache.org/jira/browse/YARN-8248
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, yarn
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-8248-001.patch, YARN-8248-002.patch, YARN-8248-003.patch, YARN-8248-004.patch, YARN-8248-005.patch, YARN-8248-006.patch
>
>
> Job hangs when mapreduce.job.queuename is specified and the queue has 0 of any resource (vcores / memory / other)
> In this scenario, the job should be immediately rejected upon submission since the specified queue cannot serve the resource needs of the submitted job.
>  
> Command to run:
> {code:java}
> bin/yarn jar "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" pi -Dmapreduce.job.queuename=sample_queue 1 1000;{code}
> fair-scheduler.xml queue config (excerpt):
>  
> {code:java}
>  <queue name="sample_queue">
>     <minResources>10000 mb,0vcores</minResources>
>     <maxResources>90000 mb,0vcores</maxResources>
>     <maxRunningApps>50</maxRunningApps>
>     <maxAMShare>-1.0f</maxAMShare>
>     <weight>2.0</weight>
>     <schedulingPolicy>fair</schedulingPolicy>
>   </queue>
> {code}
> Diagnostic message from the web UI: 
> {code:java}
> [Wed May 02 06:35:57 -0700 2018] Application is added to the scheduler and is not yet activated. (Resource request: <memory:1536, vCores:1> exceeds current queue or its parents maximum resource allowed).{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
