Posted to yarn-issues@hadoop.apache.org by "Szilard Nemeth (JIRA)" <ji...@apache.org> on 2019/04/01 10:30:03 UTC

[jira] [Comment Edited] (YARN-9430) Recovering containers does not check available resources on node

    [ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806627#comment-16806627 ] 

Szilard Nemeth edited comment on YARN-9430 at 4/1/19 10:29 AM:
---------------------------------------------------------------

Hi [~adam.antal]!

Thanks for the comment and mentioning these scenarios!

In general, I would like to involve more experienced people to help with the decision: [~tangzhankun], [~sunilg], [~leftnoteasy], [~wilfreds]

After our offline discussion with [~adam.antal] and [~shuzirra], we have the following questions: 
 # Should the NM kill containers when the mapped GPU devices are not present (or if there are not enough resources)?
 For example: a container requested 1 GPU but no GPU is available, what should happen?
 To decide this, it's crucial to know the motivation behind saving container-GPU device mappings into the state store.
 AFAIK, the NM assigns containers to "random" GPU devices. If the mapping between GPU and container does not matter when the container starts, why does it matter during recovery?
 # Do you agree that containers should be killed on the NM side if there are not enough resources for them? The AM should handle lost containers anyway (see the sketch after this list).
 # If an assigned GPU (GPU #1) is offline after recovery but another GPU (GPU #2) is available, what should the NM do? Should it allocate GPU #2 to the container?
 The answer to this question also depends on the decision about whether we keep the GPU-container mappings or not.
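
To make option 2 concrete, here is a minimal, purely hypothetical sketch of an NM-side check; the class and method names are invented for illustration and this is not actual NM code. It only assumes the public Resource/Resources APIs:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

/**
 * Hypothetical illustration of option 2, not actual NM code: decide at
 * recovery time whether a recovered container still fits on the node.
 */
final class RecoveryResourceCheck {

  /**
   * Tries to reserve the recovered container's resources on the node.
   * Deducts the resources and returns true if the container fits;
   * returns false if the caller should kill the container instead
   * (the AM is expected to handle the lost container and re-request it).
   */
  static boolean tryReserveOnRecovery(Resource available, Resource required) {
    if (!Resources.fitsIn(required, available)) {
      return false;
    }
    Resources.subtractFrom(available, required);
    return true;
  }
}
{code}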

Thanks!


> Recovering containers does not check available resources on node
> ----------------------------------------------------------------
>
>                 Key: YARN-9430
>                 URL: https://issues.apache.org/jira/browse/YARN-9430
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Critical
>
> I have a testcase that checks that if some GPU devices have gone offline and recovery happens, only the containers that fit into the node's resources are recovered. Unfortunately, this is not the case: the RM does not check the available resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:* 
>  1. There are 2 nodes running NodeManagers
>  2. nvidia-smi is replaced with a fake bash script that reports 2 GPU devices per node, initially. This means 4 GPU devices in the cluster altogether.
>  3. RM / NM recovery is enabled
>  4. The test starts off a sleep job, requesting 4 containers, 1 GPU device for each (AM does not request GPUs)
>  5. Before restart, the fake bash script is adjusted so that, after restart, it reports 1 GPU device per node (2 in the cluster).
>  6. Restart is initiated.
>  
> *Expected behavior:* 
>  After restart, only the AM and 2 normal containers should have been started, as there are only 2 GPU devices in the cluster.
>  
> *Actual behavior:* 
>  The AM + 4 containers are allocated; these are all the containers originally started in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>  
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_000001 of type RECOVER
> 2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_000001 to scheduler from user: systest
>  2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_000001 is recovering. Skipping notifying ATTEMPT_ADDED
>  2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_000001 State change from NEW to LAUNCHED on event = RECOVER
> 2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000001, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000004, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000004 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, vCores:2, yarn.io/gpu: 1> used and <memory:37252, vCores:6> available after allocation
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000005, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
>  2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000005 of type RECOVER
>  2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000005 Container Transitioned from NEW to RUNNING
>  2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000005 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> available after allocation
> 2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000003, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
>  2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000003 of type RECOVER
>  2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000003 Container Transitioned from NEW to RUNNING
>  2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
>  2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000003 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-3.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, vCores:2, yarn.io/gpu: 2> used and <memory:37252, vCores:6, yarn.io/gpu: -1> available after allocation
>  2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: SchedulerAttempt appattempt_1553977186701_0001_000001 is recovering container container_e84_1553977186701_0001_01_000003
> {code}
>  
> There are multiple logs like this:
> {code:java}
> Assigned container container_e84_1553977186701_0001_01_000005 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> available after allocation{code}
> *Note the -1 value for the yarn.io/gpu resource!*
> The issue lies in this method: [https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179]
> The problem is that the method deductUnallocatedResource does not check whether the unallocated resource remains at or above zero after the container's resource is subtracted from it; the subtraction happens unconditionally.
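> To illustrate the arithmetic with a minimal standalone demo (this is not scheduler code; it only uses the public Resource/Resources APIs):
> {code:java}
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.util.resource.Resources;
>
> public class NegativeSubtractionDemo {
>   public static void main(String[] args) {
>     // The node is already fully allocated: nothing is left.
>     Resource unallocated = Resource.newInstance(0, 0);
>     // Recovering one more container subtracts its resources unconditionally...
>     Resources.subtractFrom(unallocated, Resource.newInstance(1000, 1));
>     // ...so the result goes negative, matching the testcase output below.
>     System.out.println(unallocated); // prints <memory:-1000, vCores:-1>
>   }
> }
> {code}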
>  Here is the ResourceManager call hierarchy for the method (from top to bottom):
> {code:java}
> 1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#handle
> 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#addNode
> 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler#recoverContainersOnNode
> 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#recoverContainer
> 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode#allocateContainer
> 6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#allocateContainer(org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer, boolean)
> deduct is called here!{code}
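> For reference, the deduction at the bottom of this hierarchy boils down to an unconditional subtract-and-add. The following is a simplified sketch paraphrased from the linked SchedulerNode source, not the exact upstream code:
> {code:java}
> // Simplified sketch of SchedulerNode#deductUnallocatedResource; the real
> // method also handles a null resource. Note that there is no check that
> // 'resource' actually fits into 'unallocatedResource'.
> private synchronized void deductUnallocatedResource(Resource resource) {
>   Resources.subtractFrom(unallocatedResource, resource);
>   Resources.addTo(allocatedResource, resource);
> }
> {code}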
> *Testcase that reproduces the issue:* 
>  *Add this testcase to TestFSSchedulerNode:*
>  
> {code:java}
> @Test
> public void testRecovery() {
>   RMNode node = createNode();
>   FSSchedulerNode schedulerNode = new FSSchedulerNode(node, false);
>
>   // Fill the node completely with two containers.
>   RMContainer container1 =
>       createContainer(Resource.newInstance(4096, 4), null);
>   RMContainer container2 =
>       createContainer(Resource.newInstance(4096, 4), null);
>
>   schedulerNode.allocateContainer(container1);
>   schedulerNode.containerStarted(container1.getContainerId());
>   schedulerNode.allocateContainer(container2);
>   schedulerNode.containerStarted(container2.getContainerId());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
>
>   // Recover a third container even though the node is already full.
>   RMContainer container3 =
>       createContainer(Resource.newInstance(1000, 1), null);
>   when(container3.getState()).thenReturn(RMContainerState.NEW);
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
>
>   schedulerNode.recoverContainer(container3);
>
>   assertEquals("No resource should have been unallocated",
>       Resources.none(), schedulerNode.getUnallocatedResource());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
> }
> {code}
>  
>  
> *Result of testcase:*
> {code:java}
> java.lang.AssertionError: No resource should have been unallocated 
> Expected :<memory:0, vCores:0>
> Actual :<memory:-1000, vCores:-1>{code}
> *It's immediately clear that not only GPUs (or other custom resource types) but all resource types are affected by this issue!*
>  
> *Possible fix:* 
>  1. A condition needs to be introduced that checks whether there are enough resources on the node; the container's recovery should proceed only if this is true.
>  2. An error log should be added. At first glance this seems sufficient, so no exception is required, but this needs more thorough investigation and a manual test on a cluster! (A minimal sketch of such a guard follows.)
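> A minimal sketch of such a guard, assuming it is placed in SchedulerNode#recoverContainer (the method and helper names follow the upstream class, but the exact shape of the fix is still open and needs verification on a cluster):
> {code:java}
> // Sketch only: recover the container only if it still fits into the
> // node's unallocated resources; otherwise log an error and skip it,
> // so the AM can re-request the container.
> public synchronized void recoverContainer(RMContainer rmContainer) {
>   Resource required = rmContainer.getAllocatedResource();
>   if (!Resources.fitsIn(required, getUnallocatedResource())) {
>     LOG.error("Cannot recover container " + rmContainer.getContainerId()
>         + ": required " + required + " does not fit into unallocated "
>         + getUnallocatedResource() + " on node " + getNodeID());
>     return;
>   }
>   allocateContainer(rmContainer, true);
> }
> {code}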
>  


