Posted to yarn-issues@hadoop.apache.org by "Wangda Tan (JIRA)" <ji...@apache.org> on 2018/06/14 06:15:00 UTC

[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

    [ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512019#comment-16512019 ] 

Wangda Tan commented on YARN-8423:
----------------------------------

Thanks [~ssathish@hortonworks.com] for filing the issue.

It took me a while to check the issue; it seems the YARN NM takes more than 2 minutes to kill a running container, which causes the issue in the description.

Here's a possible sequence of events that triggers the issue (a simplified sketch of the NM-side race follows the list):

1) Container_1 holds GPU resources on node_1.
2) Container_1's application gets killed from the RM, so the scheduler considers container_1's resources released.
3) Container_2 from another app gets allocated on node_1 with some GPU resources.
4) At the same time, the RM notifies node_1 to kill container_1.
5) For some reason, container_1 is not killed immediately. (In the failed job, the container was only killed after 2 minutes!)
6) Container_2's launch request arrives at node_1 before container_1 is killed.
7) Container_2 fails to launch because the GPU resources are not yet marked as released on the NM side.
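
To make the race concrete, here is a minimal, self-contained sketch of per-node GPU bookkeeping. This is not the actual NM code; the class and method names (SimpleGpuAllocator, assign, release) are hypothetical and only illustrate the ordering problem: devices go back to the free pool only when release() runs after the old container has fully exited, so an assign() that arrives earlier fails even though the RM-side scheduler already treats the resources as available.

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal per-node GPU bookkeeping sketch (hypothetical, not NM code).
// Devices are only freed by release(), which in this model runs after the
// old container has fully exited, mirroring the ordering problem above.
public class SimpleGpuAllocator {

  // GPU minor number -> containerId currently holding that device
  private final Map<Integer, String> assigned = new HashMap<>();
  private final Set<Integer> allDevices = new HashSet<>();

  public SimpleGpuAllocator(Set<Integer> devices) {
    allDevices.addAll(devices);
  }

  // Called on container launch; throws if not enough GPUs are free.
  public synchronized Set<Integer> assign(String containerId, int numGpus) {
    Set<Integer> granted = new HashSet<>();
    for (Integer minor : allDevices) {
      if (granted.size() == numGpus) {
        break;
      }
      if (!assigned.containsKey(minor)) {
        granted.add(minor);
      }
    }
    if (granted.size() < numGpus) {
      throw new IllegalStateException("not enough free GPUs for " + containerId);
    }
    for (Integer minor : granted) {
      assigned.put(minor, containerId);
    }
    return granted;
  }

  // Called only once the container process has fully exited.
  public synchronized void release(String containerId) {
    assigned.values().removeIf(id -> id.equals(containerId));
  }

  public static void main(String[] args) {
    SimpleGpuAllocator alloc = new SimpleGpuAllocator(Set.of(0, 1));
    alloc.assign("container_1", 1);     // container_1 holds one of the two GPUs
    // The RM-side scheduler already treats container_1's resources as freed,
    // but the NM has not finished killing it, so release() has not run yet.
    try {
      alloc.assign("container_2", 2);   // asks for 2 GPUs, only 1 is free
    } catch (IllegalStateException e) {
      System.out.println("container_2 launch fails: " + e.getMessage());
    }
    alloc.release("container_1");       // after the (slow) kill completes
    alloc.assign("container_2", 2);     // now succeeds
  }
}
{code}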

This issue is not specific to GPU, but GPU fails fast because it needs hard binding to specific GPU devices. I think we may need to revisit the NM container launch behavior: in extreme cases, NM memory could be overcommitted if a new container arrives before the old container is fully killed.
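
One possible direction, sketched below with the same hypothetical SimpleGpuAllocator from the previous snippet (an illustration, not a proposed patch): rather than failing the new container immediately, the launch path could wait a bounded amount of time for the previous container's devices to actually be released.

{code:java}
import java.util.Set;

// Sketch of a bounded wait on the launch path. Names (LaunchWithGrace,
// assignWithGrace) are hypothetical; the idea is simply to tolerate a slow
// kill of the previous container instead of failing the new launch outright.
public final class LaunchWithGrace {

  public static Set<Integer> assignWithGrace(SimpleGpuAllocator alloc,
      String containerId, int numGpus, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
      try {
        return alloc.assign(containerId, numGpus);
      } catch (IllegalStateException notEnoughGpus) {
        if (System.currentTimeMillis() >= deadline) {
          throw notEnoughGpus; // still no free GPUs after the grace period
        }
        Thread.sleep(200); // the old container may still be shutting down
      }
    }
  }
}
{code}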

cc: [~sunil.govind@gmail.com]

> GPU does not get released even though the application gets killed.
> ------------------------------------------------------------------
>
>                 Key: YARN-8423
>                 URL: https://issues.apache.org/jira/browse/YARN-8423
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Sumana Sathish
>            Assignee: Wangda Tan
>            Priority: Critical
>
> Run a TensorFlow app requesting one GPU.
> Kill the application once the GPU is allocated.
> Query the NodeManager once the application is killed. We see that the GPU is not being released.
> {code}
>  curl -i <NM>/ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"<productName>","uuid":"GPU-<UID>","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"<productName>","uuid":"GPU-<UID>","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":"<version>"},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_<containerID>"}]}
> {code}



