Posted to issues@mesos.apache.org by "Charles Natali (Jira)" <ji...@apache.org> on 2020/05/06 20:32:00 UTC

[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

    [ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101162#comment-17101162 ] 

Charles Natali commented on MESOS-8038:
---------------------------------------

The more I think about it, the more I'm convinced that the current behavior of optimistically releasing the resources is very sub-optimal.

We've had cgroup destruction fail for various reasons in our cluster:
 * kernel bugs - see https://issues.apache.org/jira/browse/MESOS-10107
 * tasks stuck in uninterruptible sleep, e.g. blocked on NFS I/O (see the sketch below)
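
For reference, here is a minimal diagnostic sketch of how we spot the second case - plain Python, not Mesos code, and the cgroup path is just an example assuming a cgroup v1 freezer-style layout. It lists the processes in a container's cgroup that are in uninterruptible sleep ("D" state), which is what keeps the cgroup destruction from ever completing:

#!/usr/bin/env python3
"""List processes in a cgroup that are stuck in uninterruptible sleep."""

import os

CGROUP = "/sys/fs/cgroup/freezer/mesos"  # example path, adjust to the container's cgroup


def d_state_pids(cgroup_path):
    stuck = []
    with open(os.path.join(cgroup_path, "cgroup.procs")) as procs:
        for line in procs:
            pid = line.strip()
            try:
                with open("/proc/%s/stat" % pid) as stat:
                    data = stat.read()
            except FileNotFoundError:
                continue  # the process exited between the two reads
            # The state field comes right after the parenthesised command
            # name; "D" means uninterruptible sleep (e.g. blocked on NFS I/O).
            state = data.rsplit(")", 1)[1].split()[0]
            if state == "D":
                stuck.append(pid)
    return stuck


if __name__ == "__main__":
    for pid in d_state_pids(CGROUP):
        print("pid %s is stuck in uninterruptible sleep" % pid)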

When this happens, it triggers at least the following problems:
 * this issue with GPUs, which causes all subsequent tasks scheduled on the host and trying to use a GPU to fail, effectively turning the host into a black hole (see the toy model after this list)
 * another problem where some tasks stuck in uninterruptible sleep were still consuming memory, so the agent overcommitted memory, causing tasks further down the line to be OOM-killed
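
To make the GPU "black hole" concrete, here is a toy model - plain Python, not Mesos code, all names made up for illustration. It assumes the agent-side GPU allocator only gets a device back once the container's cgroup is actually destroyed, while the resources are released and re-offered optimistically regardless; the result is that every GPU task subsequently sent to the host fails with an error like the one quoted in this ticket:

class AgentGpuAllocator:
    """Agent-side view: a GPU only returns to the pool once the
    container's cgroup has actually been destroyed."""

    def __init__(self, gpus):
        self.free = set(gpus)

    def allocate(self, requested):
        if requested > len(self.free):
            raise RuntimeError("Requested %d but only %d available"
                               % (requested, len(self.free)))
        return {self.free.pop() for _ in range(requested)}

    def release(self, gpus):
        self.free |= gpus


agent = AgentGpuAllocator({"GPU0"})
master_offerable_gpus = 0        # master-side view of this agent

# Task 1 launches and takes the only GPU on the host.
held = agent.allocate(1)

# Task 1 terminates, but the cgroup cannot be destroyed (e.g. a process is
# stuck in uninterruptible sleep), so the agent never releases the device...
cgroup_destroyed = False
if cgroup_destroyed:
    agent.release(held)

# ...yet the resources are released optimistically and re-offered anyway.
master_offerable_gpus += 1

# Task 2 is scheduled on the "free" GPU and fails - and so will every
# GPU task sent to this host from now on.
try:
    agent.allocate(1)
except RuntimeError as error:
    print("Task 2 failed: %s" % error)

Which is why I think holding the resources back until the cgroup is confirmed destroyed (or marking the GPU as gone) seems preferable to re-offering something which can never actually be allocated.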

"Leaking" CPU is mostly fine because it's a compressible resource and stuck tasks generally don't use it, but it's pretty bad for memory and GPU, causing errors which are hard to diagnose and automatically recover from.

> Launching GPU task sporadically fails.
> --------------------------------------
>
>                 Key: MESOS-8038
>                 URL: https://issues.apache.org/jira/browse/MESOS-8038
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, gpu
>    Affects Versions: 1.4.0
>            Reporter: Sai Teja Ranuva
>            Assignee: Zhitao Li
>            Priority: Critical
>         Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by the executor getting killed and the tasks getting lost. This happens even before the job starts. A little search in the code base points me to something related to the GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)