Posted to issues@mesos.apache.org by "Jan Ulferts (Jira)" <ji...@apache.org> on 2020/10/05 16:38:00 UTC

[jira] [Commented] (MESOS-10192) Recent Nvidia CUDA changes break Mesos GPU support

    [ https://issues.apache.org/jira/browse/MESOS-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208177#comment-17208177 ] 

Jan Ulferts commented on MESOS-10192:
-------------------------------------

Hey, just some info from my analysis of this:

With CUDA 11, a {{/dev/nvidia-caps}} folder was added. The code at https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454 expects device files, not folders. IMO this is a bug, and we should either add support for nvidia-caps or at least have the isolator ignore folders.

For now, in DC/OS I've pinned cuda and nvidia-driver so that nvidia-caps isn't available, but that's only a short-term workaround.
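
To illustrate the "ignore folders" option, here is a rough sketch of what such a filter could look like. It is not the Mesos code and does not use Mesos/stout helpers: {{isCharacterDevice}} and {{findDeviceFiles}} are hypothetical names, and I'm assuming the device paths are discovered with a glob-style pattern such as {{/dev/nvidia*}}. It uses only POSIX {{glob}} and {{stat}}.

```cpp
#include <glob.h>
#include <sys/stat.h>

#include <string>
#include <vector>

// Hypothetical helper (not part of Mesos): returns true only if `path`
// exists and refers to a character device.
bool isCharacterDevice(const std::string& path)
{
  struct stat s;
  if (::stat(path.c_str(), &s) != 0) {
    return false; // Missing or inaccessible paths are treated as non-devices.
  }
  return S_ISCHR(s.st_mode);
}

// Hypothetical helper: expands `pattern` and keeps only the matches that are
// character devices, so a directory such as /dev/nvidia-caps is skipped
// instead of being handed on as if it were a device file.
std::vector<std::string> findDeviceFiles(const std::string& pattern)
{
  std::vector<std::string> devices;

  glob_t matches = {}; // Zero-initialized so globfree() is always safe.
  if (::glob(pattern.c_str(), 0, nullptr, &matches) == 0) {
    for (size_t i = 0; i < matches.gl_pathc; ++i) {
      const std::string path = matches.gl_pathv[i];
      if (isCharacterDevice(path)) {
        devices.push_back(path);
      }
    }
  }
  globfree(&matches);

  return devices;
}
```

Calling {{findDeviceFiles("/dev/nvidia*")}} on a CUDA 11 host would then, under these assumptions, return the device nodes but leave out the {{/dev/nvidia-caps}} folder. Note that this only avoids the launch failure; it does not expose whatever lives inside nvidia-caps to the container, which would be the "add support" option.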

> Recent Nvidia CUDA changes break Mesos GPU support
> --------------------------------------------------
>
>                 Key: MESOS-10192
>                 URL: https://issues.apache.org/jira/browse/MESOS-10192
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, gpu
>            Reporter: Greg Mann
>            Priority: Major
>              Labels: GPU, containerization, containerizer, gpu
>
> Recently it seems that the layout of the Nvidia device files has changed:  https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
> This prevents GPU tasks from launching:
> {noformat}
> W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160: Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a special file: /dev/nvidia-caps
> {noformat}
> due to this code, which detects the nvidia device files: https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454



--
This message was sent by Atlassian Jira
(v8.3.4#803005)