Posted to issues@aurora.apache.org by "Kostiantyn Bokhan (JIRA)" <ji...@apache.org> on 2016/12/23 16:58:58 UTC

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

    [ https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15773271#comment-15773271 ] 

Kostiantyn Bokhan commented on AURORA-1763:
-------------------------------------------

PATH and LD_LIBRARY_PATH are not transferred from the image either, so nvidia-smi fails with:

{code}NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.{code}
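
One way to confirm this from inside a task is to check whether either variable has an entry under the driver mount. This is only a rough sketch: the /usr/local/nvidia prefix is taken from the issue description, while the helper name vars_missing_prefix and its parameters are made up for illustration. It does not assume any particular subdirectory layout under that prefix.

```python
# Hypothetical diagnostic helper; not part of Aurora or Mesos.
NVIDIA_PREFIX = "/usr/local/nvidia"  # mount point named in the issue description


def vars_missing_prefix(env, prefix=NVIDIA_PREFIX,
                        names=("PATH", "LD_LIBRARY_PATH")):
    """Return the variable names that have no entry at or under `prefix`.

    `env` is any mapping of variable name to a colon-separated path list,
    for example os.environ inside the task.
    """
    missing = []
    for name in names:
        # Split on ":" (Linux path-list separator) and drop empty entries.
        entries = [e for e in env.get(name, "").split(":") if e]
        if not any(e == prefix or e.startswith(prefix + "/") for e in entries):
            missing.append(name)
    return missing
```

Calling vars_missing_prefix(os.environ) inside the task would list the variables that lack an entry under the prefix; an empty list would mean both were carried over.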



> GPU drivers are missing when using a Docker image
> -------------------------------------------------
>
>                 Key: AURORA-1763
>                 URL: https://issues.apache.org/jira/browse/AURORA-1763
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.16.0
>            Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified containerizer, the Nvidia drivers are not correctly mounted. As an experiment, I launched a task with both mesos-execute and Aurora, using the same Docker image, and ran nvidia-smi. During the experiment I noticed that the /usr/local/nvidia folder was not being mounted properly. To confirm this was the issue, I tar'ed up the drivers (/run/mesos/isolators/gpu/nvidia_352.39) and manually added them to the Docker image. With that change, the task was able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - cgroup cgroup rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
> 97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc binfmt_misc rw
> 98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 99 98 8:17 /mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 100 99 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/sandbox /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs/mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 67 78 0:33 / /run/user/1001 rw,nosuid,nodev,relatime master:26 - tmpfs tmpfs rw,size=13219080k,mode=700,uid=1001,gid=1001{noformat}
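
The comparison between the two mountinfo listings quoted above could be automated roughly as follows. This is a sketch, not something from the issue: the function names mount_points and has_mount are hypothetical, and the only format assumption is the one documented in proc(5), namely that the fifth whitespace-separated field of each mountinfo line is the mount point.

```python
# Hypothetical helpers for inspecting /proc/<pid>/mountinfo text.
def mount_points(mountinfo_text):
    """Return the set of mount points found in mountinfo-formatted text."""
    points = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Fields: mount ID, parent ID, major:minor, root, mount point, ...
        if len(fields) >= 5:
            points.add(fields[4])
    return points


def has_mount(mountinfo_text, target="/usr/local/nvidia"):
    """Return True if `target` appears as a mount point in the text."""
    return target in mount_points(mountinfo_text)
```

Inside a task, passing the contents of /proc/self/mountinfo to has_mount should return True for the mesos-execute case and False for the Aurora case shown in the listings.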



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)