Posted to dev@mesos.apache.org by Justin Pinkul <jp...@live.com> on 2016/09/12 23:48:56 UTC

GPU containerizer post mortem

Hi everyone,
I just ran into a very subtle configuration problem with enabling GPU support on Mesos and thought I'd share a brief post mortem.
Scenario: Running a GPU Mesos task. The task first executes nvidia-smi to confirm the GPUs are visible, then runs a Caffe training example to verify a GPU is actually usable.
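For illustration, the task command was roughly of this shape (a hypothetical sketch, not the exact command I ran; the real solver and paths were different):

    # check GPU visibility, then try to actually allocate a CUDA device by training
    nvidia-smi && caffe train -solver examples/mnist/lenet_solver.prototxt -gpu 0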
Symptom: nvidia-smi reported the correct number of GPUs, but the training example crashed when creating the CUDA device.
Debugging tactics: To debug this I added an infinite loop to the end of the task so the environment would not be torn down. Next I logged into the machine, found the PID of the Mesos task, and entered its namespaces with: nsenter -t $TASK_PID -m -u -i -n -p -r -w
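Roughly, that sequence looked like this (a sketch; the pgrep pattern "my-gpu-task" is a made-up stand-in for however you locate the task's PID on the agent):

    TASK_PID=$(pgrep -f my-gpu-task)                 # assumption: the pattern uniquely matches the task
    sudo nsenter -t $TASK_PID -m -u -i -n -p -r -w   # join the task's namespaces and get a shell inside them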

At this point I attempted to manually run the test and it worked. The reason it worked was that my test terminal had not been added to the devices CGROUP. So next I added it to the CGROUP with: echo $TEST_TERMINAL_PID >> /sys/fs/cgroup/devices/mesos/f6736041-d403-4494-95fd-604eace34ce1/tasks
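To double-check that the shell really joined the container's cgroup, something like this works (a sketch):

    grep devices /proc/$TEST_TERMINAL_PID/cgroup     # should now show the .../mesos/f6736041-... path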
After joining the CGROUP I could reproduce the problem, and I then systematically added devices to the CGROUP's allow list until the test worked.
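That experimentation with the allow list was along these lines (a minimal sketch, reusing the container ID from above; the major:minor numbers are illustrative, not the real values from that machine):

    CG=/sys/fs/cgroup/devices/mesos/f6736041-d403-4494-95fd-604eace34ce1
    cat $CG/devices.list                    # what the container is currently allowed to access
    echo "c 249:0 rwm" > $CG/devices.allow  # whitelist one character device (read/write/mknod)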
Root cause: After rebooting a machine the nvidia-uvm device is not created automatically. To create this device, "sudo mknod -m 666 /dev/nvidia-uvm c 250 0" had been added to a startup script. The problem with this is that nvidia-uvm uses a major device ID in the experimental (dynamically assigned) range, so the major device ID might change on boot, which means the hardcoded value of 250 in the startup script can be wrong. When Mesos started up it read the major device ID from /dev/nvidia-uvm, which matched the value given by the startup script, and then used that number instead of the correct one when it created the devices CGROUP. nvidia-smi worked because it never accessed the nvidia-uvm device.
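The mismatch is easy to see once you know to look for it (illustrative commands and output; the numbers will differ per machine and boot):

    grep nvidia-uvm /proc/devices   # e.g. "249 nvidia-uvm" -- the major number the kernel actually assigned
    ls -l /dev/nvidia-uvm           # the node made by mknod, still showing the hardcoded major 250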
The fix: Do not hardcode the major device ID of nvidia-uvm in a startup script. Instead, bring the device up with: nvidia-modprobe -u -c 0
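As I understand the nvidia-modprobe flags (worth verifying against the man page for your driver version), the startup-script fragment amounts to:

    # -u    load the nvidia-uvm module and create /dev/nvidia-uvm with the kernel-assigned major number
    # -c 0  also create the /dev/nvidia0 device node for the first GPU
    nvidia-modprobe -u -c 0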

I hope this information helps someone. A big thanks to Kevin Klues for helping me debug this issue.
Justin