Posted to issues@mesos.apache.org by "Dylan Bethune-Waddell (JIRA)" <ji...@apache.org> on 2016/10/22 21:01:58 UTC

[jira] [Comment Edited] (MESOS-6383) NvidiaGpuAllocator::resources cannot load symbol nvmlDeviceGetMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?

    [ https://issues.apache.org/jira/browse/MESOS-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15598490#comment-15598490 ] 

Dylan Bethune-Waddell edited comment on MESOS-6383 at 10/22/16 9:01 PM:
------------------------------------------------------------------------

Hi Kevin,

First of all, yes, this is just the first symbol I happened to hit. I will cross-reference the NVML changelog and the Mesos code for additional symbols that might need a fallback when I get a chance.
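
For what it's worth, here is a rough standalone probe I can use to see which NVML entry points a given driver build actually exports before deciding what needs a fallback. This is just a sketch using {{dlopen}}/{{dlsym}} directly, not the actual Mesos loading code, and the symbol list below is illustrative rather than the exact set Mesos loads:

{code:cpp}
// Standalone probe: check which NVML symbols libnvidia-ml.so.1 exports.
// Build with: g++ -std=c++11 probe.cpp -ldl
#include <dlfcn.h>

#include <iostream>
#include <string>
#include <vector>

int main() {
  // Open the NVML shared library the same way a dynamic loader would.
  void* lib = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
  if (lib == nullptr) {
    std::cerr << "dlopen failed: " << dlerror() << std::endl;
    return 1;
  }

  // Illustrative list only; nvmlDeviceGetMinorNumber is the one missing on 319.72.
  const std::vector<std::string> symbols = {
    "nvmlInit",
    "nvmlDeviceGetCount",
    "nvmlDeviceGetHandleByIndex",
    "nvmlDeviceGetMinorNumber",
  };

  for (const std::string& symbol : symbols) {
    dlerror();  // Clear any stale error state before the lookup.
    void* address = dlsym(lib, symbol.c_str());
    std::cout << symbol << ": "
              << (address != nullptr ? "present" : "missing") << std::endl;
  }

  dlclose(lib);
  return 0;
}
{code}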

Second, I am not sure that {{nvidia-smi}} even tries to get the minor number in earlier versions, since from what I've read it essentially wraps the NVML library. The man page on our cluster for driver version 319.72 / CUDA 5.5 does not have the "GPU Attributes -> Minor Number" entry that appears in the manpage for [later versions of the driver|http://developer.download.nvidia.com/compute/cuda/6_0/rel/gdk/nvidia-smi.331.38.pdf]. We might not be able to use the CUDA runtime either, as CUDA and NVML can enumerate the device IDs differently according to [{{cuda-smi}}|https://github.com/al42and/cuda-smi]. That page also mentions that CUDA 7.0 onwards supports setting CUDA_DEVICE_ORDER=PCI_BUS_ID, which "makes this tool slightly less useful", but the [{{nvidia-docker}}|https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation] explanation of GPU isolation indicates that the PCI bus ordering may not be consistent with the device character file minor number anyway. I also don't like CUDA_VISIBLE_DEVICES being a factor, but perhaps I'm just being paranoid. The nvidia-smi manpage also notes that "It is recommended that users desiring consistency use either UUID or PCI bus ID, since device enumeration ordering is not guaranteed to be consistent between reboots and board serial number might be shared between multiple GPUs on the same board". I'm not the best person to interpret what all of this means, but these are the places I've been looking for reference.

To me this suggests that the best approach might be to figure out which character device file in {{/dev/nvidia*}} maps to which PCI bus location, so the minor number can be correlated with each GPU's UUID. I was fairly sure there would be a canonical way to take the major/minor number of, say, the {{/dev/nvidia1}} device file and figure out the PCI info for the device behind it - but no luck yet. The {{nvidia-modprobe}} project also led me to believe that different distros [create the device files automatically but in different ways|https://github.com/NVIDIA/nvidia-modprobe/blob/master/nvidia-modprobe.c#L18-L23], so poking around in various places in a distro-dependent manner might work, although hopefully there's a better way than that. I am also not convinced that the way device files are created via the {{nvidia-modprobe}} utility is deterministic, and I suspect it is not, since it seems that [matching devices are simply counted|https://github.com/NVIDIA/nvidia-modprobe/blob/master/modprobe-utils/pci-sysfs.c#L146-L158] to figure out [how many device files to create|https://github.com/NVIDIA/nvidia-modprobe/blob/master/nvidia-modprobe.c#L192-L201] - I probably didn't spend enough time going over the code there to offer any definitive insights, though.
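
Just to separate the easy part from the hard part: reading the minor number off the device files themselves is simple; it's tying that number to a PCI bus location without NVML that I'm stuck on. A rough sketch of the easy half, assuming the character devices are named {{/dev/nvidia0}}, {{/dev/nvidia1}}, and so on:

{code:cpp}
// Sketch: stat each /dev/nvidiaN character device and print its major/minor
// numbers from st_rdev. Mapping that minor number to a PCI bus location
// without NVML is the open question above.
#include <sys/stat.h>
#include <sys/sysmacros.h>  // major()/minor(); on older glibc they come via <sys/types.h>.

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

int main() {
  for (int i = 0; ; ++i) {
    const std::string path = "/dev/nvidia" + std::to_string(i);

    struct stat s;
    if (stat(path.c_str(), &s) != 0) {
      if (errno == ENOENT) {
        break;  // No more GPU device files.
      }
      std::cerr << path << ": " << strerror(errno) << std::endl;
      return 1;
    }

    std::cout << path
              << " major=" << major(s.st_rdev)
              << " minor=" << minor(s.st_rdev) << std::endl;
  }

  return 0;
}
{code}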

I did find, on CentOS 6.4 (which is what we're running), a {{/proc/driver/nvidia/gpus/[0,1,etc.]}} directory for each GPU, where the file {{/proc/driver/nvidia/gpus/0/information}} reads like this for an unprivileged user:

{noformat}
Model:           Tesla K20m
IRQ:             40
GPU UUID:        GPU-????????-????-????-????-????????????
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        40 bits
DMA Mask:        0xffffffffff
Bus Location:    0000:20.00.0
{noformat}

So an ugly hack for my specific case would be to check whether those 0/1 directory names correspond to {{/dev/nvidia[0,1]}}, and if so, parse the directory name and the bus location info from there instead of using NVML. Seems pretty fragile, though.
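
Concretely, the hack would look something like the sketch below, assuming the numeric directory layout I see under {{/proc/driver/nvidia/gpus}} on this driver version (newer drivers appear to name these directories by PCI bus ID instead, so this would not generalize):

{code:cpp}
// Sketch of the hack above: walk /proc/driver/nvidia/gpus/<N>/information
// (numeric directory names, as seen on driver 319.72) and pull out the UUID
// and bus location. The assumption that <N> lines up with /dev/nvidiaN is
// exactly what remains unverified.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  for (int i = 0; ; ++i) {
    const std::string path =
      "/proc/driver/nvidia/gpus/" + std::to_string(i) + "/information";

    std::ifstream file(path.c_str());
    if (!file.is_open()) {
      break;  // No more numeric GPU entries.
    }

    std::string line, uuid, busLocation;
    while (std::getline(file, line)) {
      // Each line looks like "Key:   value"; split on the first ':' only,
      // since the bus location value itself contains colons.
      const size_t colon = line.find(':');
      if (colon == std::string::npos) {
        continue;
      }

      const std::string key = line.substr(0, colon);
      std::string value = line.substr(colon + 1);
      value.erase(0, value.find_first_not_of(" \t"));

      if (key == "GPU UUID") { uuid = value; }
      if (key == "Bus Location") { busLocation = value; }
    }

    std::cout << "index " << i
              << " -> uuid=" << uuid
              << " bus=" << busLocation << std::endl;
  }

  return 0;
}
{code}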

WDYT?

> NvidiaGpuAllocator::resources cannot load symbol nvmlDeviceGetMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6383
>                 URL: https://issues.apache.org/jira/browse/MESOS-6383
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 1.0.1
>            Reporter: Dylan Bethune-Waddell
>            Priority: Minor
>              Labels: gpu
>
> We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We are not in a position to upgrade the Nvidia drivers in the near future, and are currently at driver version 319.72.
> When attempting to launch an agent with the following command and take advantage of Nvidia GPU support (master address elided):
> bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort> --work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}
> I receive the following error message:
> bq. {{Failed to create a containerizer: Failed call to NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}
> Based on the change log for the NVML module, it seems that {{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and later as per info under the [Changes between NVML v5.319 Update and v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] heading in the NVML API reference.
> Is there an alternate method of obtaining this information at runtime to enable support for older versions of the Nvidia driver? Based on discussion in the design document, obtaining this information from the {{nvidia-smi}} command output is a feasible alternative.
> I am willing to submit a PR that amends the behaviour of {{NvidiaGpuAllocator}} such that it first attempts to call {{nvmlDeviceGetMinorNumber}} via libnvidia-ml, and if the symbol cannot be found, falls back on the {{--nvidia-smi="/path/to/nvidia-smi"}} option passed to mesos-agent if provided (or attempts to run {{nvidia-smi}} if found on the path) and parses the output to obtain this information. Otherwise, it raises an exception indicating that all of this was attempted.
> Would a function or class for parsing {{nvidia-smi}} output be a useful contribution?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)