You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Chun-Hung Hsiao <ch...@apache.org> on 2019/02/20 06:26:23 UTC

Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/
-----------------------------------------------------------

Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.


Bugs: MESOS-9549
    https://issues.apache.org/jira/browse/MESOS-9549


Repository: mesos


Description
-------

nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
injection that was originally done in the image to its new nvidia
container runtime. To adapt this change, we adjusted the binaries and
libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
variables in the `gpu/nvidia` isolator.


Diffs
-----

  src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
  src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
  src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 


Diff: https://reviews.apache.org/r/70016/diff/1/


Testing
-------

`sudo make check`

More testing done later in chain.


Thanks,

Chun-Hung Hsiao


Re: Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review213013
-----------------------------------------------------------




src/slave/containerizer/mesos/isolators/gpu/volume.cpp
Line 418 (original), 474 (patched)
<https://reviews.apache.org/r/70016/#comment298865>

    The new cuda images contain compatibility libraries and those also need to be bind-mounted or copied to the libdirs:
    https://github.com/NVIDIA/libnvidia-container/blob/fe20a8e4a17a63df8116f39795173a461325fb3d/src/nvc_container.c#L195-L217
    https://github.com/NVIDIA/libnvidia-container/blob/fe20a8e4a17a63df8116f39795173a461325fb3d/src/nvc_mount.c#L486-L500


- Chun-Hung Hsiao


On Feb. 20, 2019, 6:26 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:26 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
> injection that was originally done in the image to its new nvidia
> container runtime. To adapt this change, we adjusted the binaries and
> libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
> variables in the `gpu/nvidia` isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On Feb. 20, 2019, 9:50 p.m., Kevin Klues wrote:
> > src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
> > Lines 418 (patched)
> > <https://reviews.apache.org/r/70016/diff/1/?file=2125909#file2125909line418>
> >
> >     This has the same limitations that the original nvidia-docker does, it's just that we inject these envvars here now instead of relying on the docker image to do it.
> >     
> >     If you really want to change to the true nvidia-docker2 model, you need to follow the logic in https://github.com/NVIDIA/libnvidia-container
> >     
> >     The libraries are no longer injected in /usr/lib/nvidia, but rather in default lib locations, with ldcache inside the container being updated after the injection.
> 
> Chun-Hung Hsiao wrote:
>     Thanks Kevin. Yeah I'm aware of the `libnvidia-container` and updated `BINARIES` and `LIBRARIES` based on that, but haven't dug into the repo much yet. I'll take a look at how thing are currently done in the repo and see if we can adapt the new logic with minimal changes in Mesos.

Does it worth the complexity to detect what "default" libdirs to use like this? https://github.com/NVIDIA/libnvidia-container/blob/fe20a8e4a17a63df8116f39795173a461325fb3d/src/nvc_container.c#L276-L311

I'm fine bind-mounting the binaries into `/usr/bin` (https://github.com/NVIDIA/libnvidia-container/blob/fe20a8e4a17a63df8116f39795173a461325fb3d/src/nvc_mount.c#L465) so we don't need to inject `PATH`, but the above logic looks hacky.

WDYT?


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review212986
-----------------------------------------------------------


On Feb. 20, 2019, 6:26 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:26 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
> injection that was originally done in the image to its new nvidia
> container runtime. To adapt this change, we adjusted the binaries and
> libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
> variables in the `gpu/nvidia` isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On Feb. 20, 2019, 9:50 p.m., Kevin Klues wrote:
> > src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
> > Lines 418 (patched)
> > <https://reviews.apache.org/r/70016/diff/1/?file=2125909#file2125909line418>
> >
> >     This has the same limitations that the original nvidia-docker does, it's just that we inject these envvars here now instead of relying on the docker image to do it.
> >     
> >     If you really want to change to the true nvidia-docker2 model, you need to follow the logic in https://github.com/NVIDIA/libnvidia-container
> >     
> >     The libraries are no longer injected in /usr/lib/nvidia, but rather in default lib locations, with ldcache inside the container being updated after the injection.

Thanks Kevin. Yeah I'm aware of the `libnvidia-container` and updated `BINARIES` and `LIBRARIES` based on that, but haven't dug into the repo much yet. I'll take a look at how thing are currently done in the repo and see if we can adapt the new logic with minimal changes in Mesos.


> On Feb. 20, 2019, 9:50 p.m., Kevin Klues wrote:
> > src/slave/containerizer/mesos/isolators/gpu/volume.cpp
> > Lines 203-249 (patched)
> > <https://reviews.apache.org/r/70016/diff/1/?file=2125911#file2125911line213>
> >
> >     This feels like a big hack.
> >     
> >     I guess the main idea though, is to do what the original nvidia-docker used to do in the container image with PATH and LD_LIBRARY_PATH (including all of its limitations), but do it manually in Mesos if / when we detect that a GPU is needed by a container.

Yeah a big hack indeed lol.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review212986
-----------------------------------------------------------


On Feb. 20, 2019, 6:26 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:26 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
> injection that was originally done in the image to its new nvidia
> container runtime. To adapt this change, we adjusted the binaries and
> libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
> variables in the `gpu/nvidia` isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On Feb. 20, 2019, 9:50 p.m., Kevin Klues wrote:
> > src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
> > Lines 418 (patched)
> > <https://reviews.apache.org/r/70016/diff/1/?file=2125909#file2125909line418>
> >
> >     This has the same limitations that the original nvidia-docker does, it's just that we inject these envvars here now instead of relying on the docker image to do it.
> >     
> >     If you really want to change to the true nvidia-docker2 model, you need to follow the logic in https://github.com/NVIDIA/libnvidia-container
> >     
> >     The libraries are no longer injected in /usr/lib/nvidia, but rather in default lib locations, with ldcache inside the container being updated after the injection.
> 
> Chun-Hung Hsiao wrote:
>     Thanks Kevin. Yeah I'm aware of the `libnvidia-container` and updated `BINARIES` and `LIBRARIES` based on that, but haven't dug into the repo much yet. I'll take a look at how thing are currently done in the repo and see if we can adapt the new logic with minimal changes in Mesos.
> 
> Chun-Hung Hsiao wrote:
>     Does it worth the complexity to detect what "default" libdirs to use like this? https://github.com/NVIDIA/libnvidia-container/blob/fe20a8e4a17a63df8116f39795173a461325fb3d/src/nvc_container.c#L276-L311
>     
>     I'm fine bind-mounting the binaries into `/usr/bin` (https://github.com/NVIDIA/libnvidia-container/blob/fe20a8e4a17a63df8116f39795173a461325fb3d/src/nvc_mount.c#L465) so we don't need to inject `PATH`, but the above logic looks hacky.
>     
>     WDYT?

Discussed offline. The short-term goal is to upstream this patch so we can support new CUDA images, and the long-term goal is to migrate to `libnvidia-container`.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review212986
-----------------------------------------------------------


On Feb. 20, 2019, 6:26 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:26 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
> injection that was originally done in the image to its new nvidia
> container runtime. To adapt this change, we adjusted the binaries and
> libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
> variables in the `gpu/nvidia` isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

Posted by Kevin Klues <kl...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review212986
-----------------------------------------------------------




src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 418 (patched)
<https://reviews.apache.org/r/70016/#comment298847>

    This has the same limitations that the original nvidia-docker does, it's just that we inject these envvars here now instead of relying on the docker image to do it.
    
    If you really want to change to the true nvidia-docker2 model, you need to follow the logic in https://github.com/NVIDIA/libnvidia-container
    
    The libraries are no longer injected in /usr/lib/nvidia, but rather in default lib locations, with ldcache inside the container being updated after the injection.



src/slave/containerizer/mesos/isolators/gpu/volume.cpp
Lines 203-249 (patched)
<https://reviews.apache.org/r/70016/#comment298846>

    This feels like a big hack.
    
    I guess the main idea though, is to do what the original nvidia-docker used to do in the container image with PATH and LD_LIBRARY_PATH (including all of its limitations), but do it manually in Mesos if / when we detect that a GPU is needed by a container.


- Kevin Klues


On Feb. 20, 2019, 6:26 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:26 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
> injection that was originally done in the image to its new nvidia
> container runtime. To adapt this change, we adjusted the binaries and
> libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
> variables in the `gpu/nvidia` isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported CUDA 10+ images that are based on nvidia-container-runtime.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On Feb. 25, 2019, 5:24 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/isolators/gpu/volume.cpp
> > Lines 493 (patched)
> > <https://reviews.apache.org/r/70016/diff/5/?file=2126232#file2126232line502>
> >
> >     Can you also add a TODO here about relying on `NVIDIA_DRIVER_CAPABILITIES` env var instead in the future?

After looking into the semantics of `NVIDIA_DRIVER_CAPABILITIES`, it seems better if we use the `NVIDIA_VISIBLE_DEVICES`:
`NVIDIA_VISIBLE_DEVICES`: void or empty or unset: nvidia-container-runtime will have the same behavior as runc.
`NVIDIA_DRIVER_CAPABILITIES`: empty or unset: use default driver capability: utility.
So the real control is `NVIDIA_VISIBLE_DEVICES`.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review213171
-----------------------------------------------------------


On Feb. 21, 2019, 11:03 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2019, 11:03 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Nvidia's CUDA 10+ images are based on nvidia-container-runtime and thus
> the runtime injection are from the images themselves. To adapt this
> change, we adjusted the binaries and libraries and injected the `PATH`
> and `LD_LIBRARY_PATH` environment variables in the `gpu/nvidia`
> isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/5/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported CUDA 10+ images that are based on nvidia-container-runtime.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review213171
-----------------------------------------------------------


Fix it, then Ship it!





src/slave/containerizer/mesos/isolators/gpu/volume.cpp
Lines 493 (patched)
<https://reviews.apache.org/r/70016/#comment299042>

    Can you also add a TODO here about relying on `NVIDIA_DRIVER_CAPABILITIES` env var instead in the future?


- Jie Yu


On Feb. 21, 2019, 11:03 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2019, 11:03 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Nvidia's CUDA 10+ images are based on nvidia-container-runtime and thus
> the runtime injection are from the images themselves. To adapt this
> change, we adjusted the binaries and libraries and injected the `PATH`
> and `LD_LIBRARY_PATH` environment variables in the `gpu/nvidia`
> isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/5/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported CUDA 10+ images that are based on nvidia-container-runtime.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/
-----------------------------------------------------------

(Updated Feb. 21, 2019, 11:03 p.m.)


Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.


Changes
-------

Updated the order of library path injection.


Bugs: MESOS-9549
    https://issues.apache.org/jira/browse/MESOS-9549


Repository: mesos


Description
-------

Nvidia's CUDA 10+ images are based on nvidia-container-runtime and thus
the runtime injection are from the images themselves. To adapt this
change, we adjusted the binaries and libraries and injected the `PATH`
and `LD_LIBRARY_PATH` environment variables in the `gpu/nvidia`
isolator.


Diffs (updated)
-----

  src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
  src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
  src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 


Diff: https://reviews.apache.org/r/70016/diff/3/

Changes: https://reviews.apache.org/r/70016/diff/2-3/


Testing
-------

`sudo make check`

More testing done later in chain.


Thanks,

Chun-Hung Hsiao


Re: Review Request 70016: Supported CUDA 10+ images that are based on nvidia-container-runtime.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/
-----------------------------------------------------------

(Updated Feb. 21, 2019, 10:55 p.m.)


Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.


Changes
-------

Addressed comments.


Summary (updated)
-----------------

Supported CUDA 10+ images that are based on nvidia-container-runtime.


Bugs: MESOS-9549
    https://issues.apache.org/jira/browse/MESOS-9549


Repository: mesos


Description (updated)
-------

Nvidia's CUDA 10+ images are based on nvidia-container-runtime and thus
the runtime injection are from the images themselves. To adapt this
change, we adjusted the binaries and libraries and injected the `PATH`
and `LD_LIBRARY_PATH` environment variables in the `gpu/nvidia`
isolator.


Diffs (updated)
-----

  src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
  src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
  src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 


Diff: https://reviews.apache.org/r/70016/diff/2/

Changes: https://reviews.apache.org/r/70016/diff/1-2/


Testing
-------

`sudo make check`

More testing done later in chain.


Thanks,

Chun-Hung Hsiao


Re: Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review212945
-----------------------------------------------------------




src/slave/containerizer/mesos/isolators/gpu/volume.cpp
Lines 496 (patched)
<https://reviews.apache.org/r/70016/#comment298820>

    Should be `tokens[1] == "" || tokens[1] == "..."`.


- Chun-Hung Hsiao


On Feb. 20, 2019, 6:26 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:26 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
> injection that was originally done in the image to its new nvidia
> container runtime. To adapt this change, we adjusted the binaries and
> libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
> variables in the `gpu/nvidia` isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

Posted by Jie Yu <yu...@gmail.com>.

> On Feb. 20, 2019, 6:34 a.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
> > Lines 418 (patched)
> > <https://reviews.apache.org/r/70016/diff/1/?file=2125909#file2125909line418>
> >
> >     Ideally, for PATH and LD_LIBRARY_PATH, we don't overwrite but append. Imagine a docker container that already defines LD_LIBRARY_PATH by itself. Doing this will probably break the image?
> >     
> >     Curious what's your thoughts on this.

Discussed offline. I'll be fine with a TODO here.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review212943
-----------------------------------------------------------


On Feb. 20, 2019, 6:26 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:26 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
> injection that was originally done in the image to its new nvidia
> container runtime. To adapt this change, we adjusted the binaries and
> libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
> variables in the `gpu/nvidia` isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70016: Supported nvidia-docker 2.0 for CUDA 10+.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70016/#review212943
-----------------------------------------------------------




src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 418 (patched)
<https://reviews.apache.org/r/70016/#comment298818>

    Ideally, for PATH and LD_LIBRARY_PATH, we don't overwrite but append. Imagine a docker container that already defines LD_LIBRARY_PATH by itself. Doing this will probably break the image?
    
    Curious what's your thoughts on this.


- Jie Yu


On Feb. 20, 2019, 6:26 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70016/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:26 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Kevin Klues.
> 
> 
> Bugs: MESOS-9549
>     https://issues.apache.org/jira/browse/MESOS-9549
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> nvidia-docker 2.0, which is used by CUDA 10+, moves some of the runtime
> injection that was originally done in the image to its new nvidia
> container runtime. To adapt this change, we adjusted the binaries and
> libraries and injected the `PATH` and `LD_LIBRARY_PATH` environment
> variables in the `gpu/nvidia` isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp f39e7c3d1ccfe097116fe59b05c22fbb3f83b698 
>   src/slave/containerizer/mesos/isolators/gpu/volume.hpp e71fe95234ff10c72cfaa4ad39591f70a531c383 
>   src/slave/containerizer/mesos/isolators/gpu/volume.cpp 0d0d778d6a8467c1ac87286e75d47faf8243afa4 
> 
> 
> Diff: https://reviews.apache.org/r/70016/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>