You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by "jblache (via GitHub)" <gi...@apache.org> on 2023/01/30 19:37:24 UTC

[GitHub] [mesos] jblache opened a new pull request, #454: Support for nvidia MIG in Mesos containerizer

jblache opened a new pull request, #454:
URL: https://github.com/apache/mesos/pull/454

   This PR contributes support for static MIG (multi-instance GPU) configurations in the Mesos containerizer, whereby nvidia GPUs can be divided into isolated instances to better use GPU resources.
   
   With these changes, Mesos will be able to pick up the MIG configuration and properly report available MIG GPUs instead of reporting the underlying, physical GPUs.
   
   I have made an attempt at playing nice with the `--resources` and `--nvidia_gpu_devices` agent options, however more scrutiny/work will definitely be beneficial in this area (we don't use these options in our setup).
   
   In addition to this PR, an update to the NVML library is required. These changes were built and tested with NVML taken from the 460.73.01 driver package.
   
   This code has been running in production on A100 GPUs for several months now and is being contributed back in the hope it will be useful and beneficial to the community. Much more can be done with MIG, and this only scratches the surface.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [mesos] andreaspeters commented on pull request #454: Support for nvidia MIG in Mesos containerizer

Posted by "andreaspeters (via GitHub)" <gi...@apache.org>.
andreaspeters commented on PR #454:
URL: https://github.com/apache/mesos/pull/454#issuecomment-1433082532

   For me it looks fine. How about you @cf-natali @qianzhangxa ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [mesos] jblache commented on pull request #454: Support for nvidia MIG in Mesos containerizer

Posted by "jblache (via GitHub)" <gi...@apache.org>.
jblache commented on PR #454:
URL: https://github.com/apache/mesos/pull/454#issuecomment-1410699238

   Pushed a more granular version of the branch, which at least splits the discovery vs. isolation bits. Original postback was squashed for internal review reasons mostly.
   
   On the approach, the NVML wrapper was updated to add support for the calls needed to discover/query MIG GPUs, and a couple of those added interfaces have a bit of logic included that keeps the calling code cleaner and overall makes sense.
   
   GPU enumeration will check for MIG and enumerate MIG devices if present, instead of the underlying GPU.
   
   The isolator now has to include all the nvidia-caps device nodes, which is a lot of them, but anything more granular seems pretty involved and brittle (IIRC one of the reasons I didn't look into it more is that it would likely get in the the way of a dynamic configuration).
   
   The tricky bit is reconciling device allocations upon restart with running jobs. It turned out to be not too bad, but the code grew accordingly there. I forget the details, but there's a bit more data that needs to be kept around to be able to match everything up, and the matching logic needs to look for MIG vs. not MIG, basically.
   
   This was a tactical patch for us and, as you can see, it sat on my TODO list for bit before I could get it out to you. I don't have bandwidth currently to engage in a full-blown review process, but wanted to get the code out for anyone who might have similar needs, and as groundwork for more evolved capabilities around MIG.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [mesos] bmahler commented on pull request #454: Support for nvidia MIG in Mesos containerizer

Posted by "bmahler (via GitHub)" <gi...@apache.org>.
bmahler commented on PR #454:
URL: https://github.com/apache/mesos/pull/454#issuecomment-1409388019

   Hi @jblache, cool to see this getting upstreamed. A couple of suggestions:
   
   * Can you a more detailed overview of the implementation approach in the PR description? It's a bit hard to dive into reviewing this code from the current description.
   * To ease reviewing, typically we break apart patches. For example, in this PR, the nvml.hpp/cpp files can easily be their own commit for review purposes. I haven't used github PRs for reviewing stacks of commits, but I think if you just split the commits and use multiple in the PR, it should be at least reviewable in chunks, but probably not as directly committable in chunks. If you want to use reviewboard I think there is still support with ./support/post-reviews.py
   
   If others chime in here willing to review, perhaps we can organize a meeting to review more easily. I'm not active in mesos lately but would be willing to provide some feedback here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org