Posted to issues@spark.apache.org by "Xiangrui Meng (Jira)" <ji...@apache.org> on 2020/07/28 16:31:00 UTC

[jira] [Comment Edited] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch

    [ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166532#comment-17166532 ] 

Xiangrui Meng edited comment on SPARK-32429 at 7/28/20, 4:30 PM:
-----------------------------------------------------------------

[~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name, and I wonder if we want to make this more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application should only request one GPU type, but then we would need some configuration to tell which resource name CUDA_VISIBLE_DEVICES should be derived from.
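
To make that concrete, here is a minimal sketch (not actual Spark code; the allocation shape and the idea of a configurable "env resource name" are assumptions for illustration) of how a worker could build CUDA_VISIBLE_DEVICES from whichever resource name is configured:

{code:scala}
// Hypothetical sketch, not actual Spark code.
object ResourceEnvSketch {
  // Assumed shape of the worker-side allocation: resource name -> device addresses.
  type Allocation = Map[String, Seq[String]]

  // Build the CUDA_VISIBLE_DEVICES value from whichever resource name is configured.
  def cudaVisibleDevices(allocation: Allocation, envResourceName: String): Option[String] =
    allocation.get(envResourceName).filter(_.nonEmpty).map(_.mkString(","))

  def main(args: Array[String]): Unit = {
    val allocation: Allocation = Map("gpu-v100" -> Seq("1", "3"))
    // If the (made-up) config points at "gpu-v100", the executor env would get "1,3".
    println(cudaVisibleDevices(allocation, "gpu-v100")) // Some(1,3)
  }
}
{code}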

Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to get consistent device ordering between different processes, even when CUDA_VISIBLE_DEVICES is set to the same value. Not sure if the same setting is used in YARN/k8s.
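
For reference, a minimal sketch of launching a child process with both variables pinned, using scala.sys.process; the helper name and its arguments are hypothetical, not the actual worker launch path:

{code:scala}
// Hypothetical sketch, not the actual Spark worker code.
import scala.sys.process._

def launchWithGpuEnv(command: Seq[String], assignedGpuAddrs: Seq[String]): Process = {
  val extraEnv = Seq(
    // Make device numbering consistent across processes that see the same GPUs.
    "CUDA_DEVICE_ORDER" -> "PCI_BUS_ID",
    // Restrict the child to the assigned devices; inside the child they are renumbered from 0.
    "CUDA_VISIBLE_DEVICES" -> assignedGpuAddrs.mkString(",")
  )
  Process(command, None, extraEnv: _*).run()
}

// e.g. launchWithGpuEnv(Seq("nvidia-smi", "-L"), Seq("0", "2"))
{code}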



> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> ---------------------------------------------------------------------
>
>                 Key: SPARK-32429
>                 URL: https://issues.apache.org/jira/browse/SPARK-32429
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 3.0.0
>            Reporter: Thomas Graves
>            Priority: Major
>
> It would be nice if standalone mode could allow users to set CUDA_VISIBLE_DEVICES before launching an executor. This has multiple benefits:
>  * It provides a kind of isolation, in that the executor can only see the GPUs set there.
>  * If your GPU application doesn't support explicitly setting the GPU device id, setting this makes the assigned GPU look like the default (id 0), so things generally just work without any explicit setting.
>  * New features are being added on newer GPUs that require explicit setting of CUDA_VISIBLE_DEVICES like MIG ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/])
> The code changes to just set this are very small. Once we set it, we would possibly also need to change the GPU addresses, since CUDA_VISIBLE_DEVICES renumbers the visible devices to start from device id 0 again (see the sketch after this description).
> The easiest implementation would just support this specifically, behind a config, and set it when the config is on and GPU resources are allocated.
> Note we probably want to set this same thing when we launch a python process as well, so that it gets the same environment.
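
As a rough illustration of the address remapping mentioned in the description (a hypothetical helper, not actual Spark code): once CUDA_VISIBLE_DEVICES is set to the assigned physical devices, the executor sees them renumbered from 0, so the addresses handed to tasks would need the same remapping.

{code:scala}
// Hypothetical sketch, not actual Spark code.
// The worker assigns physical device addresses, e.g. Seq("2", "5"); after
// CUDA_VISIBLE_DEVICES=2,5 is set, those devices appear as "0" and "1" inside
// the executor (and in any Python worker launched with the same environment).
def remapToLocalIds(assignedAddrs: Seq[String]): Seq[String] =
  assignedAddrs.indices.map(_.toString)

// remapToLocalIds(Seq("2", "5")) == Seq("0", "1")
{code}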


