Posted to yarn-issues@hadoop.apache.org by "Zhankun Tang (JIRA)" <ji...@apache.org> on 2019/02/01 04:18:00 UTC

[jira] [Commented] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation and use the Nvidia GPU plugin as an example

    [ https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757952#comment-16757952 ] 

Zhankun Tang commented on YARN-9060:
------------------------------------

[~sunilg], thanks for the review!
{quote}In the comments below, it's better to give the majorNumber:minorNumber example on the same line rather than on the line above.
{code:java}
# #for instance, "8:16,8:32"
# devices.denied-numbers=## Blacklisted devices not permitted to use. The format is comma separated "majorNumber:minorNumber". Leave it empty means default devices reported by device plugin are all allowed.
{code}
{quote}
Zhankun=> Fixed.
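For illustration, here is a minimal sketch of how a comma-separated "majorNumber:minorNumber" list such as "8:16,8:32" could be parsed. The class and method names ({{DeniedDeviceParser}}, {{parse}}) are hypothetical and not taken from the patch; the sketch only assumes the format described in the comment above, where an empty value means no device is denied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the YARN-9060 patch.
final class DeniedDeviceParser {

  private DeniedDeviceParser() {
  }

  /**
   * Parses a value such as "8:16,8:32" into {major, minor} pairs.
   * A null or blank value yields an empty list, meaning nothing is denied.
   */
  static List<int[]> parse(String value) {
    List<int[]> denied = new ArrayList<>();
    if (value == null || value.trim().isEmpty()) {
      return denied;
    }
    for (String entry : value.split(",")) {
      // Limit -1 keeps trailing empty strings so "8:16:" is rejected.
      String[] parts = entry.trim().split(":", -1);
      if (parts.length != 2) {
        throw new IllegalArgumentException(
            "Expected majorNumber:minorNumber but got: " + entry);
      }
      // parseInt throws NumberFormatException on non-numeric input.
      int major = Integer.parseInt(parts[0].trim());
      int minor = Integer.parseInt(parts[1].trim());
      denied.add(new int[] {major, minor});
    }
    return denied;
  }

  public static void main(String[] args) {
    for (int[] device : parse("8:16,8:32")) {
      System.out.println("denied major=" + device[0] + " minor=" + device[1]);
    }
  }
}
```

Rejecting malformed entries up front, rather than skipping them silently, makes a typo in the configuration visible at startup.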
{quote}In {{getRegisterRequestInfo}}, could we make "nvidia.com/gpu" a static constant?
{quote}
Zhankun=> Fixed.
{quote}It's better to place NvidiaCommandExecutor under the test package.
{quote}
Zhankun=> Do you mean importing this class from the test package in the Nvidia plugin?
{quote}Thanks for adding detailed logs; they are certainly helpful. Could you also please improve some logs in updateDockerRunCommand to include the container ID or other dynamic values? Please refer to the logs below, which lack dynamic information. Please also check the other methods for the same issue.
 
{quote}
Zhankun=> Fixed.
{quote}requestsDevice ==> requestedDevice
{quote}
Zhankun=> Fixed.
{quote}Is there any chance that getCleanupDockerVolumesCommand won't be called in some failure case? If so, I worry that the cached data structures may cause leaks.
{quote}
Zhankun=> I've changed the ConcurrentHashMap to an LRU cache with a maximum size. Does this make sense?
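As a rough sketch of the idea (not necessarily how the patch implements it), a size-bounded LRU cache can be built on {{java.util.LinkedHashMap}} in access order, so that entries which are never cleaned up are eventually evicted instead of accumulating. The class name {{BoundedLruCache}} and the {{create}} method are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a bounded LRU cache; the patch may use a
// different implementation.
final class BoundedLruCache {

  private BoundedLruCache() {
  }

  /** Returns a thread-safe map that evicts the least recently used entry. */
  static <K, V> Map<K, V> create(final int maxSize) {
    if (maxSize <= 0) {
      throw new IllegalArgumentException("maxSize must be positive");
    }
    // accessOrder=true moves an entry to the end on every get/put,
    // so the eldest entry is the least recently used one.
    Map<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
      }
    };
    // In access order even get() mutates the map, so all access
    // must be synchronized.
    return Collections.synchronizedMap(lru);
  }

  public static void main(String[] args) {
    Map<String, String> cache = create(2);
    cache.put("container_1", "volume_a");
    cache.put("container_2", "volume_b");
    cache.get("container_1");             // container_1 is now most recent
    cache.put("container_3", "volume_c"); // evicts container_2
    System.out.println(cache.keySet());
  }
}
```

The trade-off is that an entry can be evicted before its cleanup call arrives if more containers than the maximum size are in flight, so the bound should be chosen well above the expected number of concurrent containers.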
{quote}In getDeviceType, I prefer a new enum for better readability rather than keeping "c" or "b".
{quote}
Zhankun=> Fixed.
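A minimal sketch of such an enum is shown below, assuming "c" and "b" follow the usual Linux convention of character and block devices. The names {{DeviceType}}, {{getName}} and {{fromName}} are illustrative and may differ from what the patch uses.

```java
// Hypothetical enum for illustration; names may differ from the patch.
enum DeviceType {
  BLOCK("b"),
  CHAR("c");

  private final String name;

  DeviceType(String name) {
    this.name = name;
  }

  /** Returns the single-letter code, "b" or "c". */
  String getName() {
    return name;
  }

  /** Maps a single-letter code back to the enum constant. */
  static DeviceType fromName(String name) {
    for (DeviceType type : values()) {
      if (type.name.equals(name)) {
        return type;
      }
    }
    throw new IllegalArgumentException("Unknown device type: " + name);
  }

  public static void main(String[] args) {
    DeviceType type = DeviceType.fromName("c");
    System.out.println(type + " -> " + type.getName());
  }
}
```

Callers can then compare against {{DeviceType.CHAR}} or {{DeviceType.BLOCK}} instead of raw string literals, and an unexpected code fails fast.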

> [YARN-8851] Phase 1 - Support device isolation and use the Nvidia GPU plugin as an example
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-9060
>                 URL: https://issues.apache.org/jira/browse/YARN-9060
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch, YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch, YARN-9060-trunk.005.patch, YARN-9060-trunk.006.patch, YARN-9060-trunk.007.patch, YARN-9060-trunk.008.patch, YARN-9060-trunk.009.patch, YARN-9060-trunk.010.patch, YARN-9060-trunk.011.patch, YARN-9060-trunk.012.patch, YARN-9060-trunk.013.patch, YARN-9060-trunk.014.patch, YARN-9060-trunk.015.patch, YARN-9060-trunk.016.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot update the values of the devices cgroup controller without root permission ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]). So we need to support this in container-executor for the Java layer to invoke.
> This Jira will have three parts:
>  # native c-e module
>  # Java layer code to isolate devices for container (docker and non-docker)
>  # A sample Nvidia GPU plugin


