Posted to user@hadoop.apache.org by zxcs <zh...@163.com> on 2022/10/31 14:21:00 UTC

issue when enable gpu isolation

Hi, experts,

We are using hadoop-3.3.0 and are trying to enable GPU isolation, in addition to CPU, by following the guide at https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/UsingGpus.html

However, whenever we start a YARN job, the NodeManager fails with "Unexpected operation code: -1". Could any experts shed some light on this? Thanks in advance!

(Sorry for using pictures; we are not allowed to copy anything from the testbed to the outside.)





Here is the yarn-site.xml config:
<property>
<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
<property>
<name>yarn.nodemanager.resource-plugins</name>
<value>yarn.io/gpu</value>
</property>
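
A quick way to confirm that a yarn-site.xml like this one is well-formed and sets both GPU-related properties is sketched below. This is only an illustration: `read_properties` and `missing_gpu_settings` are hypothetical helper names, not part of Hadoop, and the sketch assumes the properties are wrapped in the usual `<configuration>` root element.

```python
import xml.etree.ElementTree as ET

# The two properties shown in the post; both are expected to list yarn.io/gpu.
REQUIRED = ("yarn.resource-types", "yarn.nodemanager.resource-plugins")


def read_properties(xml_text):
    """Parse yarn-site.xml text into a {name: value} dict.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed,
    for example when a closing tag is written as '< /name>'.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def missing_gpu_settings(props):
    """Return the required property names that do not list yarn.io/gpu."""
    missing = []
    for key in REQUIRED:
        values = [v.strip() for v in props.get(key, "").split(",")]
        if "yarn.io/gpu" not in values:
            missing.append(key)
    return missing


if __name__ == "__main__":
    sample = """<configuration>
    <property><name>yarn.resource-types</name><value>yarn.io/gpu</value></property>
    <property><name>yarn.nodemanager.resource-plugins</name><value>yarn.io/gpu</value></property>
    </configuration>"""
    print(missing_gpu_settings(read_properties(sample)))
```

If the parser raises `ParseError`, the file itself needs fixing before any GPU setting can take effect.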

And below is container-executor.cfg:
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=root
min.user.id=500
allowed.system.users=yarn
[gpu]
module.enabled=true
[cgroups]
root=/sys/fs/cgroup
yarn-hierarchy=yarn
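
The settings above can be sanity-checked with a short script. The following is a minimal sketch, not Hadoop's own parser: `load_cfg` and `check_gpu_settings` are hypothetical names, and the placeholder `[top]` section is an assumption made only so that Python's `configparser` accepts the keys that appear before the first section header.

```python
import configparser


def load_cfg(text):
    """Parse container-executor.cfg-style text with configparser.

    Lines are stripped first so that indented keys are not mistaken for
    continuation lines, and a placeholder [top] section is prepended for
    the keys that precede the first real section.
    """
    cleaned = "\n".join(line.strip() for line in text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string("[top]\n" + cleaned)
    return parser


def check_gpu_settings(parser):
    """Return a list of problems with the GPU and cgroups settings."""
    problems = []
    enabled = parser.get("gpu", "module.enabled", fallback="false")
    if enabled.strip().lower() != "true":
        problems.append("[gpu] module.enabled is not true")
    for key in ("root", "yarn-hierarchy"):
        if not parser.get("cgroups", key, fallback="").strip():
            problems.append("[cgroups] " + key + " is missing")
    return problems


if __name__ == "__main__":
    sample = """
         yarn.nodemanager.linux-container-executor.group=hadoop
    banned.users=root
    min.user.id=500
    allowed.system.users=yarn
    [gpu]
    module.enabled=true
    [cgroups]
    root=/sys/fs/cgroup
    yarn-hierarchy=yarn
    """
    print(check_gpu_settings(load_cfg(sample)))
```

An empty list only means the keys shown in this message are present; it says nothing about whether the cgroup paths exist or are writable on the node.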

Below is the directory listing of /sys/fs/cgroup (screenshot attached to the original message).



Re: issue when enable gpu isolation

Posted by zxcs <zh...@163.com>.
Also, when we run the container-executor command directly to write entries into devices.deny, it reports the same "Unexpected operation code" error.

test@ip:/opt/hadoop-3.3.0$ sudo -u yarn /opt/hadoop-3.3.0/bin/container-executor --module-gpu --container_id container_e57_1667177358230_0650_01_000001 -excluded_gpus 1,2,3,4,5,6,7
[sudo] password for alpha:
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_000001/devices.deny, value=c 195:1 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_000001/devices.deny, value=c 195:2 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_000001/devices.deny, value=c 195:3 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_000001/devices.deny, value=c 195:4 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_000001/devices.deny, value=c 195:5 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_000001/devices.deny, value=c 195:6 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_000001/devices.deny, value=c 195:7 rwm
Unexpected operation code: -1
Nonzero exit code=3, error message='Invalid command provided'
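
Judging only from the log output above, each excluded GPU index seems to turn into one devices.deny entry of the form `c 195:<index> rwm` under the container's cgroup directory. The sketch below rebuilds those path/value pairs so they can be compared against what exists on the node. It is a hypothetical helper, not Hadoop code: the function name `deny_entries`, its parameters, and the assumption that the minor number equals the GPU index are all inferred from the log lines in this message.

```python
# Character-device major number that appears in the log output above.
NVIDIA_MAJOR = 195


def deny_entries(excluded_gpus, container_id,
                 cgroup_root="/sys/fs/cgroup", hierarchy="yarn"):
    """Return (path, value) pairs matching the 'Updating cgroups' log lines.

    cgroup_root and hierarchy default to the values from the [cgroups]
    section of the container-executor.cfg shown earlier in the thread.
    """
    path = "{}/devices/{}/{}/devices.deny".format(
        cgroup_root, hierarchy, container_id)
    return [(path, "c {}:{} rwm".format(NVIDIA_MAJOR, minor))
            for minor in excluded_gpus]


if __name__ == "__main__":
    container = "container_e57_1667177358230_0650_01_000001"
    for path, value in deny_entries(range(1, 8), container):
        print("path={}, value={}".format(path, value))
```

Comparing these paths with the real directory tree under /sys/fs/cgroup/devices can show whether the container directory and its devices.deny file are present at all.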


Thanks,
Xiong

