Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/24 16:12:49 UTC
[GitHub] zhangshiyu01 opened a new issue #2959: How to run distributed training using yarn?
URL: https://github.com/apache/incubator-mxnet/issues/2959
../../tools/launch.py -n 2 --launcher yarn python train_mnist.py --network lenet --kv-store dist_sync
Traceback (most recent call last):
File "/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py", line 81, in <module>
main()
File "/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py", line 30, in main
assert cluster is not None, 'need to have DMLC_JOB_CLUSTER'
AssertionError: need to have DMLC_JOB_CLUSTER
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 365, in <lambda>
target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
yarn 2 -------------/usr/local/jdk1.6.0_45/bin/java -cp /usr/local/hadoop-2.4.0/etc/hadoop:/usr/local/hadoop-2.4.0/share/hadoop/common/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/common/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/*:/usr/local/hadoop-2.4.0/contrib/capacity-scheduler/*.jar:/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar org.apache.hadoop.yarn.dmlc.Client -file /data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar -file train_mnist.py -file /data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py -jobname DMLC[nworker=2,nsever=2]:python -tempdir /tmp -queue default ./launcher.py python ./train_mnist.py --network lenet --kv-store dist_sync
16/08/08 15:59:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=ads, access=WRITE, inode="/tmp":hadoop:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:274)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:260)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:241)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5546)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5528)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5493)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3632)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3602)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3576)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:760)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:560)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
```
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2567)
at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2536)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:835)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:831)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:831)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:824)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:595)
at org.apache.hadoop.yarn.dmlc.Client.setupCacheFiles(Client.java:134)
at org.apache.hadoop.yarn.dmlc.Client.run(Client.java:282)
at org.apache.hadoop.yarn.dmlc.Client.main(Client.java:348)
```
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=ads, access=WRITE, inode="/tmp":hadoop:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:274)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:260)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:241)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5546)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5528)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5493)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3632)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3602)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3576)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:760)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:560)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
```
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:502)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2565)
... 11 more
```
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/yarn.py", line 114, in run
subprocess.check_call(cmd, shell=True, env=env)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/jdk1.6.0_45/bin/java -cp /usr/local/hadoop-2.4.0/etc/hadoop:/usr/local/hadoop-2.4.0/share/hadoop/common/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/common/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/*:/usr/local/hadoop-2.4.0/contrib/capacity-scheduler/*.jar:/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar org.apache.hadoop.yarn.dmlc.Client -file /data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar -file train_mnist.py -file /data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py -jobname DMLC[nworker=2,nsever=2]:python -tempdir /tmp -queue default ./launcher.py python ./train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
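For reading the HDFS error in the log above, here is a minimal sketch that parses the `Permission denied` message and reports which permission class (owner, group, or other) applies to the user. It assumes ordinary POSIX-style owner/group/other semantics with no ACLs. The function name `explain_denial` and the `user_groups` parameter are hypothetical helpers for illustration, not part of MXNet, dmlc-core, or Hadoop, and the default (user `ads` not being a member of `supergroup`) is an assumption.

```python
import re

# Matches the message format seen in the log, for example:
# Permission denied: user=ads, access=WRITE, inode="/tmp":hadoop:supergroup:drwxr-xr-x
DENIED = re.compile(
    r'user=(?P<user>[^,]+), access=(?P<access>\w+), '
    r'inode="(?P<path>[^"]*)":(?P<owner>[^:]+):(?P<group>[^:]+):'
    r'(?P<mode>[-a-zA-Z]{10})'
)

# Assumed mapping from the access name in the message to mode letters.
ACCESS_FLAGS = {"READ": "r", "WRITE": "w", "EXECUTE": "x"}


def explain_denial(message, user_groups=()):
    """Return which permission triple applies and whether it grants the access.

    user_groups is the (assumed) set of groups the requesting user belongs to.
    Returns None if the message does not match the expected format.
    """
    m = DENIED.search(message)
    if m is None:
        return None
    mode = m.group("mode")  # mode[0] is the file type, then three rwx triples
    if m.group("user") == m.group("owner"):
        cls, bits = "owner", mode[1:4]
    elif m.group("group") in user_groups:
        cls, bits = "group", mode[4:7]
    else:
        cls, bits = "other", mode[7:10]
    wanted = [ACCESS_FLAGS[p] for p in m.group("access").split("_")
              if p in ACCESS_FLAGS]
    return {
        "user": m.group("user"),
        "path": m.group("path"),
        "class": cls,
        "bits": bits,
        "allowed": all(flag in bits for flag in wanted),
    }


if __name__ == "__main__":
    msg = ('Permission denied: user=ads, access=WRITE, '
           'inode="/tmp":hadoop:supergroup:drwxr-xr-x')
    print(explain_denial(msg))
```

Under those assumptions, user `ads` falls into the "other" class for `/tmp`, whose bits are `r-x`, so WRITE is not granted. That is consistent with the failure at `Client.setupCacheFiles`, where the YARN client, invoked with `-tempdir /tmp`, tries to create a directory under `/tmp` on HDFS.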