Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/24 16:12:49 UTC

[GitHub] zhangshiyu01 opened a new issue #2959: How to run distributed training using yarn?
URL: https://github.com/apache/incubator-mxnet/issues/2959
 
 
    ../../tools/launch.py -n 2 --launcher yarn python train_mnist.py --network lenet --kv-store dist_sync
   
   Traceback (most recent call last):
     File "/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py", line 81, in <module>
       main()
     File "/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py", line 30, in main
       assert cluster is not None, 'need to have DMLC_JOB_CLUSTER'
   AssertionError: need to have DMLC_JOB_CLUSTER
   Exception in thread Thread-1:
   Traceback (most recent call last):
     File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
       self.run()
     File "/usr/lib/python2.7/threading.py", line 754, in run
        self.__target(*self.__args, **self.__kwargs)
     File "/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 365, in <lambda>
       target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
     File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
       raise CalledProcessError(retcode, cmd)
   CalledProcessError: Command '/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
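   The assertion above comes from launcher.py's environment guard: the tracker is expected to inject DMLC_JOB_CLUSTER into every launched task's environment, so the assertion firing on the submitting host usually means launcher.py ran outside a YARN container. A hedged reconstruction of the guard (illustrative, not the exact dmlc-core source):

   ```python
   import os

   def check_cluster(env=None):
       # Reconstruction of launcher.py's check: the tracker must set
       # DMLC_JOB_CLUSTER (e.g. to "yarn") before each task starts.
       env = os.environ if env is None else env
       cluster = env.get('DMLC_JOB_CLUSTER')
       assert cluster is not None, 'need to have DMLC_JOB_CLUSTER'
       return cluster

   # When the tracker has set the variable, the check passes:
   print(check_cluster({'DMLC_JOB_CLUSTER': 'yarn'}))  # prints: yarn
   ```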
   
    yarn 2 -------------/usr/local/jdk1.6.0_45/bin/java -cp /usr/local/hadoop-2.4.0/etc/hadoop:/usr/local/hadoop-2.4.0/share/hadoop/common/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/common/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/*:/usr/local/hadoop-2.4.0/contrib/capacity-scheduler/*.jar:/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar org.apache.hadoop.yarn.dmlc.Client  -file /data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar -file train_mnist.py -file /data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py -jobname DMLC[nworker=2,nsever=2]:python  -tempdir /tmp  -queue default ./launcher.py python ./train_mnist.py --network lenet --kv-store dist_sync
   
   16/08/08 15:59:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=ads, access=WRITE, inode="/tmp":hadoop:supergroup:drwxr-xr-x
       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:274)
       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:260)
       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:241)
       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5546)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5528)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5493)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3632)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3602)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3576)
       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:760)
       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:560)
       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:396)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
   
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2567)
        at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2536)
        at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:835)
        at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:831)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:831)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:824)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:595)
        at org.apache.hadoop.yarn.dmlc.Client.setupCacheFiles(Client.java:134)
        at org.apache.hadoop.yarn.dmlc.Client.run(Client.java:282)
        at org.apache.hadoop.yarn.dmlc.Client.main(Client.java:348)
   
   Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=ads, access=WRITE, inode="/tmp":hadoop:supergroup:drwxr-xr-x
       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:274)
       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:260)
       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:241)
       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5546)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5528)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5493)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3632)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3602)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3576)
       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:760)
       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:560)
       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:396)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
   
        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
        at org.apache.hadoop.ipc.Client.call(Client.java:1363)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:502)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
        at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2565)
        ... 11 more
   
   Exception in thread Thread-2:
   Traceback (most recent call last):
     File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
       self.run()
     File "/usr/lib/python2.7/threading.py", line 754, in run
        self.__target(*self.__args, **self.__kwargs)
     File "/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/yarn.py", line 114, in run
       subprocess.check_call(cmd, shell=True, env=env)
     File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
       raise CalledProcessError(retcode, cmd)
    CalledProcessError: Command '/usr/local/jdk1.6.0_45/bin/java -cp /usr/local/hadoop-2.4.0/etc/hadoop:/usr/local/hadoop-2.4.0/share/hadoop/common/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/common/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/*:/usr/local/hadoop-2.4.0/contrib/capacity-scheduler/*.jar:/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar org.apache.hadoop.yarn.dmlc.Client  -file /data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar -file train_mnist.py -file /data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py -jobname DMLC[nworker=2,nsever=2]:python  -tempdir /tmp  -queue default ./launcher.py python ./train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
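   The underlying failure is the HDFS permission error above: user `ads` has no write access to `/tmp` (owned by `hadoop:supergroup`, mode `drwxr-xr-x`), which the dmlc YARN client uses as its `-tempdir` for staging cache files. A possible fix, sketched under the assumption that either an HDFS superuser is available or the user has a writable HDFS home (the directory name below is illustrative):

   ```shell
   # Option 1 (run as the HDFS superuser): give /tmp the conventional
   # sticky, world-writable mode used for shared HDFS temp directories.
   hdfs dfs -chmod 1777 /tmp

   # Option 2 (no admin rights): create a staging directory under the
   # user's own HDFS home instead.
   hdfs dfs -mkdir -p /user/ads/dmlc-tmp
   ```

   With option 2, the staging path still has to reach the YARN client's `-tempdir` argument; how launch.py surfaces that option depends on the dmlc-core version, so check the tracker's YARN options.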
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services