Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/23 03:00:50 UTC

[GitHub] Feywell opened a new issue #9186: How to train model with multi machines

URL: https://github.com/apache/incubator-mxnet/issues/9186
 
 
   ## Description
   I want to train a CNN across multiple machines.
   Hardware: cluster (Tesla K20)
   System: Red Hat Enterprise Linux Server release 6.4 (Santiago)
   When I try to run `example/image-classification/train_cifar10.py`
   **using this command:**
   
   > python ../../tools/launch.py -n 4 -H hosts \
       python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3 \
       --kv-store dist_device_sync 2>&1|tee train_cifar10_log

   **I get the following error:**
   
   > Traceback (most recent call last):
     File "train_cifar10.py", line 19, in <module>
       import argparse
   ImportError: No module named argparse
   Exception in thread Thread-6:
   Traceback (most recent call last):
     File "/home/liyang/anaconda2-5.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
       self.run()
     File "/home/liyang/anaconda2-5.0/lib/python2.7/threading.py", line 754, in run
       self.__target(*self.__args, **self.__kwargs)
     File "/home/liyang/incubator-mxnet-master/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 61, in run
       subprocess.check_call(prog, shell = True)
     File "/home/liyang/anaconda2-5.0/lib/python2.7/subprocess.py", line 186, in check_call
       raise CalledProcessError(retcode, cmd)
   CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 172.16.1.182 -p 22 'export LD_LIBRARY_PATH=/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/composer_xe_2013.3.163/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013.3.163/mpirt/lib/intel64:/opt/intel/composer_xe_2013.3.163/ipp/../compiler/lib/intel64:/opt/intel/composer_xe_2013.3.163/ipp/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013.3.163/compiler/lib/intel64:/opt/intel/composer_xe_2013.3.163/mkl/lib/intel64:/opt/intel/composer_xe_2013.3.163/tbb/lib/intel64/gcc4.1:/home/liyang/usr/lib:/home/liyang/usr/local/lib:/home/liyang/usr/libtool/lib:/home/liyang/usr/local/gcc-5.4.0/lib64:/home/liyang/usr/local/gcc-5.4.0/lib:/home/liyang/usr/local/perl/lib:/usr/local/cuda-8.0/lib64:/home/liyang/anaconda2-5.0/lib::/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/impi/4.1.1.036/mic/lib; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9091; export DMLC_PS_ROOT_URI=172.16.1.183; export DMLC_NUM_SERVER=4; export DMLC_NUM_WORKER=4; cd /home/liyang/incubator-mxnet-master/example/image-classification/; python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3 --kv-store dist_device_sync'' returned non-zero exit status 1
   
   **I checked my Python packages, and argparse is installed.**
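
One thing that may be worth checking: the ImportError is raised by the `python` that ssh starts on the worker, which is not necessarily the Anaconda Python 2.7.13 listed in the environment info. argparse joined the standard library in Python 2.7, so an older system interpreter picked up by a non-interactive ssh shell would fail in exactly this way. The sketch below only builds and runs a small probe over ssh; the helper names (`parse_hosts`, `build_probe_command`, `probe_host`) are hypothetical, and the hosts file is assumed to hold one host per line.

```python
import subprocess


def parse_hosts(lines):
    """Return host names from hosts-file lines (assumed: one host per line)."""
    hosts = []
    for line in lines:
        line = line.strip()
        # Skipping blank lines and '#' comments is an assumption for this sketch.
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


def build_probe_command(host, port=22, python="python"):
    """Build an ssh command that reports the remote interpreter and imports argparse."""
    probe = "import sys; print(sys.executable); print(sys.version); import argparse"
    remote = '{0} -c "{1}"'.format(python, probe)
    return ["ssh", "-o", "StrictHostKeyChecking=no", "-p", str(port), host, remote]


def probe_host(host, port=22, python="python"):
    """Run the probe on one host; return (exit code, combined output)."""
    proc = subprocess.Popen(build_probe_command(host, port, python),
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    return proc.returncode, output.decode("utf-8", "replace")


# Example usage (needs passwordless ssh to every worker):
# with open("hosts") as f:
#     for host in parse_hosts(f):
#         code, text = probe_host(host)
#         print("%s: exit status %d" % (host, code))
#         print(text)
```

If a host prints an interpreter path outside the Anaconda directory, or exits with a non-zero status, the worker is probably resolving a different `python` than the one used interactively. Passing an absolute interpreter path as `python=` gives a quick comparison.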
   
   ## Environment info (Required)
   
   > ----------Python Info----------
   > ('Version      :', '2.7.13')
   > ('Compiler     :', 'GCC 7.2.0')
   > ('Build        :', ('default', 'Sep 22 2017 00:47:24'))
   > ('Arch         :', ('64bit', ''))
   > ------------Pip Info-----------
   > ('Version      :', '9.0.1')
   > ('Directory    :', '/home/liyang/anaconda2-5.0/lib/python2.7/site-packages/pip')
   > ----------MXNet Info-----------
   > /home/liyang/anaconda2-5.0/lib/python2.7/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
   >   import OpenSSL.SSL
   > ('Version      :', '1.0.0')
   > ('Directory    :', '/home/liyang/anaconda2-5.0/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet')
   > Traceback (most recent call last):
   >   File "diagnose.py", line 171, in <module>
   >     check_mxnet()
   >   File "diagnose.py", line 113, in check_mxnet
   >     except FileNotFoundError:
   > NameError: global name 'FileNotFoundError' is not defined
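
A side note on the `NameError` above: `FileNotFoundError` is a Python 3 built-in and does not exist in Python 2.7, which is why the diagnose script stops at that line. A minimal sketch of a version-neutral fallback follows; `MissingFileError` and `read_first_line` are hypothetical names for illustration and are not taken from `diagnose.py`.

```python
try:
    MissingFileError = FileNotFoundError  # Python 3
except NameError:
    # Python 2: open() raises IOError for a missing file.
    # IOError is broader, so this also catches e.g. permission errors.
    MissingFileError = IOError


def read_first_line(path):
    """Return the first line of `path`, or None if the file cannot be found."""
    try:
        with open(path) as f:
            return f.readline().rstrip("\n")
    except MissingFileError:
        return None
```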
   
   
   Package used (Python/R/Scala/Julia):
   (I'm using Python)
   
   ## Build info (Required if built from source)
   
   Compiler: **GCC (5.4.0)**
   
   Build config:
   command:  
   
   > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_DIST_KVSTORE=1
   
   **I don't know what is causing this error. How can I fix it?
   Thank you!**
