Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/18 13:29:52 UTC

[GitHub] ilibx opened a new issue #9477: distributed training with mxnet, get error "ImportError: No module named numpy"

ilibx opened a new issue #9477: distributed training with mxnet, get error "ImportError: No module named numpy"
URL: https://github.com/apache/incubator-mxnet/issues/9477
 
 
   ## Description
   Following the install guide, I built MXNet inside Docker and started multiple containers. I then ran the example on a single container, as below:
   ```bash
   root@mxnet1:~/mxnet/example/image-classification# python train_mnist.py --network lenet
   /root/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
     import OpenSSL.SSL
   INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='lenet', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', test_io=0, top_k=0, wd=0.0001)
   INFO:root:Epoch[0] Batch [100]	Speed: 590.42 samples/sec	accuracy=0.863243
   INFO:root:Epoch[0] Batch [200]	Speed: 536.00 samples/sec	accuracy=0.946562
   INFO:root:Epoch[0] Batch [300]	Speed: 596.90 samples/sec	accuracy=0.973125
   INFO:root:Epoch[0] Batch [400]	Speed: 643.99 samples/sec	accuracy=0.969688
   INFO:root:Epoch[0] Batch [500]	Speed: 538.20 samples/sec	accuracy=0.971875
   INFO:root:Epoch[0] Batch [600]	Speed: 576.18 samples/sec	accuracy=0.969063
   INFO:root:Epoch[0] Batch [700]	Speed: 550.20 samples/sec	accuracy=0.974375
   INFO:root:Epoch[0] Batch [800]	Speed: 522.18 samples/sec	accuracy=0.975156
   INFO:root:Epoch[0] Batch [900]	Speed: 540.79 samples/sec	accuracy=0.977812
   INFO:root:Epoch[0] Train-accuracy=0.979307
   INFO:root:Epoch[0] Time cost=105.686
   INFO:root:Epoch[0] Validation-accuracy=0.985370
   ```
   That worked, so I moved on to training across multiple containers, which I launched like this:
   ```bash
   root@mxnet1:~/mxnet/example/image-classification# ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_sync
   Warning: Permanently added '172.17.3.4' (ECDSA) to the list of known hosts.
   Warning: Permanently added '172.17.3.5' (ECDSA) to the list of known hosts.
   Traceback (most recent call last):
     File "train_mnist.py", line 25, in <module>
       from common import find_mxnet, fit
     File "/root/mxnet/example/image-classification/common/find_mxnet.py", line 24, in <module>
       import mxnet as mx
     File "/root/mxnet/example/image-classification/common/../../../python/mxnet/__init__.py", line 25, in <module>
       from . import engine
     File "/root/mxnet/example/image-classification/common/../../../python/mxnet/engine.py", line 23, in <module>
       from .base import _LIB, check_call
     File "/root/mxnet/example/image-classification/common/../../../python/mxnet/base.py", line 29, in <module>
       import numpy as np
   ImportError: No module named numpy
   Traceback (most recent call last):
     File "train_mnist.py", line 25, in <module>
       from common import find_mxnet, fit
     File "/root/mxnet/example/image-classification/common/find_mxnet.py", line 24, in <module>
       import mxnet as mx
     File "/root/mxnet/example/image-classification/common/../../../python/mxnet/__init__.py", line 25, in <module>
       from . import engine
     File "/root/mxnet/example/image-classification/common/../../../python/mxnet/engine.py", line 23, in <module>
       from .base import _LIB, check_call
     File "/root/mxnet/example/image-classification/common/../../../python/mxnet/base.py", line 29, in <module>
       import numpy as np
   ImportError: No module named numpy
   Exception in thread Thread-3:
   Traceback (most recent call last):
     File "/root/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
       self.run()
     File "/root/anaconda3/lib/python3.6/threading.py", line 864, in run
       self._target(*self._args, **self._kwargs)
     File "/root/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 61, in run
       subprocess.check_call(prog, shell = True)
     File "/root/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
       raise CalledProcessError(retcode, cmd)
   subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 172.17.3.5 -p 22 'export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; export DMLC_PS_ROOT_URI=172.17.3.3; export DMLC_PS_ROOT_PORT=9092; export DMLC_ROLE=worker; cd /root/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 1.
   
   Exception in thread Thread-2:
   Traceback (most recent call last):
     File "/root/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
       self.run()
     File "/root/anaconda3/lib/python3.6/threading.py", line 864, in run
       self._target(*self._args, **self._kwargs)
     File "/root/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 61, in run
       subprocess.check_call(prog, shell = True)
     File "/root/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
       raise CalledProcessError(retcode, cmd)
   subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 172.17.3.4 -p 22 'export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; export DMLC_PS_ROOT_URI=172.17.3.3; export DMLC_PS_ROOT_PORT=9092; export DMLC_ROLE=server; cd /root/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 1.
   ```
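
One detail in the traceback above may help narrow this down. The launcher itself runs under `/root/anaconda3/lib/python3.6`, but Python 3.6 reports a missing module as `ModuleNotFoundError: No module named 'numpy'` (quoted name), whereas the unquoted `ImportError: No module named numpy` is the Python 2 wording. That suggests, without proving it, that `python` on the remote hosts resolves to a different interpreter in a non-interactive SSH session. The following is a minimal sketch for checking this; the helper names `guess_interpreter_family` and `build_probe_command` are hypothetical and not part of MXNet or `launch.py`.

```python
import re
import shlex

# Python 2 prints the missing module name unquoted.
_PY2_STYLE = re.compile(r"^ImportError: No module named ([\w.]+)$")
# Python 3 quotes the name; 3.6+ raises ModuleNotFoundError instead.
_PY3_STYLE = re.compile(
    r"^(?:ImportError|ModuleNotFoundError): No module named '([^']+)'$"
)


def guess_interpreter_family(error_line):
    """Guess 'python2' or 'python3' from the wording of an import error.

    Returns None when the line matches neither known format.
    """
    line = error_line.strip()
    if _PY2_STYLE.match(line):
        return "python2"
    if _PY3_STYLE.match(line):
        return "python3"
    return None


def build_probe_command(host, port=22):
    """Build an ssh argv that reports which `python` a remote host resolves.

    The remote command runs in a non-interactive shell, which is the same
    kind of session the ssh launcher uses in the log above.
    """
    probe = "import sys; print(sys.executable); print(sys.version)"
    remote = "command -v python; python -c " + shlex.quote(probe)
    return ["ssh", "-o", "StrictHostKeyChecking=no", "-p", str(port), host, remote]
```

Passing the result of `build_probe_command("172.17.3.4")` to `subprocess.run` prints the interpreter path and version seen by that host. If the path is not under `/root/anaconda3`, the non-interactive shell probably has a different `PATH` than the interactive login, and calling the Anaconda interpreter by its full path in the launch command would be one thing to try.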
   Environment:
   ```env
    Anaconda 3
    SSH: passwordless login
   ```
   
   MXNet build command:
   ```bash
   make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=1 
   ```
   I don't know how to solve this. Please help me fix it.
   
   Thanks a lot.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services