Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/18 13:29:52 UTC
[GitHub] ilibx opened a new issue #9477: distributed trainning with mxnet, get error "ImportError: No module named numpy"
ilibx opened a new issue #9477: distributed trainning with mxnet, get error "ImportError: No module named numpy"
URL: https://github.com/apache/incubator-mxnet/issues/9477
## Description
Following the install guide, I built MXNet in Docker and started multiple containers. I then ran the example as shown below:
```bash
root@mxnet1:~/mxnet/example/image-classification# python train_mnist.py --network lenet
/root/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
import OpenSSL.SSL
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='lenet', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', test_io=0, top_k=0, wd=0.0001)
INFO:root:Epoch[0] Batch [100] Speed: 590.42 samples/sec accuracy=0.863243
INFO:root:Epoch[0] Batch [200] Speed: 536.00 samples/sec accuracy=0.946562
INFO:root:Epoch[0] Batch [300] Speed: 596.90 samples/sec accuracy=0.973125
INFO:root:Epoch[0] Batch [400] Speed: 643.99 samples/sec accuracy=0.969688
INFO:root:Epoch[0] Batch [500] Speed: 538.20 samples/sec accuracy=0.971875
INFO:root:Epoch[0] Batch [600] Speed: 576.18 samples/sec accuracy=0.969063
INFO:root:Epoch[0] Batch [700] Speed: 550.20 samples/sec accuracy=0.974375
INFO:root:Epoch[0] Batch [800] Speed: 522.18 samples/sec accuracy=0.975156
INFO:root:Epoch[0] Batch [900] Speed: 540.79 samples/sec accuracy=0.977812
INFO:root:Epoch[0] Train-accuracy=0.979307
INFO:root:Epoch[0] Time cost=105.686
INFO:root:Epoch[0] Validation-accuracy=0.985370
```
Next, I tried training across multiple containers, like this:
```bash
root@mxnet1:~/mxnet/example/image-classification# ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_sync
Warning: Permanently added '172.17.3.4' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.17.3.5' (ECDSA) to the list of known hosts.
Traceback (most recent call last):
  File "train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/root/mxnet/example/image-classification/common/find_mxnet.py", line 24, in <module>
    import mxnet as mx
  File "/root/mxnet/example/image-classification/common/../../../python/mxnet/__init__.py", line 25, in <module>
    from . import engine
  File "/root/mxnet/example/image-classification/common/../../../python/mxnet/engine.py", line 23, in <module>
    from .base import _LIB, check_call
  File "/root/mxnet/example/image-classification/common/../../../python/mxnet/base.py", line 29, in <module>
    import numpy as np
ImportError: No module named numpy
Traceback (most recent call last):
  File "train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/root/mxnet/example/image-classification/common/find_mxnet.py", line 24, in <module>
    import mxnet as mx
  File "/root/mxnet/example/image-classification/common/../../../python/mxnet/__init__.py", line 25, in <module>
    from . import engine
  File "/root/mxnet/example/image-classification/common/../../../python/mxnet/engine.py", line 23, in <module>
    from .base import _LIB, check_call
  File "/root/mxnet/example/image-classification/common/../../../python/mxnet/base.py", line 29, in <module>
    import numpy as np
ImportError: No module named numpy
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/root/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 61, in run
    subprocess.check_call(prog, shell = True)
  File "/root/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 172.17.3.5 -p 22 'export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; export DMLC_PS_ROOT_URI=172.17.3.3; export DMLC_PS_ROOT_PORT=9092; export DMLC_ROLE=worker; cd /root/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 1.
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/root/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 61, in run
    subprocess.check_call(prog, shell = True)
  File "/root/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 172.17.3.4 -p 22 'export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; export DMLC_PS_ROOT_URI=172.17.3.3; export DMLC_PS_ROOT_PORT=9092; export DMLC_ROLE=server; cd /root/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 1.
```
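A possible diagnostic, offered as a sketch rather than a confirmed cause: the remote error reads `ImportError: No module named numpy`, which is the wording Python 2 uses, whereas Python 3.6 would raise `ModuleNotFoundError: No module named 'numpy'`. That suggests the bare `python` in the non-interactive ssh session may not resolve to the Anaconda 3.6 interpreter seen in the local tracebacks. The script below prints ssh commands that report which interpreter each host actually runs and whether it can import numpy. The helper names (`build_probe_command`, `probe`) are hypothetical, the host IPs are copied from the log, and `/root/anaconda3/bin/python` is an assumed location based on the `/root/anaconda3/lib/python3.6` paths in the log.

```python
import shlex
import subprocess

# One-liner executed on the remote host. It is valid on both Python 2 and 3:
# prints the interpreter path and major version, then tries to import numpy.
PROBE = (
    "import sys; print(sys.executable); print(sys.version_info[0]); "
    "import numpy; print(numpy.__version__)"
)


def build_probe_command(host, python="python", port=22):
    """Build an ssh argv that runs PROBE with the given interpreter name.

    The default is the bare `python`, the same name the tracker used in the
    failing command, so the probe sees the same PATH lookup.
    """
    remote = "{} -c {}".format(python, shlex.quote(PROBE))
    return ["ssh", "-o", "StrictHostKeyChecking=no", "-p", str(port), host, remote]


def probe(host, **kwargs):
    """Run the probe over ssh and return (exit code, combined output)."""
    proc = subprocess.run(
        build_probe_command(host, **kwargs),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


# Print copy-pasteable commands instead of connecting automatically.
for host in ("172.17.3.4", "172.17.3.5"):  # worker/server IPs from the log
    print(" ".join(shlex.quote(arg) for arg in build_probe_command(host)))
    # Assumed Anaconda location; adjust if the install lives elsewhere.
    print(" ".join(shlex.quote(arg) for arg in
                   build_probe_command(host, python="/root/anaconda3/bin/python")))
```

If the bare `python` probe reports a different interpreter than the Anaconda one, passing the full interpreter path to `launch.py` in place of `python` may be worth trying; this is an untested suggestion, not a verified fix.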
Environment:
```env
anaconda 3
ssh - passwordless login
```
MXNet build command:
```bash
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=1
```
I don't know how to solve this. Please help me fix it.
Thanks a lot.