Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/02/11 09:00:18 UTC

[GitHub] TPchanger opened a new issue #14114: distributed training port bind error

TPchanger opened a new issue #14114: distributed training  port bind error
URL: https://github.com/apache/incubator-mxnet/issues/14114
 
 
   ## Description
   I am trying to run distributed training on two Ubuntu servers. Each has one GPU, though that is probably not related to the problem.
   
   I installed mxnet-cu90 with pip, and I also cloned the MXNet repository (https://github.com/apache/incubator-mxnet) into my home directory.
   
   The command is simple: `~/incubator-mxnet/tools/launch.py -H host -n 2 python3 store.py`
   or `~/incubator-mxnet/tools/launch.py -H host -n 2 python3 image-classificatioin.py` with some other network configuration options.
   
   The `host` file contains:
   "
   server1
   server2
   "
   Both servers are reachable over SSH without a password.
   ## Environment info (Required)
   Two Ubuntu 16.04 servers, each with one GPU.
   
   ## Error Message:
   ```
   Traceback (most recent call last):
     File "store.py", line 3, in <module>
       store = kv.create('dist')
     File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore.py", line 674, in create
       ctypes.byref(handle)))
     File "/usr/local/lib/python3.5/dist-packages/mxnet/base.py", line 251, in check_call
       raise MXNetError(py_str(LIB.MXGetLastError()))
   mxnet.base.MXNetError: [16:33:33] src/van.cc:291: Check failed: (my_node.port) != (-1) bind failed
   ```
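
The message above says that a bind failed. One way to narrow this down outside MXNet is to check whether an ordinary TCP socket can be bound on each server. The following is only a minimal sketch under that assumption: `can_bind` is a hypothetical helper, not part of MXNet or ps-lite, and it may not exercise the same address or port that the library tries to use.

```python
import socket


def can_bind(host, port=0):
    """Return the bound port if a TCP socket can bind to (host, port), else None.

    Hypothetical diagnostic helper; port 0 asks the OS for any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return sock.getsockname()[1]
    except OSError:
        # Covers "address in use", "cannot assign requested address",
        # and hostname resolution failures (socket.gaierror).
        return None
    finally:
        sock.close()


if __name__ == "__main__":
    print("wildcard address:", can_bind("0.0.0.0"))
    print("own hostname:", can_bind(socket.gethostname()))
```

If the second call prints `None` on either server, the hostname may not resolve to an address that belongs to a local interface, which would be worth checking before looking further into MXNet itself.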
   ## Minimum reproducible example
   See the `store.py` code at the end of this issue.
   
   ## Steps to reproduce
   1. `~/incubator-mxnet/tools/launch.py -H host -n 2 python3 store.py`
   2. `~/incubator-mxnet/tools/launch.py -H host -n 2 python3 image-classificatioin.py`
   
   ## What have you tried to solve it?
   
   1. I looked at https://stackoverflow.com/questions/6024003/why-doesnt-zeromq-work-on-localhost but I can't find similar code in my example.
   
   
   ## store.py code ##
   from mxnet import kv, nd
   store = kv.create('dist')
   shape = (2, 3)
   x = nd.random_uniform(shape=shape)
   store.init('weight', x)
   print('=== init "weight" ==={}'.format(x))
   from mxnet import gpu, cpu
   ctx = [gpu(0), cpu(0)]
   y = [nd.zeros(shape, ctx=c) for c in ctx]
   store.pull('weight', out=y)
   print('=== pull "weight" to {} ===\n{}'.format(ctx, y))
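
Since `kv.create('dist')` fails before anything else runs, it may help to see what each launched process receives from `launch.py`. The sketch below prints the `DMLC_*` environment variables that, to my understanding, the launcher sets and ps-lite reads (role, scheduler address and port, worker and server counts, plus the optional interface and node host). The helper name `dmlc_env` is hypothetical, and the variable list is an assumption that may be incomplete for a given MXNet version.

```python
import os

# Variables assumed to be set by the launcher or read by ps-lite.
DMLC_VARS = [
    "DMLC_ROLE",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_NUM_WORKER",
    "DMLC_NUM_SERVER",
    "DMLC_INTERFACE",
    "DMLC_NODE_HOST",
]


def dmlc_env(environ=os.environ):
    """Return the DMLC-related variables, marking missing ones as '<unset>'."""
    return {name: environ.get(name, "<unset>") for name in DMLC_VARS}


if __name__ == "__main__":
    for name, value in dmlc_env().items():
        print("{}={}".format(name, value))
```

Saved under a hypothetical name such as `print_env.py`, it could be run in place of `store.py` with the same `launch.py` command to compare what the two servers see.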

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services