Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/18 16:49:59 UTC
[GitHub] [incubator-mxnet] yuxihu commented on issue #15578: MXNet cu100 nightly release breaks Horovod integration tests
yuxihu commented on issue #15578: MXNet cu100 nightly release breaks Horovod integration tests
URL: https://github.com/apache/incubator-mxnet/issues/15578#issuecomment-512898093
Ran into another issue:
mxnet-cu100==1.5.0b20190717
```
(mxnet_p36) ubuntu@ip-172-31-21-35:~$ pip install mxnet-cu100 --pre
Collecting mxnet-cu100
Downloading https://files.pythonhosted.org/packages/b7/c4/1e0c4d21ed6dca0890a9db3e23f78c9d8df1651330547c7778aa1a974ea6/mxnet_cu100-1.5.0b20190717-py2.py3-none-manylinux1_x86_64.whl (540.1MB)
100% |████████████████████████████████| 540.1MB 136kB/s
Requirement already satisfied: numpy<2.0.0,>1.16.0 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (1.17.0rc2)
Requirement already satisfied: requests<3,>=2.20.0 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (2.20.0)
Requirement already satisfied: graphviz<0.9.0,>=0.8.1 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (0.8.4)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (3.0.4)
Requirement already satisfied: idna<2.8,>=2.5 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (2.6)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (1.23)
Requirement already satisfied: certifi>=2017.4.17 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (2019.3.9)
Installing collected packages: mxnet-cu100
Successfully installed mxnet-cu100-1.5.0b20190717
You are using pip version 10.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
```
Horovod master branch
```
(mxnet_p36) ubuntu@ip-172-31-21-35:~$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITH_MXNET=1 pip install --no-cache-dir ~/horovod
Processing ./horovod
Requirement already satisfied: cloudpickle in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (0.5.3)
Requirement already satisfied: psutil in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (5.4.5)
Requirement already satisfied: six in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (1.11.0)
Installing collected packages: horovod
Found existing installation: horovod 0.16.4
Uninstalling horovod-0.16.4:
Successfully uninstalled horovod-0.16.4
Running setup.py install for horovod ... done
Successfully installed horovod-0.16.4
You are using pip version 10.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
```
The unit test fails during collection because of an undefined symbol introduced in #15551.
```
(mxnet_p36) ubuntu@ip-172-31-21-35:~$ ~/anaconda3/envs/mxnet_p36/bin/mpirun -np 2 -H localhost:2 pytest horovod/test/test_mxnet.py
============================= test session starts ==============================
platform linux -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
============================= test session starts ==============================
platform linux -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
rootdir: /home/ubuntu/horovod, inifile:
rootdir: /home/ubuntu/horovod, inifile:
plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
collected 0 items / 1 errors
collected 0 items / 1 errors
==================================== ERRORS ====================================
_____________________ ERROR collecting test/test_mxnet.py ______________________
horovod/test/test_mxnet.py:20: in <module>
import horovod.mxnet as hvd
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/__init__.py:25: in <module>
from horovod.mxnet.mpi_ops import allgather
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_ops.py:29: in <module>
_basics = _HorovodBasics(__file__, 'mpi_lib')
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/common/basics.py:27: in __init__
self.MPI_LIB_CTYPES = ctypes.CDLL(full_path, mode=ctypes.RTLD_GLOBAL)
anaconda3/envs/mxnet_p36/lib/python3.6/ctypes/__init__.py:348: in __init__
self._handle = _dlopen(self._name, mode)
E OSError: /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN5mxnet7Context13CudaLibChecksEv
!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!
=========================== 1 error in 1.52 seconds ============================
==================================== ERRORS ====================================
_____________________ ERROR collecting test/test_mxnet.py ______________________
horovod/test/test_mxnet.py:20: in <module>
import horovod.mxnet as hvd
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/__init__.py:25: in <module>
from horovod.mxnet.mpi_ops import allgather
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_ops.py:29: in <module>
_basics = _HorovodBasics(__file__, 'mpi_lib')
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/common/basics.py:27: in __init__
self.MPI_LIB_CTYPES = ctypes.CDLL(full_path, mode=ctypes.RTLD_GLOBAL)
anaconda3/envs/mxnet_p36/lib/python3.6/ctypes/__init__.py:348: in __init__
self._handle = _dlopen(self._name, mode)
E OSError: /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN5mxnet7Context13CudaLibChecksEv
!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!
=========================== 1 error in 1.54 seconds ============================
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[34828,1],0]
Exit code: 2
--------------------------------------------------------------------------
```
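To make the error above easier to read, here is a minimal sketch that demangles the missing symbol and checks whether a given shared library exports it. The helper names `demangle_simple` and `exports_symbol` are hypothetical and are not part of MXNet or Horovod. The demangler only handles the simple `_ZN<len><name>...Ev` shape that appears in this traceback, not the full Itanium C++ ABI.

```python
import ctypes
import ctypes.util


def demangle_simple(symbol):
    """Demangle a nested C++ name that takes no arguments, e.g. _ZN1a1bEv.

    Only the `_ZN<len><name>...Ev` shape is handled (an assumption that
    fits the symbol in the traceback); anything else is returned unchanged.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    pos, parts = 3, []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    # "E" closes the nested name, "v" means an empty parameter list.
    if symbol[pos:] != "Ev":
        return symbol
    return "::".join(parts) + "()"


def exports_symbol(lib_path, symbol):
    """Return True if the shared library at lib_path exports `symbol`."""
    lib = ctypes.CDLL(lib_path)
    # ctypes raises AttributeError for unknown symbols, so hasattr suffices.
    return hasattr(lib, symbol)


missing = "_ZN5mxnet7Context13CudaLibChecksEv"
print(demangle_simple(missing))  # mxnet::Context::CudaLibChecks()
```

Calling `exports_symbol` with the path to the installed `libmxnet.so` and the mangled name should show whether the nightly wheel actually exports the function. A `False` result would be consistent with Horovod having been compiled against headers that declare `mxnet::Context::CudaLibChecks()` while the installed library does not export it, though this sketch alone does not prove that.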