Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/18 16:49:59 UTC

[GitHub] [incubator-mxnet] yuxihu commented on issue #15578: MXNet cu100 nightly release breaks Horovod integration tests

URL: https://github.com/apache/incubator-mxnet/issues/15578#issuecomment-512898093
 
 
   Ran into another issue:
   
   mxnet-cu100==1.5.0b20190717
   ```
   (mxnet_p36) ubuntu@ip-172-31-21-35:~$ pip install mxnet-cu100 --pre
   Collecting mxnet-cu100
     Downloading https://files.pythonhosted.org/packages/b7/c4/1e0c4d21ed6dca0890a9db3e23f78c9d8df1651330547c7778aa1a974ea6/mxnet_cu100-1.5.0b20190717-py2.py3-none-manylinux1_x86_64.whl (540.1MB)
       100% |████████████████████████████████| 540.1MB 136kB/s
   Requirement already satisfied: numpy<2.0.0,>1.16.0 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (1.17.0rc2)
   Requirement already satisfied: requests<3,>=2.20.0 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (2.20.0)
   Requirement already satisfied: graphviz<0.9.0,>=0.8.1 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (0.8.4)
   Requirement already satisfied: chardet<3.1.0,>=3.0.2 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (3.0.4)
   Requirement already satisfied: idna<2.8,>=2.5 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (2.6)
   Requirement already satisfied: urllib3<1.25,>=1.21.1 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (1.23)
   Requirement already satisfied: certifi>=2017.4.17 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (2019.3.9)
   Installing collected packages: mxnet-cu100
   Successfully installed mxnet-cu100-1.5.0b20190717
   You are using pip version 10.0.1, however version 19.1.1 is available.
   You should consider upgrading via the 'pip install --upgrade pip' command.
   ```
   Horovod master branch
   ```
   (mxnet_p36) ubuntu@ip-172-31-21-35:~$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITH_MXNET=1 pip install --no-cache-dir ~/horovod
   Processing ./horovod
   Requirement already satisfied: cloudpickle in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (0.5.3)
   Requirement already satisfied: psutil in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (5.4.5)
   Requirement already satisfied: six in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (1.11.0)
   Installing collected packages: horovod
     Found existing installation: horovod 0.16.4
       Uninstalling horovod-0.16.4:
         Successfully uninstalled horovod-0.16.4
     Running setup.py install for horovod ... done
   Successfully installed horovod-0.16.4
   You are using pip version 10.0.1, however version 19.1.1 is available.
   You should consider upgrading via the 'pip install --upgrade pip' command.
   ```
   The unit tests fail at import time because of an undefined symbol introduced in #15551.
   ```
   (mxnet_p36) ubuntu@ip-172-31-21-35:~$ ~/anaconda3/envs/mxnet_p36/bin/mpirun -np 2 -H localhost:2 pytest horovod/test/test_mxnet.py
   ============================= test session starts ==============================
   platform linux -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
   ============================= test session starts ==============================
   platform linux -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
   rootdir: /home/ubuntu/horovod, inifile:
   rootdir: /home/ubuntu/horovod, inifile:
   plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
   plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
   collected 0 items / 1 errors
   collected 0 items / 1 errors
   
   ==================================== ERRORS ====================================
   _____________________ ERROR collecting test/test_mxnet.py ______________________
   horovod/test/test_mxnet.py:20: in <module>
       import horovod.mxnet as hvd
   anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/__init__.py:25: in <module>
       from horovod.mxnet.mpi_ops import allgather
   anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_ops.py:29: in <module>
       _basics = _HorovodBasics(__file__, 'mpi_lib')
   anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/common/basics.py:27: in __init__
       self.MPI_LIB_CTYPES = ctypes.CDLL(full_path, mode=ctypes.RTLD_GLOBAL)
   anaconda3/envs/mxnet_p36/lib/python3.6/ctypes/__init__.py:348: in __init__
       self._handle = _dlopen(self._name, mode)
   E   OSError: /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN5mxnet7Context13CudaLibChecksEv
   !!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!
   =========================== 1 error in 1.52 seconds ============================
   
   ==================================== ERRORS ====================================
   _____________________ ERROR collecting test/test_mxnet.py ______________________
   horovod/test/test_mxnet.py:20: in <module>
       import horovod.mxnet as hvd
   anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/__init__.py:25: in <module>
       from horovod.mxnet.mpi_ops import allgather
   anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_ops.py:29: in <module>
       _basics = _HorovodBasics(__file__, 'mpi_lib')
   anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/common/basics.py:27: in __init__
       self.MPI_LIB_CTYPES = ctypes.CDLL(full_path, mode=ctypes.RTLD_GLOBAL)
   anaconda3/envs/mxnet_p36/lib/python3.6/ctypes/__init__.py:348: in __init__
       self._handle = _dlopen(self._name, mode)
   E   OSError: /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN5mxnet7Context13CudaLibChecksEv
   !!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!
   =========================== 1 error in 1.54 seconds ============================
   -------------------------------------------------------
   Primary job  terminated normally, but 1 process returned
   a non-zero exit code. Per user-direction, the job has been aborted.
   -------------------------------------------------------
   --------------------------------------------------------------------------
   mpirun detected that one or more processes exited with non-zero status, thus causing
   the job to be terminated. The first process to do so was:
   
     Process name: [[34828,1],0]
     Exit code:    2
   --------------------------------------------------------------------------
   ```
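
For reference, the mangled name in the `OSError` above can be decoded by hand. This is a minimal sketch, assuming the symbol follows the simple Itanium C++ ABI nested-name form (`_ZN`, length-prefixed identifiers, `E`, then `v` for no arguments). `demangle_nested_name` is a hypothetical helper written for illustration; it is not part of MXNet or Horovod and does not handle templates, arguments, or qualifiers.

```python
import re


def demangle_nested_name(symbol: str) -> str:
    """Demangle a simple Itanium C++ ABI nested name such as _ZN3foo3barEv.

    Only the `_ZN <len><id>... E v` form (no arguments) is handled;
    any other input is returned unchanged.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    pos = 3
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        length = int(digits)
        pos += len(digits)
        parts.append(symbol[pos:pos + length])
        pos += length
    # "E" closes the nested name, "v" means an empty parameter list.
    if not parts or symbol[pos:] != "Ev":
        return symbol
    return "::".join(parts) + "()"


print(demangle_nested_name("_ZN5mxnet7Context13CudaLibChecksEv"))
# prints: mxnet::Context::CudaLibChecks()
```

So the Horovod extension was compiled against headers that reference `mxnet::Context::CudaLibChecks()`, and the dynamic loader could not resolve that symbol when `mpi_lib` was loaded. On a machine with binutils, `c++filt` performs the same demangling, and `nm -D --defined-only` on the installed `libmxnet.so` can show whether the symbol is exported.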
