Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/07/23 00:39:32 UTC

[GitHub] [incubator-mxnet] eric-haibin-lin opened a new issue #18772: horovod seg-fault with mxnet pip wheels

eric-haibin-lin opened a new issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772


   I am working on a bug fix for mxnet master with my horovod branch: https://github.com/eric-haibin-lin/horovod/tree/mx2 
   
   I noticed that the example passes if I use mxnet built from source:
   ```
   # install mxnet 
   git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
   cd mxnet
   cp config/linux.cmake config.cmake
   rm -rf build
   mkdir -p build && cd build
   cmake -GNinja ..
   cmake --build . --parallel 48
   cd ../python; python setup.py develop --user; 
   cd ./mxnet; ln -s ../../include include; ln -s ../../3rdparty 3rdparty; 
   
   # install horovod 
   cd horovod; python setup.py install --user; 
   
   # run example 
   cd example; horovodrun -np 2 mxnet2_mnist.py 
   ```
   
   However, it segfaults immediately after the first broadcast call if I use the mxnet nightly pip wheel from https://repo.mxnet.io/dist/python, such as: 
   https://repo.mxnet.io/dist/python/cpu/mxnet-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl 
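
For readers decoding the wheel name above: a wheel filename follows the PEP 427 pattern, and its last field is the platform tag. The sketch below is not part of the original report; `parse_wheel_filename` and `WheelTags` are hypothetical helper names, and it assumes a filename without the optional build tag. The `manylinux2014` tag (PEP 599) describes a CentOS 7 based baseline for the build environment, which may differ from the toolchain on the machine where Horovod is compiled.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    """The five dash-separated fields of a wheel filename (no build tag)."""
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its PEP 427 components.

    Assumption: the optional build tag is absent, so exactly five fields remain.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelTags(*parts)


# Filename taken from the nightly wheel URL quoted in the report.
tags = parse_wheel_filename(
    "mxnet-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl"
)
print(tags.platform_tag)
```

The `none` ABI tag only says the wheel does not depend on a particular CPython ABI; it makes no statement about the C++ ABI of the bundled `libmxnet.so`.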
   
   
   ```
   ----------Python Info----------
   Version      : 3.7.6
   Compiler     : GCC 7.3.1 20180712 (Red Hat 7.3.1-6)
   Build        : ('default', 'Feb 26 2020 20:54:15')
   Arch         : ('64bit', 'ELF')
   ------------Pip Info-----------
   Version      : 20.1.1
   Directory    : /home/ec2-user/.local/lib/python3.7/site-packages/pip
   ----------MXNet Info-----------
   Version      : 2.0.0
   Directory    : /home/ec2-user/src/mxnet/python/mxnet
   Num GPUs     : 0
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Linux-4.14.173-137.229.amzn2.x86_64-x86_64-with-glibc2.2.5
   system       : Linux
   node         : ip-172-31-81-80.ec2.internal
   release      : 4.14.173-137.229.amzn2.x86_64
   version      : #1 SMP Wed Apr 1 18:06:08 UTC 2020
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:        x86_64
   CPU op-mode(s):      32-bit, 64-bit
   Byte Order:          Little Endian
   CPU(s):              48
   On-line CPU(s) list: 0-47
   Thread(s) per core:  2
   Core(s) per socket:  24
   Socket(s):           1
   NUMA node(s):        1
   Vendor ID:           GenuineIntel
   CPU family:          6
   Model:               85
   Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
   Stepping:            7
   CPU MHz:             1208.761
   BogoMIPS:            4999.99
   Hypervisor vendor:   KVM
   Virtualization type: full
   L1d cache:           32K
   L1i cache:           32K
   L2 cache:            1024K
   L3 cache:            36608K
   NUMA node0 CPU(s):   0-47
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? ~~Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?~~
   
   I tried the following steps to compile in the container and it works fine. I think we can conclude that there is an ABI mismatch between the GCC 7 compiler from devtoolset-7 on CentOS 7 (https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/) and the compiler you used on AL2 (Amazon Linux 2).
   
   ```
   docker run --privileged --cap-add=NET_ADMIN --gpus=all  -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
   ```
   
   ```
   docker container list
   docker container exec -it aa5253f2282f bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   cd /usr/local/src
   wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz
   tar xf openmpi-4.0.4.tar.gz
   cd openmpi-4.0.4  # configure must run inside the extracted source tree
   ./configure --prefix=/usr/local
   make all install -j$(nproc)
   git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   
   cd examples
   yum install openssh-server
   systemctl start sshd
   /root/.local/bin/horovodrun -np 2 python /mnt/horovod/examples/mxnet2_mnist.py
   ```
   
   Output
   
   ```
   Thu Jul 23 21:04:17 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 100] Training: accuracy=0.860938
   Thu Jul 23 21:04:17 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 100] Training: accuracy=0.853594
   Thu Jul 23 21:04:18 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 200] Training: accuracy=0.908203
   Thu Jul 23 21:04:18 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 200] Training: accuracy=0.913125
   Thu Jul 23 21:04:19 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 300] Training: accuracy=0.933281
   Thu Jul 23 21:04:19 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 300] Training: accuracy=0.930937
   Thu Jul 23 21:04:20 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 400] Training: accuracy=0.942305
   Thu Jul 23 21:04:20 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 400] Training: accuracy=0.943477
   Thu Jul 23 21:04:20 2020[0]<stderr>:INFO:root:Epoch[0]  Speed=15403.68 samples/s        Time cost=3.888941
   Thu Jul 23 21:04:21 2020[0]<stderr>:INFO:root:Epoch[0]  Train: accuracy=0.947683        Validation: accuracy=0.981370
   Thu Jul 23 21:04:22 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 100] Training: accuracy=0.982031
   Thu Jul 23 21:04:22 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 100] Training: accuracy=0.980938
   Thu Jul 23 21:04:23 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 200] Training: accuracy=0.984453
   Thu Jul 23 21:04:23 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 200] Training: accuracy=0.982266
   Thu Jul 23 21:04:24 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 300] Training: accuracy=0.985000
   Thu Jul 23 21:04:24 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 300] Training: accuracy=0.983958
   Thu Jul 23 21:04:25 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 400] Training: accuracy=0.984883
   Thu Jul 23 21:04:25 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 400] Training: accuracy=0.983828
   Thu Jul 23 21:04:25 2020[0]<stderr>:INFO:root:Epoch[1]  Speed=14106.52 samples/s        Time cost=4.246548
   Thu Jul 23 21:04:26 2020[0]<stderr>:INFO:root:Epoch[1]  Train: accuracy=0.985443        Validation: accuracy=0.985877
   Thu Jul 23 21:04:27 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 100] Training: accuracy=0.988594
   Thu Jul 23 21:04:27 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 100] Training: accuracy=0.987656
   Thu Jul 23 21:04:28 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 200] Training: accuracy=0.989922
   Thu Jul 23 21:04:28 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 200] Training: accuracy=0.988125
   Thu Jul 23 21:04:29 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 300] Training: accuracy=0.989948
   Thu Jul 23 21:04:29 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 300] Training: accuracy=0.988958
   Thu Jul 23 21:04:30 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 400] Training: accuracy=0.989805
   Thu Jul 23 21:04:30 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 400] Training: accuracy=0.989062
   Thu Jul 23 21:04:30 2020[0]<stderr>:INFO:root:Epoch[2]  Speed=14098.05 samples/s        Time cost=4.249099
   Thu Jul 23 21:04:31 2020[0]<stderr>:INFO:root:Epoch[2]  Train: accuracy=0.990051        Validation: accuracy=0.988181
   Thu Jul 23 21:04:32 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 100] Training: accuracy=0.993281
   Thu Jul 23 21:04:32 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 100] Training: accuracy=0.990625
   Thu Jul 23 21:04:33 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 200] Training: accuracy=0.993359
   Thu Jul 23 21:04:33 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 200] Training: accuracy=0.991172
   Thu Jul 23 21:04:34 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 300] Training: accuracy=0.991927
   Thu Jul 23 21:04:34 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 300] Training: accuracy=0.993125
   Thu Jul 23 21:04:35 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 400] Training: accuracy=0.993008
   Thu Jul 23 21:04:35 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 400] Training: accuracy=0.992031
   Thu Jul 23 21:04:35 2020[0]<stderr>:INFO:root:Epoch[3]  Speed=14035.98 samples/s        Time cost=4.267888
   Thu Jul 23 21:04:36 2020[0]<stderr>:INFO:root:Epoch[3]  Train: accuracy=0.993323        Validation: accuracy=0.989984
   Thu Jul 23 21:04:37 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 100] Training: accuracy=0.995625
   Thu Jul 23 21:04:37 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 100] Training: accuracy=0.994219
   Thu Jul 23 21:04:38 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 200] Training: accuracy=0.995000
   Thu Jul 23 21:04:38 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 200] Training: accuracy=0.996250
   Thu Jul 23 21:04:39 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 300] Training: accuracy=0.995260
   Thu Jul 23 21:04:39 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 300] Training: accuracy=0.995313
   Thu Jul 23 21:04:40 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 400] Training: accuracy=0.995039
   Thu Jul 23 21:04:40 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 400] Training: accuracy=0.995195
   Thu Jul 23 21:04:40 2020[0]<stderr>:INFO:root:Epoch[4]  Speed=14055.33 samples/s        Time cost=4.262014
   Thu Jul 23 21:04:41 2020[0]<stderr>:INFO:root:Epoch[4]  Train: accuracy=0.995493        Validation: accuracy=0.991486
   ```
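
One way to look for evidence of such a mismatch is to compare what each shared library records about its toolchain. The following Python sketch is an assumption-laden heuristic, not something used in this thread: `abi_hints_from_bytes` and `abi_hints` are hypothetical names, the scan relies on GCC's `.comment` identification strings surviving in the binary, and it treats the presence of `__cxx11` in mangled symbol names as a sign that libstdc++'s newer dual-ABI types are in use. A difference between two libraries is only a hint, and matching results do not prove compatibility.

```python
import re
from pathlib import Path

# GCC normally records strings such as "GCC: (GNU) <version> ..." in the
# .comment section; they are NUL-terminated inside the file.
GCC_IDENT = re.compile(rb"GCC: \([^)\x00]*\) [0-9][^\x00]*")


def abi_hints_from_bytes(data: bytes):
    """Return (sorted compiler ident strings, uses_cxx11_abi) for raw bytes.

    Heuristic only: this is a byte scan, not a real ELF parser.
    """
    idents = sorted(
        {m.group(0).decode("ascii", "replace") for m in GCC_IDENT.finditer(data)}
    )
    # Symbols that use the newer libstdc++ std::string/std::list live in the
    # std::__cxx11 namespace, so "__cxx11" shows up in their mangled names.
    uses_cxx11_abi = b"__cxx11" in data
    return idents, uses_cxx11_abi


def abi_hints(path):
    """Convenience wrapper that reads a shared library from disk."""
    return abi_hints_from_bytes(Path(path).read_bytes())
```

The intended use would be to run `abi_hints` on `libmxnet.so` from the wheel and on the Horovod extension module built on the host, then compare the two results; the actual paths depend on the installation.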





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   git checkout mx2
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   ```





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --privileged --cap-add=NET_ADMIN --gpus=all  -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
   ```
   
   ```
   docker container list
   docker container exec -it aa5253f2282f bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install -y openmpi-devel
   git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   ```





[GitHub] [incubator-mxnet] leezu commented on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-663235192


   We may want to remove the C++ API headers from the pip package to prevent anyone from relying on the C++ ABI by mistake.








[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   mkdir ~/horovod
   docker run --gpus=all -v /home/ubuntu/horovod/:/mnt -it mxnetci/build.centos7_gpu_cu102 /bin/bash
   cd /mnt
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   ```





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   git checkout mx2
   git submodule update --init --recursive
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   ```





[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-663264152


   Thanks for the investigation, and good catch about the C++ headers. I agree: we need to rewrite the integration code using only the C APIs to avoid ABI compatibility issues.
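
A C API keeps the boundary to plain C types (integers, pointers, C strings), whose layout is fixed by the platform C ABI rather than by the C++ standard library in use. As a rough illustration of that calling style, here is a Python `ctypes` sketch; it calls `strlen` from the system C library as a stand-in, because `libmxnet.so` may not be available to the reader, and `c_strlen` is a hypothetical wrapper name. It assumes a POSIX system where the C library can be loaded.

```python
import ctypes
import ctypes.util

# Load the system C library. On POSIX, CDLL(None) falls back to the symbols
# already loaded into the current process if find_library returns None.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature explicitly: size_t strlen(const char *s).
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call C strlen; only a char pointer and an integer cross the boundary."""
    return libc.strlen(data)


print(c_strlen(b"horovod"))
```

MXNet's Python frontend loads `libmxnet.so` through `ctypes` as well, so a Horovod integration restricted to the exported C functions would cross the library boundary in the same style.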





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   git checkout mx2
   git submodule update --init --recursive
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   ```





[GitHub] [incubator-mxnet] leezu commented on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod isn't compiled with the same ABI as the MXNet binary wheel. Thus it crashes.





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-663235192


   We may want to remove the C++ API headers from the pip package to prevent anyone from relying on the C++ ABI by mistake. I think that as soon as someone uses the C++ API headers to create C++ objects in their library and then passes them to libmxnet.so via some C API or even the Python API, an ABI mismatch can cause a crash.
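
To make the failure mode concrete, the Python `ctypes` sketch below models two builds that disagree on the size of one member. The struct names and fields are hypothetical and are not MXNet's real types; the 32-byte and 8-byte sizes are loosely modeled on the two `std::string` representations in libstdc++ on x86-64. Both sides name the same field, yet they look for it at different offsets.

```python
import ctypes


class VarInLibrary(ctypes.Structure):
    """Hypothetical layout the library was compiled with."""
    _fields_ = [("name", ctypes.c_char * 32), ("lock_word", ctypes.c_uint32)]


class VarInClient(ctypes.Structure):
    """Hypothetical layout a client derives from the same header under
    a different ABI, where the first member is smaller."""
    _fields_ = [("name", ctypes.c_char * 8), ("lock_word", ctypes.c_uint32)]


# The library creates the object and hands its raw memory to the client.
as_built = VarInLibrary(name=b"0.bias", lock_word=1)
as_seen = VarInClient.from_buffer_copy(bytes(as_built))

print("offsets:", VarInLibrary.lock_word.offset, VarInClient.lock_word.offset)
print("lock_word as built:", as_built.lock_word)
print("lock_word as seen: ", as_seen.lock_word)
```

Here the client reads bytes that belong to `name` and takes them for `lock_word`. In native code the client would go on to treat those bytes as a real object, for example a lock, which is the kind of access that can end in a segfault.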





[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662772210


   ```
   [1,0]<stdout>:(gdb) bt
   [1,0]<stdout>:#0  0x00007ffff7419b80 in pthread_mutex_lock () from /lib64/libpthread.so.0
   [1,0]<stdout>:#1  0x00007fff68a1b81d in mxnet::engine::ThreadedVar::AppendWriteDependency(mxnet::engine::OprBlock*) ()
   [1,0]<stdout>:   from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
   [1,0]<stdout>:#2  0x00007fff68a176ff in mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool) ()
   [1,0]<stdout>:   from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
   [1,0]<stdout>:#3  0x00007fff68a147a7 in mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool) ()
   [1,0]<stdout>:   from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
   [1,0]<stdout>:#4  0x00007fff688f5f42 in MXEnginePushAsync ()
   [1,0]<stdout>:   from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
   [1,0]<stdout>:#5  0x00007ffdcc11ace9 in horovod::mxnet::PushHorovodOperation (
   [1,0]<stdout>:    op_type=op_type@entry=horovod::common::Request::BROADCAST,
   [1,0]<stdout>:    input=input@entry=0x182fb90, output=output@entry=0x182fb90,
   [1,0]<stdout>:    name=name@entry=0x7ffdd5e63f20 "0.bias", priority=priority@entry=0,
   [1,0]<stdout>:    root_rank=root_rank@entry=0) at horovod/mxnet/mpi_ops.cc:138
   [1,0]<stdout>:#6  0x00007ffdcc116010 in horovod::mxnet::horovod_mxnet_broadcast_async (
   [1,0]<stdout>:    input=0x182fb90, output=0x182fb90, name=0x7ffdd5e63f20 "0.bias",
   [1,0]<stdout>:    root_rank=0, priority=0) at horovod/mxnet/mpi_ops.cc:301
   ```





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --privileged --cap-add=NET_ADMIN --gpus=all  -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
   ```
   
   ```
   docker container list
   docker container exec -it aa5253f2282f bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   ```





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and based on them interacts with the Engine: 
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run -it mxnetci/build.centos7_gpu_cu102 /bin/bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   git checkout mx2
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 python setup.py install --user
   ```








[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI, so your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container that is used to build the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   git checkout mx2
   git submodule update --init --recursive
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   ```





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI, so your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container that is used to build the binary wheels?





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI, so your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container that is used to build the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --privileged --cap-add=NET_ADMIN --gpus=all  -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
   ```
   
   ```
   docker container list
   docker container exec -it aa5253f2282f bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install -y openmpi-devel
   git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   
   cd examples
   PATH=/usr/lib64/openmpi/bin/:$PATH /root/.local/bin/horovodrun -np 2 mxnet2_mnist.py
   ```





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI, so your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container that is used to build the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --privileged --cap-add=NET_ADMIN --gpus=all  -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
   ```
   
   ```
   docker container list
   docker container exec -it aa5253f2282f bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   cd /usr/local/src
   wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz
   tar xf openmpi-4.0.4.tar.gz
   cd openmpi-4.0.4
   ./configure --prefix=/usr/local
   make all install -j$(nproc)
   git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   
   cd examples
   /root/.local/bin/horovodrun -np 2 mxnet2_mnist.py
   ```





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI, so your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash?





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI, so your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container that is used to build the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   git checkout mx2
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 python setup.py install --user
   ```





[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
   
   https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   But C++ does not have a stable ABI, so your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container that is used to build the binary wheels?
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --privileged --cap-add=NET_ADMIN --gpus=all  -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
   ```
   
   ```
   docker container list
   docker container exec -it aa5253f2282f bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install -y openmpi-devel
   git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
   
   cd examples
   PATH=/usr/lib64/openmpi/bin/:$PATH /root/.local/bin/horovodrun -np 2 mxnet2_mnist.py
   ```

