Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/07/23 00:39:32 UTC
[GitHub] [incubator-mxnet] eric-haibin-lin opened a new issue #18772: horovod seg-fault with mxnet pip wheels
eric-haibin-lin opened a new issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772
I am working on a bug fix for mxnet master with my horovod branch: https://github.com/eric-haibin-lin/horovod/tree/mx2
I noticed that the example passes if I use mxnet built from source:
```
# install mxnet
git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
cd mxnet
cp config/linux.cmake config.cmake
rm -rf build
mkdir -p build && cd build
cmake -GNinja ..
cmake --build . --parallel 48
cd ../python; python setup.py develop --user;
cd ./mxnet; ln -s ../../include include; ln -s ../../3rdparty 3rdparty;
# install horovod
cd horovod; python setup.py install --user;
# run example
cd examples; horovodrun -np 2 python mxnet2_mnist.py
```
However, it segfaults immediately after the first broadcast call if I use an mxnet nightly pip wheel from https://repo.mxnet.io/dist/python, such as:
https://repo.mxnet.io/dist/python/cpu/mxnet-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
```
----------Python Info----------
Version : 3.7.6
Compiler : GCC 7.3.1 20180712 (Red Hat 7.3.1-6)
Build : ('default', 'Feb 26 2020 20:54:15')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 20.1.1
Directory : /home/ec2-user/.local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version : 2.0.0
Directory : /home/ec2-user/src/mxnet/python/mxnet
Num GPUs : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.14.173-137.229.amzn2.x86_64-x86_64-with-glibc2.2.5
system : Linux
node : ip-172-31-81-80.ec2.internal
release : 4.14.173-137.229.amzn2.x86_64
version : #1 SMP Wed Apr 1 18:06:08 UTC 2020
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping: 7
CPU MHz: 1208.761
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? ~~Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?~~
I tried the following steps to compile in the container and it works fine. I think we can conclude that there is an ABI mismatch between the gcc7 compiler provided for CentOS 7 by devtoolset-7 (https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/) and the compiler you used on AL2 (Amazon Linux 2).
```
docker run --privileged --cap-add=NET_ADMIN --gpus=all -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
```
```
docker container list
docker container exec -it aa5253f2282f bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
cd /usr/local/src
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz
tar xf openmpi-4.0.4.tar.gz
cd openmpi-4.0.4
./configure --prefix=/usr/local
make all install -j$(nproc)
cd ..
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
cd examples
yum install openssh-server
systemctl start sshd
/root/.local/bin/horovodrun -np 2 python /mnt/horovod/examples/mxnet2_mnist.py
```
Output
```
Thu Jul 23 21:04:17 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 100] Training: accuracy=0.860938
Thu Jul 23 21:04:17 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 100] Training: accuracy=0.853594
Thu Jul 23 21:04:18 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 200] Training: accuracy=0.908203
Thu Jul 23 21:04:18 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 200] Training: accuracy=0.913125
Thu Jul 23 21:04:19 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 300] Training: accuracy=0.933281
Thu Jul 23 21:04:19 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 300] Training: accuracy=0.930937
Thu Jul 23 21:04:20 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 400] Training: accuracy=0.942305
Thu Jul 23 21:04:20 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 400] Training: accuracy=0.943477
Thu Jul 23 21:04:20 2020[0]<stderr>:INFO:root:Epoch[0] Speed=15403.68 samples/s Time cost=3.888941
Thu Jul 23 21:04:21 2020[0]<stderr>:INFO:root:Epoch[0] Train: accuracy=0.947683 Validation: accuracy=0.981370
Thu Jul 23 21:04:22 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 100] Training: accuracy=0.982031
Thu Jul 23 21:04:22 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 100] Training: accuracy=0.980938
Thu Jul 23 21:04:23 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 200] Training: accuracy=0.984453
Thu Jul 23 21:04:23 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 200] Training: accuracy=0.982266
Thu Jul 23 21:04:24 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 300] Training: accuracy=0.985000
Thu Jul 23 21:04:24 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 300] Training: accuracy=0.983958
Thu Jul 23 21:04:25 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 400] Training: accuracy=0.984883
Thu Jul 23 21:04:25 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 400] Training: accuracy=0.983828
Thu Jul 23 21:04:25 2020[0]<stderr>:INFO:root:Epoch[1] Speed=14106.52 samples/s Time cost=4.246548
Thu Jul 23 21:04:26 2020[0]<stderr>:INFO:root:Epoch[1] Train: accuracy=0.985443 Validation: accuracy=0.985877
Thu Jul 23 21:04:27 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 100] Training: accuracy=0.988594
Thu Jul 23 21:04:27 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 100] Training: accuracy=0.987656
Thu Jul 23 21:04:28 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 200] Training: accuracy=0.989922
Thu Jul 23 21:04:28 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 200] Training: accuracy=0.988125
Thu Jul 23 21:04:29 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 300] Training: accuracy=0.989948
Thu Jul 23 21:04:29 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 300] Training: accuracy=0.988958
Thu Jul 23 21:04:30 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 400] Training: accuracy=0.989805
Thu Jul 23 21:04:30 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 400] Training: accuracy=0.989062
Thu Jul 23 21:04:30 2020[0]<stderr>:INFO:root:Epoch[2] Speed=14098.05 samples/s Time cost=4.249099
Thu Jul 23 21:04:31 2020[0]<stderr>:INFO:root:Epoch[2] Train: accuracy=0.990051 Validation: accuracy=0.988181
Thu Jul 23 21:04:32 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 100] Training: accuracy=0.993281
Thu Jul 23 21:04:32 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 100] Training: accuracy=0.990625
Thu Jul 23 21:04:33 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 200] Training: accuracy=0.993359
Thu Jul 23 21:04:33 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 200] Training: accuracy=0.991172
Thu Jul 23 21:04:34 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 300] Training: accuracy=0.991927
Thu Jul 23 21:04:34 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 300] Training: accuracy=0.993125
Thu Jul 23 21:04:35 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 400] Training: accuracy=0.993008
Thu Jul 23 21:04:35 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 400] Training: accuracy=0.992031
Thu Jul 23 21:04:35 2020[0]<stderr>:INFO:root:Epoch[3] Speed=14035.98 samples/s Time cost=4.267888
Thu Jul 23 21:04:36 2020[0]<stderr>:INFO:root:Epoch[3] Train: accuracy=0.993323 Validation: accuracy=0.989984
Thu Jul 23 21:04:37 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 100] Training: accuracy=0.995625
Thu Jul 23 21:04:37 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 100] Training: accuracy=0.994219
Thu Jul 23 21:04:38 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 200] Training: accuracy=0.995000
Thu Jul 23 21:04:38 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 200] Training: accuracy=0.996250
Thu Jul 23 21:04:39 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 300] Training: accuracy=0.995260
Thu Jul 23 21:04:39 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 300] Training: accuracy=0.995313
Thu Jul 23 21:04:40 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 400] Training: accuracy=0.995039
Thu Jul 23 21:04:40 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 400] Training: accuracy=0.995195
Thu Jul 23 21:04:40 2020[0]<stderr>:INFO:root:Epoch[4] Speed=14055.33 samples/s Time cost=4.262014
Thu Jul 23 21:04:41 2020[0]<stderr>:INFO:root:Epoch[4] Train: accuracy=0.995493 Validation: accuracy=0.991486
```
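One rough way to probe the ABI hypothesis above is to compare the mangled symbol names of the two shared libraries. The sketch below is only a heuristic and rests on an assumption: it looks for the markers that libstdc++ adds under its newer "cxx11" ABI, which is just one possible source of a mismatch and is not confirmed in this thread as the cause. The symbol lists and function names are hypothetical; real input could come from something like `nm -D --defined-only` run on each library.

```python
import re

# Markers that libstdc++ puts into mangled names built against its newer
# ("cxx11") ABI: the std::__cxx11 inline namespace and the [abi:cxx11] tag.
_CXX11_MARKERS = re.compile(r"__cxx11|B5cxx11")


def count_cxx11_symbols(symbols):
    """Return how many mangled symbol names carry a cxx11 ABI marker."""
    return sum(1 for name in symbols if _CXX11_MARKERS.search(name))


def same_string_abi(symbols_a, symbols_b):
    """Heuristic: True when both lists agree on using (or not using) cxx11 markers."""
    return (count_cxx11_symbols(symbols_a) > 0) == (count_cxx11_symbols(symbols_b) > 0)


# Hypothetical symbol lists standing in for the output of `nm` on two libraries.
old_abi_lib = ["_ZNSs6appendEPKc", "_ZN5mxnet6EngineD2Ev"]
new_abi_lib = ["_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE6appendEPKc"]

print("cxx11 symbols (old, new):",
      count_cxx11_symbols(old_abi_lib), count_cxx11_symbols(new_abi_lib))
print("same string ABI:", same_string_abi(old_abi_lib, new_abi_lib))
```

A library that exposes no `std::string` or `std::list` in its symbols shows no markers under either setting, so agreement here does not prove the two builds are compatible.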
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
I tried the following steps to compile in the container:
```
docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive https://github.com/eric-haibin-lin/horovod.git
cd horovod
git checkout mx2
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
I tried the following steps to compile in the container:
```
docker run --privileged --cap-add=NET_ADMIN --gpus=all -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
```
```
docker container list
docker container exec -it aa5253f2282f bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install -y openmpi-devel
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu commented on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-663235192
We may want to remove the C++ API headers from the pip package, to prevent anyone from relying on the C++ ABI by mistake.
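To see whether an installed package ships header files at all, a directory walk is enough. This is a minimal sketch; the function name, the suffix list, and the example path are assumptions for illustration, and where a given wheel places its headers is not stated in this thread.

```python
from pathlib import Path

# Suffixes treated as header files in this sketch (an assumption, not a complete list).
HEADER_SUFFIXES = {".h", ".hpp", ".hh", ".cuh"}


def list_headers(package_dir):
    """Return sorted header paths found under package_dir, relative to it."""
    root = Path(package_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix in HEADER_SUFFIXES
    )
```

For example, `list_headers("/path/to/site-packages/mxnet")` (a hypothetical path) would list any headers that a third-party library could compile against.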
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
I tried the following steps to compile in the container:
```
mkdir ~/horovod
docker run --gpus=all -v /home/ubuntu/horovod/:/mnt -it mxnetci/build.centos7_gpu_cu102 /bin/bash
cd /mnt
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
I tried the following steps to compile in the container:
```
docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive https://github.com/eric-haibin-lin/horovod.git
cd horovod
git checkout mx2
git submodule update --init --recursive
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-663264152
Thanks for the investigation and good catch about the C++ headers. I agree. We need to rewrite the integration code using only the C APIs to avoid ABI compatibility issues.
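As a loose illustration of what a C-only boundary looks like, the sketch below calls a C function through `ctypes`. It is not the Horovod integration: `strlen` from the C library stands in for a C API entry point (such as the `MXEnginePushAsync` frame seen in the backtrace), and `CDLL(None)` assumes a POSIX system. The point is that only plain C types (a pointer in, an integer out) cross the library boundary, and their layout is fixed by the platform's C ABI rather than by a particular C++ compiler or standard library.

```python
import ctypes

# Assumption: a POSIX system, where CDLL(None) exposes the C library's symbols.
libc = ctypes.CDLL(None)

# Declare the C signature explicitly: size_t strlen(const char *s).
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call strlen through the C ABI; a pointer goes in and an integer comes out."""
    return libc.strlen(data)


print(c_strlen(b"horovod"))
```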
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
I tried the following steps to compile in the container:
```
docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive https://github.com/eric-haibin-lin/horovod.git
cd horovod
git checkout mx2
git submodule update --init --recursive
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu commented on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod isn't compiled with the same ABI as the MXNet binary wheel. Thus it crashes.
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-663235192
We may want to remove the C++ API headers from the pip package, to prevent anyone from relying on the C++ ABI by mistake. I think as soon as someone uses the C++ API headers to create C++ objects in their library and then passes them to libmxnet.so via some C API or even a Python API, there can be an ABI mismatch that causes a crash.
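The failure mode described here can be pictured with a small analogy: two pieces of code that disagree about how the same bytes are laid out. The sketch below uses `ctypes` structures to mimic that. Both layouts are hypothetical and chosen only for illustration; they do not reflect the real layout of any MXNet or standard-library type.

```python
import ctypes


class ProducerLayout(ctypes.Structure):
    """Layout assumed by the code that creates the object (hypothetical)."""
    _fields_ = [("size", ctypes.c_uint64), ("data_ptr", ctypes.c_uint64)]


class ConsumerLayout(ctypes.Structure):
    """Layout assumed by the library that receives it (hypothetical, fields swapped)."""
    _fields_ = [("data_ptr", ctypes.c_uint64), ("size", ctypes.c_uint64)]


def reinterpret(obj, layout):
    """Read the raw bytes of obj as if they had the given layout."""
    return layout.from_buffer_copy(bytes(obj))


produced = ProducerLayout(size=3, data_ptr=0x7FFF0000)
seen = reinterpret(produced, ConsumerLayout)

# The consumer now treats the size (3) as a pointer, the kind of confusion
# that can end in a segfault once the value is dereferenced.
print(hex(seen.data_ptr), seen.size)
```

Nothing in the bytes records which layout produced them, so the receiving side cannot detect the disagreement; it simply reads the wrong fields.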
----------------------------------------------------------------
[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662772210
```
[1,0]<stdout>:(gdb) bt
[1,0]<stdout>:#0 0x00007ffff7419b80 in pthread_mutex_lock () from /lib64/libpthread.so.0
[1,0]<stdout>:#1 0x00007fff68a1b81d in mxnet::engine::ThreadedVar::AppendWriteDependency(mxnet::engine::OprBlock*) ()
[1,0]<stdout>: from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
[1,0]<stdout>:#2 0x00007fff68a176ff in mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool) ()
[1,0]<stdout>: from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
[1,0]<stdout>:#3 0x00007fff68a147a7 in mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool) ()
[1,0]<stdout>: from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
[1,0]<stdout>:#4 0x00007fff688f5f42 in MXEnginePushAsync ()
[1,0]<stdout>: from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
[1,0]<stdout>:#5 0x00007ffdcc11ace9 in horovod::mxnet::PushHorovodOperation (
[1,0]<stdout>: op_type=op_type@entry=horovod::common::Request::BROADCAST,
[1,0]<stdout>: input=input@entry=0x182fb90, output=output@entry=0x182fb90,
[1,0]<stdout>: name=name@entry=0x7ffdd5e63f20 "0.bias", priority=priority@entry=0,
[1,0]<stdout>: root_rank=root_rank@entry=0) at horovod/mxnet/mpi_ops.cc:138
[1,0]<stdout>:#6 0x00007ffdcc116010 in horovod::mxnet::horovod_mxnet_broadcast_async (
[1,0]<stdout>: input=0x182fb90, output=0x182fb90, name=0x7ffdd5e63f20 "0.bias",
[1,0]<stdout>: root_rank=0, priority=0) at horovod/mxnet/mpi_ops.cc:301
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
I tried the following steps to compile in the container:
```
docker run --privileged --cap-add=NET_ADMIN --gpus=all -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
```
```
docker container list
docker container exec -it aa5253f2282f bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
I tried the following steps to compile in the container:
```
docker run -it mxnetci/build.centos7_gpu_cu102 /bin/bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive https://github.com/eric-haibin-lin/horovod.git
cd horovod
git checkout mx2
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 python setup.py install --user
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
But C++ does not have a stable ABI and your Horovod may not be compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container as used for building the binary wheels?
I tried the following steps to compile in the container:
```
docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive https://github.com/eric-haibin-lin/horovod.git
cd horovod
git checkout mx2
git submodule update --init --recursive
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
```
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
However, C++ does not have a stable ABI, and your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this while building Horovod in the same container that is used to build the binary wheels?
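One rough way to probe the ABI question is to check whether each shared library references symbols from libstdc++'s `__cxx11` inline namespace, which typically show up only when code is built with `_GLIBCXX_USE_CXX11_ABI=1`. The following is a minimal sketch, assuming `nm` from binutils is installed; the helper names `count_cxx11_symbols` and `guess_abi` and the paths in the usage comments are hypothetical, not part of MXNet or Horovod.

```shell
#!/usr/bin/env bash

# Count lines of an `nm` listing (read from stdin) that mention the
# __cxx11 inline namespace used by libstdc++'s newer std::string ABI.
# `|| true` keeps the exit status at 0 when grep finds no match.
count_cxx11_symbols() {
    grep -c '__cxx11' || true
}

# Print a rough guess for one shared library, given its path as $1.
guess_abi() {
    local n
    n=$(nm -D "$1" 2>/dev/null | count_cxx11_symbols)
    if [ "$n" -gt 0 ]; then
        echo "references __cxx11 symbols (new libstdc++ ABI)"
    else
        echo "no __cxx11 symbols found (old ABI, or no std::string in the interface)"
    fi
}

# Usage (hypothetical paths, adjust to the actual install locations):
#   guess_abi /path/to/site-packages/mxnet/libmxnet.so
#   guess_abi /path/to/site-packages/horovod/mxnet/mpi_lib.so
```

If the two libraries disagree, that points at a libstdc++ dual-ABI mismatch. Agreement does not rule out other incompatibilities, such as different compile-time definitions changing the layout of types declared in the shared headers.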
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
However, C++ does not have a stable ABI, and your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this while building Horovod in the same container that is used to build the binary wheels?
I tried the following steps to compile in the container:
```
docker run --privileged --cap-add=NET_ADMIN --gpus=all -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
```
```
docker container list
docker container exec -it aa5253f2282f bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install -y openmpi-devel
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
cd examples
PATH=/usr/lib64/openmpi/bin/:$PATH /root/.local/bin/horovodrun -np 2 mxnet2_mnist.py
```
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
However, C++ does not have a stable ABI, and your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this while building Horovod in the same container that is used to build the binary wheels?
I tried the following steps to compile in the container:
```
docker run --privileged --cap-add=NET_ADMIN --gpus=all -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
```
```
docker container list
docker container exec -it aa5253f2282f bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
cd /usr/local/src
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz
tar xf openmpi-4.0.4.tar.gz
cd openmpi-4.0.4
./configure --prefix=/usr/local
make all install -j$(nproc)
cd ..
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
cd examples
/root/.local/bin/horovodrun -np 2 mxnet2_mnist.py
```
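Since these steps build Open MPI 4.0.4 into `/usr/local`, it can help to confirm that the `mpirun` found on `PATH` is the freshly built one rather than another installation. A small sketch, assuming `mpirun --version` prints a first line of the form `mpirun (Open MPI) X.Y.Z`; the helper name `openmpi_version` is hypothetical.

```shell
#!/usr/bin/env bash

# Extract the version number from `mpirun --version` output read on stdin.
# Prints nothing if the expected "mpirun (Open MPI) X.Y.Z" line is absent.
openmpi_version() {
    sed -n 's/^mpirun (Open MPI) \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage:
#   command -v mpirun                    # shows which mpirun is first on PATH
#   mpirun --version | openmpi_version   # shows the version of that mpirun
```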
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
However, C++ does not have a stable ABI, and your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash?
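To see which libstdc++ ABI a given compiler selects by default, one option is to dump its predefined macros after including a standard header and read `_GLIBCXX_USE_CXX11_ABI`. This is only a sketch, assuming a GCC-style `g++` on `PATH`; the helper name `cxx11_abi_default` is hypothetical.

```shell
#!/usr/bin/env bash

# Read preprocessor macro definitions on stdin and print the value of
# _GLIBCXX_USE_CXX11_ABI, or "undefined" if the macro is not present.
cxx11_abi_default() {
    awk '$1 == "#define" && $2 == "_GLIBCXX_USE_CXX11_ABI" { print $3; found = 1 }
         END { if (!found) print "undefined" }'
}

# Usage: run in each environment (the one building Horovod and the one
# that built the wheel) and compare the results.
#   echo '#include <string>' | g++ -x c++ -dM -E - | cxx11_abi_default
```

A value of `1` means the newer `std::string` and `std::list` ABI is the default, and `0` means the older one. A build can still override the default with `-D_GLIBCXX_USE_CXX11_ABI=...`, so this shows the toolchain default rather than what a particular binary was actually built with.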
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
However, C++ does not have a stable ABI, and your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this while building Horovod in the same container that is used to build the binary wheels?
I tried the following steps to compile in the container:
```
docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive https://github.com/eric-haibin-lin/horovod.git
cd horovod
git checkout mx2
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 python setup.py install --user
```
[GitHub] [incubator-mxnet] leezu edited a comment on issue #18772: horovod seg-fault with mxnet pip wheels
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine:
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
However, C++ does not have a stable ABI, and your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this while building Horovod in the same container that is used to build the binary wheels?
I tried the following steps to compile in the container:
```
docker run --privileged --cap-add=NET_ADMIN --gpus=all -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
```
```
docker container list
docker container exec -it aa5253f2282f bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install -y openmpi-devel
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
cd examples
PATH=/usr/lib64/openmpi/bin/:$PATH /root/.local/bin/horovodrun -np 2 mxnet2_mnist.py
```