Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/04/10 04:46:01 UTC

[GitHub] [incubator-mxnet] eric-haibin-lin opened a new issue #18014: enabling mkldnn leads to segfault in bytePS

eric-haibin-lin opened a new issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014
 
 
   I am using the `bytepsimage/mxnet` Docker image to test BytePS. However, I find that recent MXNet versions lead to a segfault.
   
   Specifically, if I build the mxnet-cu100 variant at commit b6b1de092b2bbc6ab7207a98dcb1c08fe67ca14b, the following commands work:
   ```
   docker pull bytepsimage/mxnet
   
   nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/mxnet bash
   
   # now you are in docker environment
   export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # gpus list
   export DMLC_WORKER_ID=0 # your worker id
   export DMLC_NUM_WORKER=1 # one worker
   export DMLC_ROLE=worker 
   
   # the following values do not matter for non-distributed jobs
   export DMLC_NUM_SERVER=1 
   export DMLC_PS_ROOT_URI=10.0.0.1 
   export DMLC_PS_ROOT_PORT=1234 
   
   bpslaunch python3 /usr/local/byteps/example/mxnet/train_gluon_mnist_byteps.py
   
   ```
   
   However, if I use commit 2f6cdd383abbf46a37b84a5fad013726b5c62169, I get a segfault.
   I used `source tools/staticbuild/build.sh mxnet-cu100 pip` to build the pip package.
   
   The latest nightly build also segfaults.
   
   @TaoLv any idea why?
   
   Related issue: https://github.com/bytedance/byteps/issues/222

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-mxnet] pengzhao-intel commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
pengzhao-intel commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-611876912
 
 
   Is there any test case for bytePS to cover this kind of issue?   


[GitHub] [incubator-mxnet] TaoLv edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
TaoLv edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612352091
 
 
   @eric-haibin-lin @ymjiang I still can't see anything related to MKL-DNN operators in the backtrace. I agree that there might be some incompatibility between the MKL-DNN integration and the later BytePS integration.
   BTW, have you tried NaiveEngine? Can we get some clues from there?


[GitHub] [incubator-mxnet] xinyu-intel commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
xinyu-intel commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612375492
 
 
   Building the latest MXNet without MKLDNN also hits this issue:
   ```
   cmake -DCMAKE_BUILD_TYPE=Debug -DUSE_MKL_IF_AVAILABLE=OFF -DUSE_CUDA=ON -DUSE_MKLDNN=OFF -G Ninja ..
   ```
   ```
   Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
   __GI___pthread_mutex_lock (mutex=0x3a6772617f) at ../nptl/pthread_mutex_lock.c:65
   65	../nptl/pthread_mutex_lock.c: No such file or directory.
   #0  __GI___pthread_mutex_lock (mutex=0x3a6772617f) at ../nptl/pthread_mutex_lock.c:65
   #1  0x00007fa43648b65b in __gthread_mutex_lock (__mutex=0x3a6772617f) at /usr/include/x86_64-linux-gnu/c++/7/bits/gthr-default.h:748
   #2  0x00007fa4364adf3a in std::mutex::lock (this=0x3a6772617f) at /usr/include/c++/7/bits/std_mutex.h:103
   #3  0x00007fa4364c5bf4 in std::lock_guard<std::mutex>::lock_guard (this=0x7ffd3340e270, __m=...) at /usr/include/c++/7/bits/std_mutex.h:162
   #4  0x00007fa4366b2d6e in mxnet::engine::ThreadedVar::AppendWriteDependency (this=0x3a6772615f, opr_block=0x2f08190) at ../src/engine/threaded_engine.cc:74
   #5  0x00007fa4366af4f7 in mxnet::engine::ThreadedEngine::Push (this=0x2f053a0, op=0x2f06630, exec_ctx=..., priority=0, profiling=false) at ../src/engine/threaded_engine.cc:311
   #6  0x00007fa4366af924 in mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool) (this=0x2f053a0, fn=..., exec_ctx=..., const_vars=std::vector of length 0, capacity 0, mutable_vars=std::vector of length 1, capacity 1 = {...}, prop=mxnet::FnProperty::kCPUPrioritized, priority=0, opr_name=0x7fa2b16372ec "BytePSPushPull", wait=false) at ../src/engine/threaded_engine.cc:343
   #7  0x00007fa4364a72f6 in MXEnginePushAsync (async_func=0x7fa2b15659f0 <byteps::mxnet::DoPushPull(void*, void*, void*)>, func_param=0x6ee84170, deleter=0x7fa2b1565040 <byteps::mxnet::(anonymous namespace)::DeletePushPullParam(void*)>, ctx_handle=0x7fa2b7ff8a40 <byteps::mxnet::(anonymous namespace)::MX_EXEC_CTX>, const_vars_handle=0x0, num_const_vars=0, mutable_vars_handle=0x7ffd3340e8a8, num_mutable_vars=1, prop_handle=0x7fa2b1637380 <byteps::mxnet::(anonymous namespace)::MX_FUNC_PROP>, priority=0, opr_name=0x7fa2b16372ec "BytePSPushPull", wait=false) at ../src/c_api/c_api.cc:2665
   #8  0x00007fa2b156579d in byteps::mxnet::byteps_mxnet_push_pull_async (tensor=0x6d41f620, name=<optimized out>, version=0, priority=0, is_average=<optimized out>) at byteps/mxnet/ops.cc:116
   #9  0x00007fa50630fdae in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #10 0x00007fa50630f71f in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #11 0x00007fa5065235c4 in _ctypes_callproc () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
   #12 0x00007fa506523c33 in ?? () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
   ```
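A quick sanity check on the addresses in this trace, as a Python sketch. The constants are copied from frames #0 and #4 above; the interpretation is an assumption, not something confirmed in the thread. The mutex address sits 0x20 bytes past `this`, and the bytes of `this` happen to decode as printable ASCII, which would be consistent with a handle that points at unrelated data rather than at a live `ThreadedVar`.

```python
# Addresses copied from frames #4 and #0 of the backtrace above.
THIS_PTR = 0x3A6772615F   # ThreadedVar::AppendWriteDependency (this=...)
MUTEX_PTR = 0x3A6772617F  # __GI___pthread_mutex_lock (mutex=...)

# Distance between the object pointer and the mutex being locked.
mutex_offset = MUTEX_PTR - THIS_PTR

# View the pointer value as it would sit in memory on x86-64 (little-endian),
# drop the zero padding, and try to read it as text.
raw = THIS_PTR.to_bytes(8, "little")
as_text = raw.rstrip(b"\x00").decode("ascii")

print(hex(mutex_offset))  # 0x20
print(repr(as_text))      # '_arg:'
```

A pointer whose bytes spell out text is only a hint; it does not say where the bad handle came from.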


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612531452
 
 
   @xinyu-intel the pip build is built using make. Actually, I'm not sure whether any version/commit built with cmake works with BytePS.


[GitHub] [incubator-mxnet] eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612137404
 
 
   With @leezu's help I built MXNet in debug mode at commit 2f6cdd383abbf46a37b84a5fad013726b5c62169. Here's the stack trace with line numbers:
   ```
   Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
   __GI___pthread_mutex_lock (mutex=0x20) at ../nptl/pthread_mutex_lock.c:65
   65      ../nptl/pthread_mutex_lock.c: No such file or directory.
   #0  __GI___pthread_mutex_lock (mutex=0x20) at ../nptl/pthread_mutex_lock.c:65
   #1  0x00007fff5314d857 in __gthread_mutex_lock (__mutex=0x20) at /usr/include/x86_64-linux-gnu/c++/7/bits/gthr-default.h:748
   #2  0x00007fff5316b1c6 in std::mutex::lock (this=0x20) at /usr/include/c++/7/bits/std_mutex.h:103
   #3  0x00007fff53182d14 in std::lock_guard<std::mutex>::lock_guard (this=0x7fffffffcb30, __m=...) at /usr/include/c++/7/bits/std_mutex.h:162
   #4  0x00007fff53359852 in mxnet::engine::ThreadedVar::AppendWriteDependency (this=0x0, opr_block=0x294c2a8) at ../src/engine/threaded_engine.cc:74
   #5  0x00007fff53355fd2 in mxnet::engine::ThreadedEngine::Push (this=0x2946590, op=0x294a708, exec_ctx=..., priority=0, profiling=false) at ../src/engine/threaded_engine.cc:311
   #6  0x00007fff53356400 in mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool) (this=0x2946590, fn=..., exec_ctx=..., const_vars=std::vector of length 0, capacity 0, mutable_vars=std::vector of length 1, capacity 1 = {...}, prop=mxnet::FnProperty::kCPUPrioritized, priority=0, opr_name=0x7ffe356374ac "BytePSPushPull", wait=false) at ../src/engine/threaded_engine.cc:343
   #7  0x00007fff53164592 in MXEnginePushAsync (async_func=0x7ffe355663e0 <byteps::mxnet::DoPushPull(void*, void*, void*)>, func_param=0x40f29520, deleter=0x7ffe35565a30 <byteps::mxnet::(anonymous namespace)::DeletePushPullParam(void*)>, ctx_handle=0x7ffe3bff8a40 <byteps::mxnet::(anonymous namespace)::MX_EXEC_CTX>, const_vars_handle=0x0, num_const_vars=0, mutable_vars_handle=0x7fffffffd168, num_mutable_vars=1, prop_handle=0x7ffe35637540 <byteps::mxnet::(anonymous namespace)::MX_FUNC_PROP>, priority=0, opr_name=0x7ffe356374ac "BytePSPushPull", wait=false) at ../src/c_api/c_api.cc:2482
   #8  0x00007ffe3556618d in byteps::mxnet::byteps_mxnet_push_pull_async (tensor=0x409bf190, name=<optimized out>, version=0, priority=0, is_average=<optimized out>) at byteps/mxnet/ops.cc:116
   #9  0x00007ffff65a6dae in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #10 0x00007ffff65a671f in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #11 0x00007ffff67ba5c4 in _ctypes_callproc () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
   #12 0x00007ffff67bac33 in ?? () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
   ```


[GitHub] [incubator-mxnet] eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612531452
 
 
   @xinyu-intel the pip wheel is built using make. Actually, I'm not sure whether any version/commit built with cmake works with BytePS.


[GitHub] [incubator-mxnet] TaoLv commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
TaoLv commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612352091
 
 
   @eric-haibin-lin @ymjiang I still can't see anything related to MKL-DNN operators in the backtrace. I agree that there might be some incompatibility between the MKL-DNN integration and the later BytePS integration.
   BTW, have you tried NaiveEngine? Can we get some clues from there?


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612221267
 
 
   Another data point: mxnet-cu100 (1.6.0b20200212) works, but mxnet-cu100mkl (1.6.0b20200212) doesn't.


[GitHub] [incubator-mxnet] ymjiang commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
ymjiang commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-611904325
 
 
   Yes, `bytepsimage/mxnet` uses the official MXNet, not a nightly version: https://github.com/bytedance/byteps/blob/master/docker/Dockerfile#L46
   
   


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612137404
 
 
   With @leezu's help I built MXNet in debug mode. Here's the stack trace with line numbers:
   ```
   Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
   __GI___pthread_mutex_lock (mutex=0x20) at ../nptl/pthread_mutex_lock.c:65
   65      ../nptl/pthread_mutex_lock.c: No such file or directory.
   #0  __GI___pthread_mutex_lock (mutex=0x20) at ../nptl/pthread_mutex_lock.c:65
   #1  0x00007fff5314d857 in __gthread_mutex_lock (__mutex=0x20) at /usr/include/x86_64-linux-gnu/c++/7/bits/gthr-default.h:748
   #2  0x00007fff5316b1c6 in std::mutex::lock (this=0x20) at /usr/include/c++/7/bits/std_mutex.h:103
   #3  0x00007fff53182d14 in std::lock_guard<std::mutex>::lock_guard (this=0x7fffffffcb30, __m=...) at /usr/include/c++/7/bits/std_mutex.h:162
   #4  0x00007fff53359852 in mxnet::engine::ThreadedVar::AppendWriteDependency (this=0x0, opr_block=0x294c2a8) at ../src/engine/threaded_engine.cc:74
   #5  0x00007fff53355fd2 in mxnet::engine::ThreadedEngine::Push (this=0x2946590, op=0x294a708, exec_ctx=..., priority=0, profiling=false) at ../src/engine/threaded_engine.cc:311
   #6  0x00007fff53356400 in mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool) (this=0x2946590, fn=..., exec_ctx=..., const_vars=std::vector of length 0, capacity 0, mutable_vars=std::vector of length 1, capacity 1 = {...}, prop=mxnet::FnProperty::kCPUPrioritized, priority=0, opr_name=0x7ffe356374ac "BytePSPushPull", wait=false) at ../src/engine/threaded_engine.cc:343
   #7  0x00007fff53164592 in MXEnginePushAsync (async_func=0x7ffe355663e0 <byteps::mxnet::DoPushPull(void*, void*, void*)>, func_param=0x40f29520, deleter=0x7ffe35565a30 <byteps::mxnet::(anonymous namespace)::DeletePushPullParam(void*)>, ctx_handle=0x7ffe3bff8a40 <byteps::mxnet::(anonymous namespace)::MX_EXEC_CTX>, const_vars_handle=0x0, num_const_vars=0, mutable_vars_handle=0x7fffffffd168, num_mutable_vars=1, prop_handle=0x7ffe35637540 <byteps::mxnet::(anonymous namespace)::MX_FUNC_PROP>, priority=0, opr_name=0x7ffe356374ac "BytePSPushPull", wait=false) at ../src/c_api/c_api.cc:2482
   #8  0x00007ffe3556618d in byteps::mxnet::byteps_mxnet_push_pull_async (tensor=0x409bf190, name=<optimized out>, version=0, priority=0, is_average=<optimized out>) at byteps/mxnet/ops.cc:116
   #9  0x00007ffff65a6dae in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #10 0x00007ffff65a671f in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #11 0x00007ffff67ba5c4 in _ctypes_callproc () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
   #12 0x00007ffff67bac33 in ?? () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
   ```


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-616048764
 
 
   @TaoLv @ciyongch @xinyu-intel @PatricZhao can someone look into why the mxnet-cuxx-mkl build leads to the segfault? This is a potential blocker for MXNet 1.7. If possible, we should also patch it in the MXNet 1.6.1 release.


[GitHub] [incubator-mxnet] eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612126769
 
 
   @ymjiang have you seen this error before? `Error disabling address space randomization: Operation not permitted`. It shows up when I use gdb, though, so it might not be related to the original problem.


[GitHub] [incubator-mxnet] eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-616048764
 
 
   > https://repo.mxnet.io/dist/python/cu100/mxnet_cu100-1.6.0-py2.py3-none-manylinux1_x86_64.whl works,
   > https://repo.mxnet.io/dist/python/cu100mkl/mxnet_cu100mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl doesn't.
   > 
   > So, the issue has been there for a while ..
   
   @TaoLv @ciyongch @xinyu-intel @PatricZhao can someone look into why the mxnet-cuxx-mkl build leads to the segfault? This is a potential blocker for MXNet 1.7. If possible, we should also patch it in the MXNet 1.6.1 release.
   
   cc @szha 


[GitHub] [incubator-mxnet] ymjiang commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
ymjiang commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612292050
 
 
   > @ymjiang have you seen this err before? `Error disabling address space randomization: Operation not permitted`. It shows up when I used gdb though. might not be related to the original problem
   > 
   > Update: `Error disabling address space randomization: Operation not permitted` is resolved by `--security-opt seccomp=unconfined`. It does not help the segmentation fault problem
   
   @eric-haibin-lin I was not aware of this before. Is there any version of MXNet-MKL that works? If none of the existing MKL builds work, I suspect some incompatibility between BytePS and MKL.


[GitHub] [incubator-mxnet] leezu commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612163137
 
 
   Based on the backtrace, it appears that some mutex may not be initialized correctly, leading to the segfault?
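To illustrate how a bad object pointer would surface as `mutex=0x20`: in the debug trace, `AppendWriteDependency` runs with `this=0x0` and the lock is taken at address `0x20`, which is what one would expect if the mutex is a member 0x20 bytes into the object. The layout below is hypothetical (the field names and sizes are placeholders, not the real `ThreadedVar` definition); it only demonstrates the base-plus-offset arithmetic using Python's `ctypes`.

```python
import ctypes


class FakeVar(ctypes.Structure):
    # Hypothetical layout: 0x20 bytes of leading members, then the mutex.
    # Sizes are placeholders chosen to match the offset seen in the backtrace.
    _fields_ = [
        ("leading_members", ctypes.c_ubyte * 0x20),
        ("mutex", ctypes.c_ubyte * 40),
    ]


def member_address(base, field):
    """Address that a member access resolves to for an object located at `base`."""
    return base + field.offset


null_this = 0x0
print(hex(member_address(null_this, FakeVar.mutex)))  # 0x20
```

Under this reading, the mutex itself may be fine and the object that owns it is what is missing or invalid.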


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612126769
 
 
   @ymjiang have you seen this error before? `Error disabling address space randomization: Operation not permitted`


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612126604
 
 
   I also tried building both MXNet and BytePS from source with gcc 7.4; the following is the stack trace:
   ```
   BytePS launching worker
   warning: Error disabling address space randomization: Operation not permitted
   --------------------------------------------------------------------------------
           Layer (type)                                Output Shape         Param #
   ================================================================================
                  Input                              (1, 1, 28, 28)               0
           Activation-1   <Symbol hybridsequential0_conv0_relu_fwd>               0
           Activation-2                             (1, 20, 24, 24)               0
               Conv2D-3                             (1, 20, 24, 24)             520
            MaxPool2D-4                             (1, 20, 12, 12)               0
           Activation-5   <Symbol hybridsequential0_conv1_relu_fwd>               0
           Activation-6                               (1, 50, 8, 8)               0
               Conv2D-7                               (1, 50, 8, 8)           25050
            MaxPool2D-8                               (1, 50, 4, 4)               0
              Flatten-9                                    (1, 800)               0
          Activation-10  <Symbol hybridsequential0_dense0_relu_fwd>               0
          Activation-11                                    (1, 512)               0
               Dense-12                                    (1, 512)          410112
               Dense-13                                     (1, 10)            5130
   ================================================================================
   Parameters in forward computation graph, duplicate included
      Total params: 440812
      Trainable params: 440812
      Non-trainable params: 0
   Shared params in forward computation graph: 0
   Unique parameters in model: 440812
   --------------------------------------------------------------------------------
   Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
   __GI___pthread_mutex_lock (mutex=0x20) at ../nptl/pthread_mutex_lock.c:65
   65      ../nptl/pthread_mutex_lock.c: No such file or directory.
   #0  __GI___pthread_mutex_lock (mutex=0x20) at ../nptl/pthread_mutex_lock.c:65
   #1  0x00007f283d52f170 in mxnet::engine::ThreadedVar::AppendWriteDependency(mxnet::engine::OprBlock*) () from /mxnet/python/mxnet/../../build/libmxnet.so
   #2  0x00007f283d52ad3f in mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool) () from /mxnet/python/mxnet/../../build/libmxnet.so
   #3  0x00007f283d527b85 in mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool) () from /mxnet/python/mxnet/../../build/libmxnet.so
   #4  0x00007f283d3dbcb1 in MXEnginePushAsync () from /mxnet/python/mxnet/../../build/libmxnet.so
   #5  0x00007f271d56618d in byteps::mxnet::byteps_mxnet_push_pull_async (tensor=0x45797570, name=<optimized out>, version=0, priority=0, is_average=<optimized out>) at byteps/mxnet/ops.cc:116
   #6  0x00007f28f290edae in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #7  0x00007f28f290e71f in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #8  0x00007f28f2b225c4 in _ctypes_callproc () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
   #9  0x00007f28f2b22c33 in ?? () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
   ```
   Setting `MXNET_MKLDNN_ENABLED=0` did not help.


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612543545
 
 
   > @eric-haibin-lin I am not aware of this before. Is there any version of MXNet-MKL that works? I suspect it might be due to some incompatibility between BytePS and MKL, if none of existing MKL works.
   
   
   @ymjiang that's possible. I have not tested BytePS with mxnet-mkl before.


[GitHub] [incubator-mxnet] TaoLv commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
TaoLv commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612016269
 
 
   @eric-haibin-lin @ymjiang Sorry, I have no idea, because the backtrace doesn't provide useful information. Can you try disabling MKL-DNN with `MXNET_MKLDNN_ENABLED=0` before running the model?
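One way to try this, sketched in Python: set the variable in the environment of the launched process. `MXNET_MKLDNN_ENABLED` and the `bpslaunch` command line come from this thread; the helper name `launch_env` is made up for illustration, and the actual launch is left commented out because it needs the BytePS container.

```python
import os
import subprocess


def launch_env(base=None):
    """Copy an environment and switch MKL-DNN off for the child process."""
    env = dict(os.environ if base is None else base)
    # Set before the process starts so that MXNet sees it when it loads.
    env["MXNET_MKLDNN_ENABLED"] = "0"
    return env


# Training command from the issue description.
CMD = ["bpslaunch", "python3",
       "/usr/local/byteps/example/mxnet/train_gluon_mnist_byteps.py"]

if __name__ == "__main__":
    env = launch_env()
    print(env["MXNET_MKLDNN_ENABLED"])
    # Uncomment inside the BytePS container:
    # subprocess.run(CMD, env=env, check=True)
```

Passing a copy of the environment keeps the change scoped to the one launch instead of the whole shell session.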


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612228518
 
 
   `https://repo.mxnet.io/dist/python/cu100/mxnet_cu100-1.6.0-py2.py3-none-manylinux1_x86_64.whl` works, 
   `https://repo.mxnet.io/dist/python/cu100mkl/mxnet_cu100mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl` doesn't. 
   
   So the issue has been there for a while.


[GitHub] [incubator-mxnet] eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612126769
 
 
   @ymjiang Have you seen this error before? `Error disabling address space randomization: Operation not permitted`. It shows up when I run under gdb, though, so it might not be related to the original problem.
   
   Update: `Error disabling address space randomization: Operation not permitted` is resolved by passing `--security-opt seccomp=unconfined` to `docker run`. It does not fix the segmentation fault, however.
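
As a sketch, the container launch from the original report with the seccomp option added might look like the following. The other flags are copied from the report; the variable name `RUN_FLAGS` is only for illustration.

```shell
# Flags copied from the original reproduction command, plus the seccomp option
RUN_FLAGS="-it --net=host --shm-size=32768m --security-opt seccomp=unconfined"

if command -v nvidia-docker >/dev/null 2>&1; then
  # Word splitting of $RUN_FLAGS is intentional here
  nvidia-docker run $RUN_FLAGS bytepsimage/mxnet bash
else
  echo "nvidia-docker not found; the command would be:"
  echo "nvidia-docker run $RUN_FLAGS bytepsimage/mxnet bash"
fi
```

Running the container with seccomp unconfined relaxes its syscall filtering, so this is best kept to debugging sessions.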


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612543344
 
 
   @TaoLv The communication API uses the `PushAsync` API from the engine, which unfortunately is not supported by the NaiveEngine.


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-611902316
 
 
   BytePS CI only tests stable MXNet releases, not MXNet nightly builds.
   
   I assume bytepsimage/mxnet uses the official MXNet. @ymjiang, please correct me if I'm wrong.


[GitHub] [incubator-mxnet] pengzhao-intel edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
pengzhao-intel edited a comment on issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-611876912
 
 
   Is there any BytePS test case that covers this kind of issue?
   
   BTW, is "bytepsimage/mxnet" the official MXNet? The report says:
   "I am using bytepsimage/mxnet docker image to test bytePS."


[GitHub] [incubator-mxnet] pengzhao-intel commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
pengzhao-intel commented on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-616282218


   @eric-haibin-lin Based on the comments from @TaoLv and @xinyu-intel, we don't see anything related to MKLDNN. I think the BytePS developers should spend some effort locating the cause and come back with a clear analysis.





[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-616241624


   Another thing: even if I build MXNet 1.6.x from source using make as follows, running a simple push-pull test also leads to a segmentation fault:
   ```
   cp make/config.mk .
   echo "USE_BLAS=openblas" >>config.mk
   echo "ADD_CFLAGS += -I/usr/include/openblas" >>config.mk
   echo "USE_CUDA=1" >>config.mk
   echo "USE_CUDA_PATH=/usr/local/cuda" >>config.mk
   echo "USE_CUDNN=0" >>config.mk
   echo "USE_OPENCV=0" >>config.mk
   make -j
   ```
   
   
   I also tried upgrading to gcc 9 (suggested by @leezu), but got an error when including the CUDA headers:
   ```
   /usr/local/cuda/include/crt/host_config.h:129:2: error: #error -- unsupported GNU version! gcc versions later than 7 are not supported!
     129 | #error -- unsupported GNU version! gcc versions later than 7 are not supported!
         |  ^~~~~
   In file included from /usr/local/cuda/include/cuda_runtime.h:83,
                    from <command-line>:
   /usr/local/cuda/include/crt/host_config.h:129:2: error: #error -- unsupported GNU version! gcc versions later than 7 are not supported!
     129 | #error -- unsupported GNU version! gcc versions later than 7 are not supported!
         |  ^~~~~
   
   ```
   
   





[GitHub] [incubator-mxnet] ymjiang commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
ymjiang commented on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-616285882


   We will investigate this from the BytePS side.





[GitHub] [incubator-mxnet] leezu commented on issue #18014: segfault with bytePS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-708008268


   - l





---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org


[GitHub] [incubator-mxnet] leezu commented on issue #18014: enabling mkldnn leads to segfault in bytePS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-616256725


   For CUDA 10.1, gcc 8 is supported.
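
The compiler constraint can be checked before starting a build. The sketch below encodes only the two limits mentioned in this thread: gcc 7 for the toolkit that produced the `host_config.h` error (assumed here to be CUDA 10.0, since the report builds `mxnet-cu100`) and gcc 8 for CUDA 10.1. The function names are hypothetical, and other CUDA versions are deliberately reported as unknown.

```shell
# Maximum gcc major version per CUDA release, taken from this thread only.
# Assumption: the "later than 7" error came from a CUDA 10.0 toolkit.
max_gcc_for_cuda() {
  case "$1" in
    10.0) echo 7 ;;
    10.1) echo 8 ;;
    *)    echo unknown ;;
  esac
}

# Usage: gcc_ok <cuda_version> <gcc_major>; succeeds if the pair is known to be supported
gcc_ok() {
  max=$(max_gcc_for_cuda "$1")
  [ "$max" != unknown ] && [ "$2" -le "$max" ]
}

# Example: check the local compiler against CUDA 10.0, if gcc is installed
if command -v gcc >/dev/null 2>&1; then
  major=$(gcc -dumpversion | cut -d. -f1)
  if gcc_ok 10.0 "$major"; then
    echo "gcc $major is within the CUDA 10.0 limit"
  else
    echo "gcc $major is too new for CUDA 10.0"
  fi
fi
```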





[GitHub] [incubator-mxnet] leezu removed a comment on issue #18014: segfault with bytePS

Posted by GitBox <gi...@apache.org>.
leezu removed a comment on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-708008268


   - l




[GitHub] [incubator-mxnet] ymjiang commented on issue #18014: segfault with bytePS

Posted by GitBox <gi...@apache.org>.
ymjiang commented on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-616331720


   @eric-haibin-lin Could you try this PR? https://github.com/bytedance/byteps/pull/244 





[GitHub] [incubator-mxnet] leezu closed issue #18014: segfault with bytePS

Posted by GitBox <gi...@apache.org>.
leezu closed issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014


   




[GitHub] [incubator-mxnet] leezu commented on issue #18014: segfault with bytePS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-617410457


   @xinyu-intel Could the issue you reported in https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-612375492 be the same as https://github.com/apache/incubator-mxnet/issues/18090?
   
   The BytePS side fixed some bugs that caused a different crash when using the MXNet MKLDNN build, so the only remaining issue now is with ThreadedEngine.





[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #18014: segfault with bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin commented on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-617429835


   > Another thing: even if I build MXNet 1.6.x from source using make as follows (without MKLDNN), running a simple push-pull test also leads to a segmentation fault:
   > 
   > cp make/config.mk .
   > echo "USE_BLAS=openblas" >>config.mk
   > echo "ADD_CFLAGS += -I/usr/include/openblas" >>config.mk
   > echo "USE_CUDA=1" >>config.mk
   > echo "USE_CUDA_PATH=/usr/local/cuda" >>config.mk
   > echo "USE_CUDNN=0" >>config.mk
   > echo "USE_OPENCV=0" >>config.mk
   > make -j
   > 
   
   This PR https://github.com/bytedance/byteps/pull/244 also resolves the segfault issue when MXNet is built from source with MKLDNN=0.





[GitHub] [incubator-mxnet] eric-haibin-lin edited a comment on issue #18014: segfault with bytePS

Posted by GitBox <gi...@apache.org>.
eric-haibin-lin edited a comment on issue #18014:
URL: https://github.com/apache/incubator-mxnet/issues/18014#issuecomment-616241624





