Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/10/16 08:27:19 UTC

[GitHub] [incubator-mxnet] mseth10 opened a new issue #19360: SegFault while testing MXNet binaries for CUDA-11.0 using pytest

mseth10 opened a new issue #19360:
URL: https://github.com/apache/incubator-mxnet/issues/19360


   ## Description
   Nightly CD pipeline fails for CUDA 11.0 during testing of MXNet binaries using `pytest`. All tests run successfully. The error is thrown during cleanup after `pytest` is done running a testing module. This error was first recorded when https://github.com/apache/incubator-mxnet/commit/480d027b85d3feff6fecec70be55eb244ddff289 commit was merged, which dropped `pytest`'s `teardown` function. Before this commit, the CD pipeline was running successfully for all flavors.
   
   This error is specific to CUDA 11.0 and is not observed for CUDA 10.0 or 10.1, as can be seen here:
   https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1848/pipeline/361/
   
   ### Error Message
   ```
    Stack trace:
     /usr/lib64/libcudnn_ops_infer.so.8 (                                           + 0x15cb68f)  [0x7f7f4ce3e68f]
     /usr/lib64/libcudnn_ops_infer.so.8 ( cudnnDestroy                              + 0x6f  )  [0x7f7f4ba78ddf]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( mshadow::Stream<mshadow::gpu>::DestroyDnnHandle()  + 0x2c  )  [0x7f81869a29ec]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)  + 0x13b )  [0x7f81869a2c3b]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)  + 0x1bb )  [0x7f81869b83ab]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)  + 0x36  )  [0x7f81869b86f6]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()  + 0x32  )  [0x7f81869b3db2]
   /work/runtime_functions.sh: line 747:     6 Segmentation fault      (core dumped) pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
   2020-10-16 07:44:31,682 - root - INFO - Waiting for status of container a8b282e29adf for 600 s.
   2020-10-16 07:44:31,853 - root - INFO - Container exit status: {'Error': None, 'StatusCode': 139}
   2020-10-16 07:44:31,854 - root - ERROR - Container exited with an error 😞
   2020-10-16 07:44:31,854 - root - INFO - Executed command for reproduction:
   
   ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu110
   ```
   
   ### Steps to reproduce
   I was able to reproduce the error by following these steps on an AWS Ubuntu 18.04 Deep Learning Base AMI:
   ```
   alias python=python3
   
   git clone --recursive https://github.com/apache/incubator-mxnet.git
   cd incubator-mxnet
   pip3 install -r ci/requirements.txt --user
   
   sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
   sudo chmod +x /usr/local/bin/docker-compose
   sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
   
   python ci/build.py -e BRANCH=null --docker-registry mxnetci --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_static_libmxnet cu110
   
   python ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu110
   ```
   
   ## What have you tried to solve it?
   
   1. The above script takes a long time because it runs a lot of tests. I reduced the reproduction time by cutting the run down to the single failing module. Here's the diff:
   ```
   diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
   index 40405b961..6992caa36 100755
   --- a/ci/docker/runtime_functions.sh
   +++ b/ci/docker/runtime_functions.sh
   @@ -756,7 +756,9 @@ cd_unittest_ubuntu() {
        export DMLC_LOG_STACK_TRACE_DEPTH=10
    
        local mxnet_variant=${1:?"This function requires a mxnet variant as the first argument"}
   +    pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
    
   +    : '
        OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest
        pytest -m 'serial' --durations=50 --verbose tests/python/unittest
    
   @@ -782,6 +784,7 @@ cd_unittest_ubuntu() {
        if [[ ${mxnet_variant} = *mkl ]]; then
            OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -n 4 --durations=50 --verbose tests/python/mkl
        fi
   +    '
    }
   ```
   2. I put a print statement before the `waitall` [call](https://github.com/apache/incubator-mxnet/blob/d0ceecbb3e4f2154a7783cba8f6e152b8c9003b1/conftest.py#L68) to check whether it gets executed, and confirmed that it runs after the module ends, as expected.
   
   ## Environment
   
   ***We recommend using our script for collecting the diagnostic information with the following command***
   `curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3`
   
   <details>
   <summary>Environment Information</summary>
   
   ```
   ----------Python Info----------
   Version      : 3.6.9
   Compiler     : GCC 8.4.0
   Build        : ('default', 'Oct  8 2020 12:12:24')
   Arch         : ('64bit', 'ELF')
   ------------Pip Info-----------
   Version      : 20.2.3
   Directory    : /usr/local/lib/python3.6/dist-packages/pip
   ----------MXNet Info-----------
   No MXNet installed.
   ----------System Info----------
   Platform     : Linux-5.4.0-1028-aws-x86_64-with-Ubuntu-18.04-bionic
   system       : Linux
   node         : ip-172-31-5-167
   release      : 5.4.0-1028-aws
   version      : #29~18.04.1-Ubuntu SMP Tue Oct 6 17:14:23 UTC 2020
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:        x86_64
   CPU op-mode(s):      32-bit, 64-bit
   Byte Order:          Little Endian
   CPU(s):              64
   On-line CPU(s) list: 0-63
   Thread(s) per core:  2
   Core(s) per socket:  16
   Socket(s):           2
   NUMA node(s):        2
   Vendor ID:           GenuineIntel
   CPU family:          6
   Model:               85
   Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
   Stepping:            7
   CPU MHz:             3109.947
   BogoMIPS:            5000.00
   Hypervisor vendor:   KVM
   Virtualization type: full
   L1d cache:           32K
   L1i cache:           32K
   L2 cache:            1024K
   L3 cache:            36608K
   NUMA node0 CPU(s):   0-15,32-47
   NUMA node1 CPU(s):   16-31,48-63
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0088 sec, LOAD: 0.6286 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1193 sec, LOAD: 0.1101 sec.
   Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.055264949798583984 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0012 sec, LOAD: 0.1100 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0014 sec, LOAD: 0.3008 sec.
   Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0010542869567871094 sec.
   ----------Environment----------
   ```
   
   </details>
   @leezu @TristonC




[GitHub] [incubator-mxnet] szha commented on issue #19360: SegFault while testing MXNet binaries for CUDA-11.0 using pytest

szha commented on issue #19360:
URL: https://github.com/apache/incubator-mxnet/issues/19360#issuecomment-712361751


   As a short-term solution it's OK. At some point we may benefit from being able to destroy the engine properly at runtime rather than only at exit; for example, that would enable switching engines at runtime. So it would still be better to have a real solution for the destruction order.
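   
   For illustration, a minimal hypothetical sketch of the deterministic teardown described above (the names `Engine`, `Shutdown`, and the worker layout are illustrative, not MXNet's actual API):
   ```cpp
   #include <atomic>
   #include <thread>
   #include <vector>
   
   // Hypothetical sketch, not MXNet's actual API: an engine whose worker
   // threads can be stopped and joined deterministically via Shutdown(),
   // while the CUDA runtime is still alive, instead of relying on
   // static-destructor order at process exit.
   class Engine {
    public:
     Engine() {
       for (int i = 0; i < 2; ++i) {
         workers_.emplace_back([this] {
           while (!stop_.load(std::memory_order_acquire)) {
             // process work; in MXNet each GPU worker would own cuDNN/CUDA
             // handles that must be destroyed before CUDA deinitializes
           }
           // per-thread GPU handles would be destroyed here, on the way out
         });
       }
     }
   
     // Deterministic teardown, callable at any point -- e.g. right before
     // switching to a different engine implementation at runtime.
     void Shutdown() {
       stop_.store(true, std::memory_order_release);
       for (auto& t : workers_) {
         if (t.joinable()) t.join();
       }
     }
   
     ~Engine() { Shutdown(); }  // exit-time path still works as a fallback
   
    private:
     std::atomic<bool> stop_{false};
     std::vector<std::thread> workers_;
   };
   
   int main() {
     Engine engine;
     engine.Shutdown();  // workers joined here, long before static teardown
   }
   ```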




[GitHub] [incubator-mxnet] ptrendx commented on issue #19360: SegFault while testing MXNet binaries for CUDA-11.0 using pytest

ptrendx commented on issue #19360:
URL: https://github.com/apache/incubator-mxnet/issues/19360#issuecomment-710476208


   Ok, so I think I understand this issue better now - the problem is that the `shared_ptr` to the engine is a static variable here: https://github.com/apache/incubator-mxnet/blob/master/src/engine/engine.cc#L62, so the destruction timing of the engine itself is unspecified (it depends on the order of binaries in the linked executable). This makes it possible for CUDA deinitialization to happen either before or after the destruction of the engine. If it happens after, everything is OK, because as part of its destruction the engine joins the side threads. However, if the CUDA deinit happens first, the side thread doing the cleanup triggers the segfault.
   
   The easiest workaround would be to just skip the cleanup on a side thread - @szha @mseth10 @leezu do you think that would be acceptable? Any other ideas?
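   
   To make the ordering hazard concrete, here is a minimal stand-alone sketch of the pattern (modeled on, not copied from, the `engine.cc` line linked above):
   ```cpp
   #include <memory>
   
   // Minimal sketch of the hazard described above: the shared_ptr holding
   // the engine is a function-local static, so its destructor runs during
   // static teardown at an unspecified point relative to other libraries'
   // static destructors -- including the CUDA runtime's own cleanup.
   struct Engine {
     ~Engine() {
       // Joins the side threads; each side thread destroys its cuDNN handle.
       // This is only safe if the CUDA runtime has not deinitialized yet.
     }
   };
   
   std::shared_ptr<Engine>* CreateSharedRef() {
     // Destruction of `sptr` (and of the Engine, if this is the last
     // reference) happens "sometime" during exit -- before or after CUDA's
     // own teardown, depending on link order.
     static std::shared_ptr<Engine> sptr = std::make_shared<Engine>();
     return &sptr;
   }
   
   int main() {
     Engine* engine = CreateSharedRef()->get();
     (void)engine;  // use the engine; the race only appears at process exit
   }
   ```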




[GitHub] [incubator-mxnet] ptrendx commented on issue #19360: SegFault while testing MXNet binaries for CUDA-11.0 using pytest

ptrendx commented on issue #19360:
URL: https://github.com/apache/incubator-mxnet/issues/19360#issuecomment-710129655


   We recently saw this issue too and I am looking for a fix now. I do not believe it is CUDA 11 specific - rather it is code layout/timing/environment specific; e.g. in our setup we did not see this issue on Ubuntu 18.04 but encountered it on 20.04. The problem is that MXNet does not actually wait for the side thread to finish before program teardown. During the main thread's teardown, CUDA deinitializes itself. If the side thread is still running at this point and tries to destroy its mshadow stream, that calls `cudnnDestroy` on the cuDNN handle, which internally calls `cudaStreamDestroy` on cuDNN's internal CUDA streams (CUDA is statically linked into cuDNN, which is why the segfault appears to come from `libcudnn_ops_infer.so.8`). When this call happens after CUDA deinitialization, the crash occurs.
   
   I started looking at this yesterday - a brief look at the destructors seems to imply that `join` should be called on the side threads, so I am not yet sure why that does not do the right thing. If anyone has more experience with the internals of `ThreadedEnginePerDevice` I would be happy to leave this issue to them, but I will keep poking at it in the meantime.
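   
   A minimal sketch of the race and the join-in-destructor behavior described above (illustrative only, not MXNet's actual `ThreadedEnginePerDevice` code):
   ```cpp
   #include <thread>
   
   // Illustrative sketch, not MXNet code: a worker thread that is still
   // tearing down GPU state after main() returns will call into cuDNN/CUDA
   // after the runtime's own static destructors may have run -- producing
   // the crash inside libcudnn_ops_infer.so.8 shown above.
   struct GpuWorker {
     std::thread t;
     GpuWorker()
         : t([] {
             // ... drain the work queue, then destroy the per-thread cuDNN
             // handle (cudnnDestroy), which internally calls
             // cudaStreamDestroy on cuDNN's internal streams ...
           }) {}
     ~GpuWorker() {
       // Joining here guarantees the handle teardown finished before static
       // destruction proceeds. If the join is skipped -- or the destructor
       // itself runs too late, after CUDA has deinitialized -- the crash
       // window described above opens.
       if (t.joinable()) t.join();
     }
   };
   
   int main() {
     GpuWorker worker;  // joined in ~GpuWorker, before CUDA deinitializes
   }
   ```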




[GitHub] [incubator-mxnet] ptrendx commented on issue #19360: SegFault while testing MXNet binaries for CUDA-11.0 using pytest

ptrendx commented on issue #19360:
URL: https://github.com/apache/incubator-mxnet/issues/19360#issuecomment-712395017


   Ok, then I will open a PR with the workaround, and let's open a separate issue for better handling of the engine's destruction order.




[GitHub] [incubator-mxnet] szha closed issue #19360: SegFault while testing MXNet binaries for CUDA-11.0 using pytest

szha closed issue #19360:
URL: https://github.com/apache/incubator-mxnet/issues/19360


   

