Posted to issues@mxnet.apache.org by "Kellen Sunderland (JIRA)" <ji...@apache.org> on 2018/03/30 17:37:00 UTC

[jira] [Created] (MXNET-255) Deadlock during ThreadedEnginePerDevice destructor after CuDNNConvolutionOp::SelectAlgo called

Kellen Sunderland created MXNET-255:
---------------------------------------

             Summary: Deadlock during ThreadedEnginePerDevice destructor after CuDNNConvolutionOp<float>::SelectAlgo called
                 Key: MXNET-255
                 URL: https://issues.apache.org/jira/browse/MXNET-255
             Project: Apache MXNet
          Issue Type: Bug
            Reporter: Kellen Sunderland


Also described here: https://github.com/apache/incubator-mxnet/issues/10341

I haven't been able to fully track this down, but this is what I've seen so far. CI is frequently deadlocking after the introduction of the test test_operator_gpu:test_op_output_names_monitor in PR [#10300|https://github.com/apache/incubator-mxnet/pull/10300].

Removing the lines:
conv_sym = mx.sym.Convolution(data, kernel=(2, 2), num_filter=1, name='conv')
check_name(conv_sym, ['conv_output'])
here: [https://github.com/apache/incubator-mxnet/pull/10300/files#diff-cb652780258e73a9cd08568f38929aa2R5417] will prevent the deadlocks.
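For context, a rough sketch of what the check_name helper presumably does (the real helper is added in the PR's test code; the body below is a reconstruction for illustration only, and the GPU context and data shape are my own choices): bind the symbol, install a monitor callback that records the name reported for each output, run one forward pass, and compare the recorded names against the expected ones.
{code:python}
import mxnet as mx

def check_name(op_sym, expected_names):
    # Presumed behaviour: collect the output names that the executor's
    # monitor callback reports during a forward pass.
    recorded = []

    def record_name(name, arr):
        # The raw monitor callback hands the name over as bytes.
        recorded.append(name.decode('utf-8') if isinstance(name, bytes) else str(name))

    exe = op_sym.simple_bind(ctx=mx.gpu(0), data=(1, 3, 10, 10), grad_req='null')
    exe.set_monitor_callback(record_name)
    exe.forward(is_train=False)
    mx.nd.waitall()
    assert recorded == expected_names, (recorded, expected_names)

data = mx.sym.Variable('data')
conv_sym = mx.sym.Convolution(data, kernel=(2, 2), num_filter=1, name='conv')
check_name(conv_sym, ['conv_output'])
{code}
Running the Convolution forward on the GPU with the monitor callback installed is what reaches CuDNNConvolutionOp<float>::SelectAlgo, which is where the summary places the deadlock.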

It appears the test has exposed the deadlock, but the deadlock was already in the codebase prior to this PR. I believe what is happening is that calling inference on the Conv operator in this way creates two zombie threads that are deadlocked on each other. The test continues to run (in fact the full test suite runs) until the ThreadedEnginePerDevice destructor is called; at that point the ThreadPool is torn down, it attempts to join all threads, and the process hangs.

It looks like Thread 20 is blocked waiting for this lock: [https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/cudnn/cudnn_convolution-inl.h#L718] -> [https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine.cc#L385]

Thread 16 is blocked here: [https://github.com/apache/incubator-mxnet/blob/master/src/operator/custom/custom-inl.h#L121]
h2. Environment info (Required)

CI
h2. Minimal steps to reproduce

Reproducible locally on a GPU machine with:
ci/build.py --build --platform ubuntu_build_cuda /work/runtime_functions.sh build_ubuntu_gpu_cuda91_cudnn7
nvidia-docker run -v /home/ubuntu/incubator-mxnet:/work/mxnet -v /home/ubuntu/incubator-mxnet/build:/work/build -u 1000:1000 -ti mxnet/build.ubuntu_gpu bash
Then in the container:
export PYTHONPATH=/work/mxnet/python
export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
cd tests/python/gpu/
nosetests-3.4 -v test_operator_gpu:test_op_output_names_monitor
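
If it is easier than going through nose, the same call sequence can presumably be exercised with a short standalone script (untested as a reproducer and shown only to illustrate the sequence; the shapes and the no-op callback are my own choices):
{code:python}
import mxnet as mx

# Small GPU Convolution so the cuDNN convolution path, including
# CuDNNConvolutionOp<float>::SelectAlgo, is exercised on the forward pass,
# with a monitor callback installed via Executor.set_monitor_callback
# (assumed to match the path the test exercises).
data = mx.sym.Variable('data')
conv_sym = mx.sym.Convolution(data, kernel=(2, 2), num_filter=1, name='conv')

exe = conv_sym.simple_bind(mx.gpu(0), data=(1, 3, 10, 10), grad_req='null')
exe.set_monitor_callback(lambda name, arr: None)  # no-op monitor callback

exe.forward(is_train=False)
mx.nd.waitall()
print('forward finished; when the deadlock occurs, the hang appears at '
      'interpreter shutdown while ThreadedEnginePerDevice joins its ThreadPool')
{code}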

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org