Posted to dev@mxnet.apache.org by barry-jin <no...@github.com> on 2020/09/01 00:03:52 UTC

[apache/incubator-mxnet] [RFC] Turn Off CuDNN When Training PSPNet (#19056)

## Error when training PSPNet on Cityscapes dataset using GluonCV #17439

### Problem Description
When I train a PSPNet with the GluonCV semantic segmentation library on the Cityscapes dataset, the training hangs right after it starts.

### Debugging
After bisecting by build date, I found that the first bad commit comes from [PR 13896](https://github.com/apache/incubator-mxnet/pull/13896), which introduced this problem.

## Proposed solutions
Turn off cuDNN by setting `cudnn_off` to `True` in the Gluon [Dropout](https://github.com/apache/incubator-mxnet/blob/9b22c8c2e935cd42ff0f7d339a4b790f5b3367b6/python/mxnet/gluon/nn/basic_layers.py#L271) layer.
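
A minimal sketch of what this could look like as a user-level workaround; `DropoutNoCuDNN` is a hypothetical block name, and the actual change would presumably flip the flag inside `nn.Dropout` itself:

```python
import mxnet as mx
from mxnet.gluon import nn

class DropoutNoCuDNN(nn.HybridBlock):
    """Dropout block that forces MXNet's native kernel by disabling cuDNN."""
    def __init__(self, rate, axes=(), **kwargs):
        super(DropoutNoCuDNN, self).__init__(**kwargs)
        self._rate = rate
        self._axes = axes

    def hybrid_forward(self, F, x):
        # cudnn_off=True selects the native dropout implementation
        # instead of the cuDNN path added in PR 13896.
        return F.Dropout(x, p=self._rate, axes=self._axes, cudnn_off=True)
```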

## References
- [Issue #17439](https://github.com/apache/incubator-mxnet/issues/17439)
- [PR #13896](https://github.com/apache/incubator-mxnet/pull/13896)



Re: [apache/incubator-mxnet] [RFC] To Fix the Hang Problem in Training PSPNet (#19056)

Posted by "Joshua Z. Zhang" <no...@github.com>.
+1 to @sxjscience. The segmentation model training adopts the `DataParallel` pipeline (https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L138), but it uses multithreading instead of multiprocessing.


Re: [apache/incubator-mxnet] [RFC] Turn Off CuDNN in Dropout When Training PSPNet (#19056)

Posted by Xingjian Shi <no...@github.com>.
In addition, my guess is that the root cause is related to multiprocessing + cuDNN dropout. Thus, we will need a minimal reproducible code snippet first.
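
A hypothetical sketch (not from the issue) of what such a snippet could look like, assuming the hang stems from dropout being driven concurrently from several Python threads, one per GPU, as GluonCV's `DataParallel` does; whether it actually reproduces the hang would need to be verified:

```python
import threading
import mxnet as mx
from mxnet.gluon import nn

net = nn.Dropout(0.5)  # cuDNN-backed dropout when run on GPU

def worker(ctx):
    # Repeated forward/backward passes through dropout on one device.
    x = mx.nd.ones((8, 512, 64, 64), ctx=ctx)
    x.attach_grad()
    for _ in range(100):
        with mx.autograd.record():
            y = net(x)
        y.backward()
    mx.nd.waitall()

# One thread per GPU (assumes at least two GPUs are available).
threads = [threading.Thread(target=worker, args=(mx.gpu(i),)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```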


Re: [apache/incubator-mxnet] [RFC] Turn Off CuDNN in Dropout When Training PSPNet (#19056)

Posted by Xingjian Shi <no...@github.com>.
This does not sound like a solution. Problems related to cuDNN Dropout have a very long history, and we should try to

- Fix cuDNN dropout
- Consider dropping cuDNN Dropout if we can accelerate our native dropout

In fact, when implementing the random operators we haven't used CUDA device API calls like [curand4(curandStatePhilox4_32_10_t *state)](https://docs.nvidia.com/cuda/curand/device-api-overview.html), which generate four random values per call.


Re: [apache/incubator-mxnet] [RFC] Turn Off CuDNN in Dropout When Training PSPNet (#19056)

Posted by Leonard Lausen <no...@github.com>.
Why not fix the hang instead of disabling the feature?
