Posted to dev@mxnet.apache.org by barry-jin <no...@github.com> on 2020/09/01 00:03:52 UTC
[apache/incubator-mxnet] [RFC] Turn Off CuDNN When Training PSPNet (#19056)
## Error when training PSPNet on Cityscapes dataset using GluonCV #17439
### Problem Description
The problem is that when I train a PSPNet with the GluonCV semantic segmentation library on the Cityscapes dataset, training hangs (gets stuck) right after it starts.
### Debugging
After bisecting the failure by date, I found that the first bad commit came from [PR 13896](https://github.com/apache/incubator-mxnet/pull/13896), which introduced this problem.
## Proposed solutions
Turn off CuDNN by setting `cudnn_off` to `True` in [Dropout](https://github.com/apache/incubator-mxnet/blob/9b22c8c2e935cd42ff0f7d339a4b790f5b3367b6/python/mxnet/gluon/nn/basic_layers.py#L271)
## References
- [Issue #17439](https://github.com/apache/incubator-mxnet/issues/17439)
- [PR #13896](https://github.com/apache/incubator-mxnet/pull/13896)
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/19056
Re: [apache/incubator-mxnet] [RFC] To Fix the Hang Problem in Training PSPNet (#19056)
Posted by "Joshua Z. Zhang" <no...@github.com>.
+1 to @sxjscience. The segmentation model training adopts the [`DataParallel` pipeline](https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L138), but it uses multithreading rather than multiprocessing.
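For readers unfamiliar with that pipeline, the thread-based pattern can be sketched in plain Python; this is loosely modeled on gluon-cv's `DataParallel` and the names (`parallel_apply`, `worker`) are illustrative only:

```python
# Minimal sketch of thread-based data parallelism: one thread per input
# shard (per GPU in gluon-cv), all sharing the module in one process.
import threading

def parallel_apply(module, inputs):
    """Run `module` on each shard in its own thread and collect results."""
    results = [None] * len(inputs)

    def worker(i, x):
        results[i] = module(x)  # in gluon-cv this runs on a per-GPU context

    threads = [threading.Thread(target=worker, args=(i, x))
               for i, x in enumerate(inputs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Usage: split a "batch" into shards and apply a toy module.
shards = [[1, 2], [3, 4]]
outputs = parallel_apply(lambda xs: [v * 2 for v in xs], shards)
```

Because everything runs in one process, any per-process global state in an engine or cuDNN handle is shared across these threads, which is the relevant difference from a multiprocessing design.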
Re: [apache/incubator-mxnet] [RFC] Turn Off CuDNN in Dropout When Training PSPNet (#19056)
Posted by Xingjian Shi <no...@github.com>.
In addition, my guess is that the root cause is related to multiprocessing + cuDNN dropout. Thus, we will need a minimal reproducible code snippet first.
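A minimal repro along those lines might put the suspect forward pass in a child process; the sketch below is only a scaffold (the MXNet calls are left as comments so the scaffold itself runs anywhere, and `run_dropout` is an illustrative name):

```python
# Scaffold for a minimal repro: run the suspect op in a child process
# and detect a hang via a queue timeout.
import multiprocessing as mp

def run_dropout(q):
    # The real repro would run the suspect op here, e.g. (MXNet 1.x):
    #   import mxnet as mx
    #   x = mx.nd.ones((8, 8), ctx=mx.gpu(0))
    #   with mx.autograd.record():
    #       y = mx.nd.Dropout(x, p=0.5)  # cuDNN path suspected to hang
    #   y.wait_to_read()
    q.put("done")

q = mp.Queue()
p = mp.Process(target=run_dropout, args=(q,))
p.start()
result = q.get(timeout=30)  # a hang here would reproduce the bug
p.join()
```

When CUDA is actually involved, the child should be created with the `spawn` start method (guarded by `if __name__ == "__main__":`), since forking a process that has already initialized CUDA is unsafe.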
Re: [apache/incubator-mxnet] [RFC] Turn Off CuDNN in Dropout When Training PSPNet (#19056)
Posted by Xingjian Shi <no...@github.com>.
This does not sound like a solution. Problems related to cuDNN Dropout have a very long history, and we should try to
- Fix cuDNN dropout
- Consider dropping cuDNN Dropout if we can accelerate our native dropout

In fact, we haven't used CUDA device-API calls like [`curand4(curandStatePhilox4_32_10_t *state)`](https://docs.nvidia.com/cuda/curand/device-api-overview.html) when implementing the random operators.
Re: [apache/incubator-mxnet] [RFC] Turn Off CuDNN in Dropout When Training PSPNet (#19056)
Posted by Leonard Lausen <no...@github.com>.
Why not fix the hang instead of disabling the feature?