Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/09/01 00:03:28 UTC
[GitHub] [incubator-mxnet] barry-jin opened a new issue #19056: [RFC] Turn Off CuDNN When Training PSPNet
barry-jin opened a new issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056
## Error when training PSPNet on Cityscapes dataset using GluonCV #17439
### Problem Description
The problem is that when I train a PSPNet on the Cityscapes dataset using the GluonCV semantic segmentation library, training gets stuck (hangs) right after it starts.
### Debugging
After bisecting by date of failure, I found that the first bad commit is [PR 13896](https://github.com/apache/incubator-mxnet/pull/13896), which introduced this problem.
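For anyone repeating the bisection, a hang is awkward to test for because the process never exits on its own. The sketch below is not the script used for this issue; it is a minimal, assumed helper (the name `hangs` and the time limit are hypothetical) that runs a command under a timeout and reports whether it was still running when the limit expired.

```python
import subprocess
import sys


def hangs(cmd, timeout_s):
    """Return True if `cmd` is still running after `timeout_s` seconds.

    A command that exits in time counts as "not hanging", even if it
    exits with an error; only the timeout is treated as a hang.
    """
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return True
    return False


# Stand-in for a training command: a child process that sleeps far longer
# than the limit, so the helper should report a hang.
stuck = hangs([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)
print("hang detected" if stuck else "finished in time")
```

A wrapper that exits with 1 when `hangs(...)` is true and 0 otherwise fits `git bisect run`, which treats exit code 0 as good and 1 as bad. The time limit has to be well above the normal time to the first training iteration, otherwise slow startups are misreported as hangs.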
## Proposed solutions
Turn off CuDNN by setting `cudnn_off` to `True` in [Dropout](https://github.com/apache/incubator-mxnet/blob/9b22c8c2e935cd42ff0f7d339a4b790f5b3367b6/python/mxnet/gluon/nn/basic_layers.py#L271)
## References
- list reference and related literature
[Issue #17439](https://github.com/apache/incubator-mxnet/issues/17439), [PR #13896](https://github.com/apache/incubator-mxnet/pull/13896)
- list known implementations
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org
[GitHub] [incubator-mxnet] leezu commented on issue #19056: [RFC] Turn Off CuDNN in Dropout When Training PSPNet
Posted by GitBox <gi...@apache.org>.
leezu commented on issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056#issuecomment-684112211
Why not fix the hang instead of disabling the feature?
[GitHub] [incubator-mxnet] zhreshold commented on issue #19056: [RFC] To Fix the Hang Problem in Training PSPNet
Posted by GitBox <gi...@apache.org>.
zhreshold commented on issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056#issuecomment-684118853
+1 to @sxjscience. The segmentation model training adopts the `DataParallel` pipeline (https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L138), but it uses multithreading rather than multiprocessing (mp).
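To make the distinction concrete, here is a rough sketch of a thread-based data-parallel pattern. It is not the GluonCV implementation; `parallel_apply` and the shard layout are hypothetical names for illustration, and `sum` stands in for a per-device forward pass. The point is only that every worker runs inside the same process, so per-process state is shared between them, unlike with multiprocessing.

```python
import os
import threading


def parallel_apply(fn, shards):
    """Run fn(shard) for every shard, one Python thread per shard."""
    results = [None] * len(shards)
    errors = [None] * len(shards)

    def worker(i, shard):
        try:
            # Record the process ID alongside the result to show that
            # all workers live in the same process.
            results[i] = (os.getpid(), fn(shard))
        except Exception as exc:  # hand the failure back to the caller
            errors[i] = exc

    threads = [threading.Thread(target=worker, args=(i, s))
               for i, s in enumerate(shards)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for exc in errors:
        if exc is not None:
            raise exc
    return results


# Split one batch into two shards, as if there were two devices.
batch = list(range(8))
shards = [batch[0:4], batch[4:8]]
out = parallel_apply(sum, shards)
print(out)
```

Each entry in `out` carries the same process ID as the caller. A reproduction script for the hang would therefore need to exercise concurrent threads in one process rather than separate worker processes.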
[GitHub] [incubator-mxnet] sxjscience commented on issue #19056: [RFC] Turn Off CuDNN in Dropout When Training PSPNet
Posted by GitBox <gi...@apache.org>.
sxjscience commented on issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056#issuecomment-684112702
In addition, my guess is that the root cause is related to multiprocessing + cuDNN dropout. Thus, we will need a minimal reproducible code snippet first.
[GitHub] [incubator-mxnet] sxjscience commented on issue #19056: [RFC] Turn Off CuDNN in Dropout When Training PSPNet
Posted by GitBox <gi...@apache.org>.
sxjscience commented on issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056#issuecomment-684112246
This does not sound like a solution. Problems related to cuDNN Dropout have a very long history, and we should try to:
- Fix cuDNN dropout
- Consider dropping cuDNN Dropout if we can accelerate our native dropout
In fact, we haven't used CUDA calls like [curand4 (curandStatePhilox4_32_10_t *state)](https://docs.nvidia.com/cuda/curand/device-api-overview.html) when implementing the random operators.