Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/09/01 00:03:28 UTC

[GitHub] [incubator-mxnet] barry-jin opened a new issue #19056: [RFC] Turn Off CuDNN When Training PSPNet

barry-jin opened a new issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056


   ## Error when training PSPNet on Cityscapes dataset using GluonCV #17439
   
   ### Problem Description
   When I train a PSPNet on the Cityscapes dataset using the GluonCV semantic segmentation library, training gets stuck (hangs) right after it starts.
   
   ### Debugging
   After bisecting by date of failure, I found that the first bad commit is [PR 13896](https://github.com/apache/incubator-mxnet/pull/13896), which introduced this problem.
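
For readers who want to repeat this kind of date bisection, here is a minimal sketch. It is not the procedure actually used for this issue. The function name `first_bad_build` and the `is_bad` predicate are hypothetical; `is_bad` stands in for "install the build from that date and check whether training hangs". The sketch assumes the known-good date passes, the known-bad date fails, and that every build after the regression also fails.

```python
from datetime import date, timedelta


def first_bad_build(good, bad, is_bad):
    """Return the earliest date in (good, bad] for which is_bad(date) is True.

    Assumes is_bad(good) is False, is_bad(bad) is True, and that once a
    build is bad all later builds are bad as well.
    """
    lo, hi = good, bad  # invariant: lo is known good, hi is known bad
    while (hi - lo).days > 1:
        mid = lo + timedelta(days=(hi - lo).days // 2)
        if is_bad(mid):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return hi


# Usage with a made-up regression date, purely for illustration.
hypothetical_regression = date(2019, 1, 20)
found = first_bad_build(
    date(2019, 1, 1),
    date(2019, 3, 1),
    lambda d: d >= hypothetical_regression,
)
```

Each call to `is_bad` is expensive in practice (install plus a training run), so the binary search keeps the number of builds tested logarithmic in the size of the date range.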
   
   ## Proposed solutions
   Turn off CuDNN by setting `cudnn_off` to `True` in [Dropout](https://github.com/apache/incubator-mxnet/blob/9b22c8c2e935cd42ff0f7d339a4b790f5b3367b6/python/mxnet/gluon/nn/basic_layers.py#L271)
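
As a user-side illustration of the proposal, the sketch below defines a Dropout block that always passes `cudnn_off=True` to the operator. It assumes the MXNet 1.x Gluon API (`HybridBlock.hybrid_forward`) and the `cudnn_off` argument of `F.Dropout` referenced in the linked line. The names `make_no_cudnn_dropout` and `NoCudnnDropout` are hypothetical and not part of MXNet or GluonCV. The `mxnet` import is deferred to call time so the snippet can be loaded on a machine without MXNet installed.

```python
def make_no_cudnn_dropout(rate, axes=()):
    """Build a Gluon Dropout-like block that bypasses the cuDNN implementation.

    Assumption: MXNet 1.x with the Gluon hybrid_forward API is installed.
    """
    # Deferred import: only needed when the block is actually constructed.
    from mxnet.gluon import nn

    class NoCudnnDropout(nn.HybridBlock):
        def __init__(self, rate, axes=(), **kwargs):
            super().__init__(**kwargs)
            self._rate = rate
            self._axes = axes

        def hybrid_forward(self, F, x):
            if self._rate > 0:
                # cudnn_off=True requests the native dropout kernel.
                return F.Dropout(x, p=self._rate, axes=self._axes,
                                 cudnn_off=True)
            return F.identity(x)

    return NoCudnnDropout(rate, axes)
```

Swapping such a block into a model only works around the hang for that model. It does not change the default behavior of `Dropout` and it does not address the underlying cause.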
   
   ## References
   - [Issue #17439](https://github.com/apache/incubator-mxnet/issues/17439)
   - [PR #13896](https://github.com/apache/incubator-mxnet/pull/13896)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org


[GitHub] [incubator-mxnet] leezu commented on issue #19056: [RFC] Turn Off CuDNN in Dropout When Training PSPNet

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056#issuecomment-684112211


   Why not fix the hang instead of disabling the feature?




[GitHub] [incubator-mxnet] zhreshold commented on issue #19056: [RFC] To Fix the Hang Problem in Training PSPNet

Posted by GitBox <gi...@apache.org>.
zhreshold commented on issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056#issuecomment-684118853


   +1 to @sxjscience. The segmentation model training adopts the [`DataParallel` pipeline](https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L138), but it uses multithreading rather than multiprocessing.




[GitHub] [incubator-mxnet] sxjscience commented on issue #19056: [RFC] Turn Off CuDNN in Dropout When Training PSPNet

Posted by GitBox <gi...@apache.org>.
sxjscience commented on issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056#issuecomment-684112702


   In addition, my guess is that the root cause is related to multiprocessing combined with cuDNN dropout. We will therefore need a minimal reproducible code snippet first.
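
A reproduction script for a hang is easier to share if it reports the hang instead of blocking forever. The following is a minimal sketch of such a watchdog using only the standard library. The name `run_with_watchdog` is hypothetical, and the function passed in stands for whatever candidate reproduction is being tested (for example, one forward and backward pass through a model with Dropout). The sketch only detects the hang: a worker stuck inside native code cannot be interrupted from Python, so the process should exit afterwards.

```python
import threading


def run_with_watchdog(fn, timeout_s):
    """Run fn() in a daemon thread and classify the result.

    Returns "ok" if fn finished, "error" if it raised, and "hang" if it
    was still running after timeout_s seconds.
    """
    outcome = {}

    def runner():
        try:
            fn()
            outcome["status"] = "ok"
        except Exception as exc:  # keep the exception for later inspection
            outcome["status"] = "error"
            outcome["exc"] = exc

    # Daemon thread: a hung worker will not keep the interpreter alive.
    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return "hang"
    return outcome["status"]


# Usage with stand-in workloads instead of a real training step.
blocker = threading.Event()
quick_status = run_with_watchdog(lambda: None, 1.0)
stuck_status = run_with_watchdog(blocker.wait, 0.2)
blocker.set()  # release the stand-in "hung" worker
```

Running the same candidate once from a plain thread and once from a separate process would help distinguish a threading-related hang from a multiprocessing-related one.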




[GitHub] [incubator-mxnet] sxjscience commented on issue #19056: [RFC] Turn Off CuDNN in Dropout When Training PSPNet

Posted by GitBox <gi...@apache.org>.
sxjscience commented on issue #19056:
URL: https://github.com/apache/incubator-mxnet/issues/19056#issuecomment-684112246


   This does not sound like a solution. Problems related to CuDNN Dropout have a very long history, and we should try to
   
   - Fix CuDNN Dropout
   - Consider dropping CuDNN Dropout if we can accelerate our native dropout
   
   In fact, we haven't used CUDA calls like [curand4 (curandStatePhilox4_32_10_t *state)](https://docs.nvidia.com/cuda/curand/device-api-overview.html) when implementing the random operators.

