Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/19 14:58:27 UTC

[GitHub] [incubator-mxnet] QueensGambit commented on issue #15529: MXNET_CUDNN_AUTOTUNE_DEFAULT problems

URL: https://github.com/apache/incubator-mxnet/issues/15529#issuecomment-513262030
 
 
   Hello @intgogo,
   I experienced a similar problem while training a CNN model with multiple output heads.
   Are you using the Gluon or the symbolic MXNet Python interface for training your model?
   
   For me, the problem appeared more often when I increased the number of layers of the same model architecture.
   In my case the cuDNN autotuning was triggered before every pass through:
   ```
   for _, (data, value_label, policy_label) in enumerate(train_data):
   ```
   whenever I changed the content of my `train_data` iterator.
   
   For some models this autotuning took roughly 15 s, causing a long delay during training, but disabling the cuDNN autotuning resulted in an even longer overall training time.
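   For reference, this is roughly how I switched the autotuning on and off for that comparison (the 0/1/2 semantics are taken from the MXNet environment-variable docs; the variable has to be set before `mxnet` is imported):
   ```python
   # Minimal sketch: toggling the cuDNN autotuning via the environment variable
   # from the issue title. It must be set before mxnet is imported.
   import os

   # 0 = no autotuning, 1 = best algo within a limited workspace (default),
   # 2 = fastest algo even if it needs more workspace
   os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

   import mxnet as mx
   ```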
   I ended up porting the model architecture and training code to MXNet's symbol API, which took me ~2 days of work, but I haven't encountered this problem since.
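   For comparison, a minimal sketch of what that symbolic Module workflow looks like (the tiny network and the random data are just placeholders, not my actual architecture):
   ```python
   # Rough sketch of the symbolic Module workflow; network and data are placeholders.
   import mxnet as mx
   import numpy as np

   x = np.random.rand(1000, 32).astype('float32')
   y = np.random.randint(0, 10, size=(1000,)).astype('float32')
   train_iter = mx.io.NDArrayIter(x, y, batch_size=64, shuffle=True)

   data = mx.sym.Variable('data')
   fc = mx.sym.FullyConnected(data, num_hidden=128)
   net = mx.sym.SoftmaxOutput(fc, name='softmax')

   # The Module binds the executor once, so the cuDNN algorithm search runs at
   # bind/fit time instead of before every pass over the data.
   mod = mx.mod.Module(symbol=net, context=mx.gpu(),
                       data_names=['data'], label_names=['softmax_label'])
   mod.fit(train_iter, optimizer='adam', num_epoch=10)
   ```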
   
   I investigated the source code a bit and it could be that the optimization isn't cached properly for all models.
   [cudnn_algoreg-inl.h](https://github.com/apache/incubator-mxnet/blob/30d479feae11b04d423f2ec055d124fb6d2d5730/src/operator/nn/cudnn/cudnn_algoreg-inl.h#L74)
   
   https://github.com/apache/incubator-mxnet/blob/3f3ba92ae1468d08de088d2291ca14e2d5dc5515/src/operator/nn/cudnn/cudnn_deconvolution-inl.h#L818
   ```C++
       // An algo specification by the user may be cached here, but another
       // convolution will match only if identically specified.
       // We're caching results of *Get* as well as *Find*, but these records
       // will be held distinctly because param_.cudnn_tune is part of the key
       CuDNNDeconvAlgoReg::Get()->FindOrElseRegister(...)
   ```
   It might be useful to additionally provide users with an explicit call for this operation, and to allow saving its result within the model or in a separate file.
   Currently, if you load the model for inference, it has to recompute the cuDNN autotuning at application start-up.
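   To make that suggestion a bit more concrete, something along these lines (purely hypothetical names, nothing like this exists in the current API):
   ```python
   # Purely hypothetical sketch of such an explicit API -- none of these
   # methods exist in MXNet today, the names are only illustrations.
   net = load_model('model.json', 'model.params')   # hypothetical loader
   net.tune_cudnn(sample_batch)                     # run the algorithm search once, explicitly
   net.save_cudnn_tuning('model.cudnn')             # persist the chosen algorithms

   # At inference start-up, the cached choices could then be restored instead
   # of re-running the search:
   net.load_cudnn_tuning('model.cudnn')             # hypothetical
   ```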
