Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/02/16 02:40:40 UTC

[GitHub] DickJC123 edited a comment on issue #14006: Dual stream cudnn Convolution backward() with MXNET_GPU_WORKER_NSTREAMS=2.

URL: https://github.com/apache/incubator-mxnet/pull/14006#issuecomment-464276100
 
 
   I've rerun some perf analysis of this PR, which, as a reminder, changes nothing in the default configuration.  However, when I set MXNET_GPU_WORKER_NSTREAMS=2, I see higher performance at all batch sizes.  The perf gains I measured on a run of Resnet50 v1b across 8 Volta GPUs (also with horovod and DALI in NVIDIA's MXNet container) were:
   ```
   batchsize  32: 6.0% speedup
   batchsize  64: 0.8% speedup
   batchsize 128: 1.6% speedup
   batchsize 256: 0.4% speedup
   ```
   The primary application area of this PR is scale-out training across multiple nodes, where a too-large global batch size can hurt final accuracy (thus driving the per-GPU batch size down).  The RN50 global memory increase ranged from 1.4% (batch size 32) to 2.6% (batch size 256).
   
   This work is no longer "in progress."  Requesting final review, thanks. @szha @marcoabreu 
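
   For reference, a minimal sketch of how the feature described above would be enabled.  The MXNET_GPU_WORKER_NSTREAMS variable is the one introduced by this PR; the training invocation shown in the comment is a hypothetical placeholder, not a script from this repository:

   ```shell
   # Opt in to the dual-stream cudnn Convolution backward() path.
   # The default (unset, or =1) leaves existing behavior unchanged.
   export MXNET_GPU_WORKER_NSTREAMS=2

   # Hypothetical training invocation -- substitute your own script/flags:
   # python train_imagenet.py --network resnet50_v1b --batch-size 32
   echo "MXNET_GPU_WORKER_NSTREAMS=$MXNET_GPU_WORKER_NSTREAMS"
   ```

   Since the setting is read from the environment, it can be toggled per run to compare throughput and memory use without rebuilding MXNet.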

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services