Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/03/21 21:04:22 UTC

[GitHub] [incubator-mxnet] apeforest edited a comment on issue #14485: Any suggestion to accelerate parameter update on PS for distributed training?

URL: https://github.com/apache/incubator-mxnet/issues/14485#issuecomment-475401607
 
 
   @ymjiang There are a few options I suggest you try:
   
   1) Set the env variable `MXNET_KVSTORE_REDUCTION_NTHREADS`. This specifies the number of CPU threads used to perform reduction on your parameter server:
   https://mxnet.incubator.apache.org/versions/master/faq/distributed_training.html#environment-variables
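   For example (the value of 8 threads below is illustrative; the variable must be set in the server process's environment before MXNet is imported):
   ```python
   import os

   # Illustrative: ask the parameter server to use 8 CPU threads for
   # gradient reduction. Must be set before `import mxnet` in the
   # server process (or exported in its shell environment).
   os.environ['MXNET_KVSTORE_REDUCTION_NTHREADS'] = '8'
   ```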
   
   2) If the parameter server is the computation bottleneck, try setting `update_on_kvstore=False` in your Gluon Trainer:
   ```
   trainer = gluon.Trainer(net.collect_params(), optimizer='sgd',
                           optimizer_params={'learning_rate': opt.lr,
                                             'wd': opt.wd,
                                             'momentum': opt.momentum,
                                             'multi_precision': True},
                           kvstore=kv,
                           update_on_kvstore=False)
   ```
   With this, the parameter server only aggregates the gradients, and it is the workers that update the weights locally using the aggregated gradients.
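   Conceptually, the split of work looks like the following pure-Python sketch (toy numbers and hypothetical function names, not the MXNet API): the server only sums and averages gradients, and each worker applies the SGD step itself.
   ```python
   # Hypothetical sketch of update_on_kvstore=False (not MXNet code).

   def server_aggregate(worker_grads):
       # Parameter server: only average the gradients from all workers.
       n = len(worker_grads)
       return [sum(g) / n for g in zip(*worker_grads)]

   def worker_sgd_update(weights, avg_grads, lr=0.1):
       # Each worker: apply the optimizer (plain SGD here) locally.
       return [w - lr * g for w, g in zip(weights, avg_grads)]

   weights = [1.0, 2.0]
   grads_from_workers = [[0.2, 0.4], [0.6, 0.8]]  # two workers
   avg = server_aggregate(grads_from_workers)
   weights = worker_sgd_update(weights, avg)
   ```
   This moves the (potentially expensive) optimizer computation off the server and onto the workers, which is why it helps when the server is the bottleneck.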
