Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/01/07 16:03:11 UTC

[GitHub] [incubator-mxnet] chandana1332 opened a new issue #17237: Data imbalance handling in MXNet Gluon

URL: https://github.com/apache/incubator-mxnet/issues/17237
 
 
   Hello,
   
   My question regarding data imbalance handling in Gluon is as follows:
   
   Suppose I'm training with 4 GPUs. For an update, my training loop samples 4 batches (one for each GPU) and runs fwd/bkwd on them. Using a Gluon Trainer, I can reduce and update gradients on all 4 GPUs. 
   
   Now, towards the end of an epoch, I have only 2 batches left to process. I sample those 2 batches, send them off to the first two GPUs, and run fwd/bkwd. At this point, 2 GPUs have non-zero gradients. If I do a Trainer.step(), how does it reduce gradients on all GPUs? (A minimal sketch of the loop I mean follows the two questions below.)
   
   1. Do the GPUs that didn't process a batch contribute zero gradients during the reduce operation, so that all GPUs participate in the reduction?
   2. Do only the GPUs that have non-zero gradients send their gradients to a server for reduction, with the reduced gradient then broadcast to all GPUs?
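   For concreteness, a minimal sketch of the loop I mean (the network, loss, and data below are placeholders, not our real model):
   
      import mxnet as mx
      from mxnet import autograd, gluon
      
      ctxs = [mx.gpu(i) for i in range(4)]
      
      # Placeholder model and loss, for illustration only.
      net = gluon.nn.Dense(10)
      net.initialize(ctx=ctxs)
      loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
      trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
      
      def train_step(batches):
          # Near the end of an epoch, len(batches) may be 2 while len(ctxs) is 4.
          losses = []
          with autograd.record():
              for (data, label), ctx in zip(batches, ctxs):
                  out = net(data.as_in_context(ctx))
                  losses.append(loss_fn(out, label.as_in_context(ctx)))
          for loss in losses:
              loss.backward()
          # Only the GPUs that received a batch ran backward here.
          trainer.step(sum(data.shape[0] for data, _ in batches))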
   
   


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #17237: Data imbalance handling in MXNet Gluon

URL: https://github.com/apache/incubator-mxnet/issues/17237#issuecomment-571816894
 
 
   I recently updated the split sampler in GluonNLP, such that the number of samples for each worker will always be the same (with `even_size=True`): https://gluon-nlp.mxnet.io/master/api/modules/data.html?highlight=splitsampler#gluonnlp.data.SplitSampler
   
   This somewhat avoids the imbalanced-batch problem. If it is useful, I can upstream the sampler to MXNet, too.
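   For reference, a minimal usage sketch (the dataset length of 1000 is made up, and I'm assuming the signature shown in the linked docs):
   
      from gluonnlp.data import SplitSampler
      
      # Partition 1000 sample indices across 4 workers. With even_size=True
      # every part gets the same number of indices (short parts are padded,
      # I believe by repeating indices), so no worker runs out of batches
      # before the others.
      samplers = [SplitSampler(1000, num_parts=4, part_index=i, even_size=True)
                  for i in range(4)]
      print([len(s) for s in samplers])  # all four parts have equal length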
   


[GitHub] [incubator-mxnet] apeforest commented on issue #17237: Data imbalance handling in MXNet Gluon

URL: https://github.com/apache/incubator-mxnet/issues/17237#issuecomment-571712174
 
 
   One way is to let the Gluon DataLoader handle the last batch with `last_batch='discard'` or `last_batch='rollover'`, so that every GPU processes the same number of samples.
   
   https://mxnet.apache.org/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.DataLoader
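   A quick sketch of those options (the toy dataset is just for illustration):
   
      import mxnet as mx
      from mxnet import gluon
      
      # Toy dataset of 10 samples; with batch_size=4 the last batch has only 2.
      dataset = gluon.data.ArrayDataset(mx.nd.arange(10))
      
      # last_batch='discard' drops the incomplete batch;
      # last_batch='rollover' carries the leftover samples to the next epoch.
      loader = gluon.data.DataLoader(dataset, batch_size=4, last_batch='rollover')
      for batch in loader:
          print(batch.shape)  # (4,) twice; the 2 leftover samples roll over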
   
   


[GitHub] [incubator-mxnet] chandana1332 commented on issue #17237: Data imbalance handling in MXNet Gluon

URL: https://github.com/apache/incubator-mxnet/issues/17237#issuecomment-571651107
 
 
   @eric-haibin-lin @apeforest 


[GitHub] [incubator-mxnet] chandana1332 commented on issue #17237: Data imbalance handling in MXNet Gluon

URL: https://github.com/apache/incubator-mxnet/issues/17237#issuecomment-571719098
 
 
   That only works when the number of remaining samples is less than batch_size, but I'm talking about the case where the number of batches being sampled is less than the number of GPUs.
   Hence, the scenario I'm describing is outside of the data loader.
   
   Also, we don't have an issue handling the data imbalance itself; I'm trying to understand the internals of how MXNet deals with it.
   
   Today, we sample batches, and if the number of sampled batches is less than the number of GPUs, we simply process the batches on those GPUs and do a trainer.step(), which reduces the gradients correctly and updates the parameters. I would like to understand how MXNet handles this internally in the parameter-server (PS) architecture.


[GitHub] [incubator-mxnet] eric-haibin-lin commented on issue #17237: Data imbalance handling in MXNet Gluon

URL: https://github.com/apache/incubator-mxnet/issues/17237#issuecomment-571816022
 
 
   Hi @chandana1332 
   
   Thanks for posting the question here. 
   
   If your current mini-batch is small and GPUs 3 & 4 do not get even one sample, the gradients on GPUs 3 & 4 will remain what they were from the previous iteration. In this case, the all-reduced gradient will therefore be based on fresh gradients from GPUs 1 & 2 and stale gradients from GPUs 3 & 4.
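   If you want the idle GPUs to contribute zeros instead of stale values, one workaround is to clear their gradient buffers before calling step. A sketch (here `net` is your Gluon block, and `idle_ctxs` and `num_samples` are hypothetical bookkeeping for which contexts received no batch and how many samples were processed):
   
      # Zero the gradient buffers on the GPUs that got no data this step,
      # so the reduction averages in zeros rather than stale gradients.
      for param in net.collect_params().values():
          if param.grad_req != 'null':
              for ctx in idle_ctxs:
                  param.grad(ctx)[:] = 0
      trainer.step(num_samples)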
   
   


[GitHub] [incubator-mxnet] szha closed issue #17237: Data imbalance handling in MXNet Gluon

URL: https://github.com/apache/incubator-mxnet/issues/17237


   

