Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/11/22 02:12:52 UTC

[GitHub] rahul003 commented on a change in pull request #8762: Gradient compression faq

rahul003 commented on a change in pull request #8762: Gradient compression faq
URL: https://github.com/apache/incubator-mxnet/pull/8762#discussion_r152453417
 
 

 ##########
 File path: docs/faq/gradient_compression.md
 ##########
 @@ -0,0 +1,95 @@
+# Gradient Compression
+
+Gradient compression reduces the communication bandwidth needed for distributed training, making training with GPUs more scalable and efficient without a significant loss in convergence rate or accuracy.
+
+
+## Benefits
+
+**Increased Speed**
+
+For tasks like acoustic modeling in speech recognition (as in Alexa), gradient compression has been observed to speed up training by about two times, depending on the size of the model and the network bandwidth of the instance. Larger models see a greater speedup from gradient compression.
+
+**Minimal Accuracy Loss**
+
+Gradient compression delays the synchronization of weight updates that are small. Although small weight updates might not be sent for a given batch, this information is not discarded: the updates accumulate locally, and once the accumulated value for a location grows large enough, it is propagated. Because updates are delayed rather than lost, gradient compression does not cause a significant drop in accuracy or convergence rate. In distributed training experiments[1], a loss of accuracy as low as 1% has been observed with this technique.
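+
+As an illustration of this delayed-update idea (a minimal sketch only, not MXNet's internal implementation; the function name and threshold value here are hypothetical):
+
+```python
+import numpy as np
+
+def compress_with_residual(grad, residual, threshold=0.5):
+    """Send only updates that have grown large; keep the rest locally."""
+    acc = residual + grad                   # fold this batch's gradient into pending updates
+    mask = np.abs(acc) >= threshold         # positions large enough to synchronize now
+    to_send = np.where(mask, np.sign(acc) * threshold, 0.0)
+    new_residual = acc - to_send            # small updates stay local; nothing is discarded
+    return to_send, new_residual
+```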
+
+
+## When to Use Gradient Compression
+
+Gradient compression can be helpful when training models whose architectures include large fully connected components. For larger models, communication cost becomes a major factor, so such models stand to benefit greatly from gradient compression.
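+
+For example, with MXNet's Module API, gradient compression can be enabled by passing `compression_params` (a sketch; the `net` symbol is a placeholder, and the `'2bit'` type and threshold shown are assumed values to adjust for your setup):
+
+```python
+import mxnet as mx
+
+# net is assumed to be a Symbol with a large fully connected component
+mod = mx.mod.Module(symbol=net,
+                    compression_params={'type': '2bit', 'threshold': 0.5})
+```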
+
+
+### GPU versus CPU
+
+The greatest benefits from gradient compression are realized when using GPUs, for both single-node multi-GPU and multi-node (single or multi-GPU) distributed training. A CPU provides far lower compute density per node than a GPU, so CPU-based nodes require less communication bandwidth during training than GPU-based nodes. Hence, the benefits of gradient compression are lower for CPU-based nodes than for GPU-based nodes.
+
+
+### Scaling
 
 Review comment:
   I think this should go at the end, in some section like "Can gradient compression help training on a single instance too?",
   because the rest of the page is about distributed training.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services