Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/22 03:25:25 UTC

[GitHub] pracheer commented on a change in pull request #9152: tutorial for distributed training

pracheer commented on a change in pull request #9152: tutorial for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/9152#discussion_r158419876
 
 

 ##########
 File path: docs/faq/distributed_training.md
 ##########
 @@ -0,0 +1,286 @@
+# Distributed training
+MXNet supports distributed training, enabling us to leverage multiple machines for faster training.
+In this document, we describe how it works, how to launch a distributed training job, and
+the environment variables that provide finer control.
+
+## Types of parallelism
+There are two ways in which we can distribute the workload of training a neural network across multiple devices (either GPUs or CPUs).
+The first way is *data parallelism*, which refers to the case where each device stores a complete copy of the model.
+Each device works with a different part of the dataset, and the devices collectively update a shared model.
+These devices can be located on a single machine or across multiple machines.
+In this document, we describe how to train a model with devices distributed across machines in a data parallel way.
+
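+As a quick illustration (a minimal sketch, assuming a toy network and synthetic data), the main change a worker
+script needs for data-parallel training is to create a distributed key-value store and pass it to the training
+call; each worker is then started on its machine by a launcher such as `tools/launch.py`:
+
+```python
+import mxnet as mx
+
+# 'dist_sync' aggregates gradients from all workers before each parameter update.
+# Note: creating a distributed kvstore expects the scheduler/server/worker processes
+# set up by the launcher, so run this script through the launcher rather than directly.
+kv = mx.kv.create('dist_sync')
+
+# Toy network and synthetic data, purely for illustration.
+data = mx.sym.Variable('data')
+net = mx.sym.FullyConnected(data=data, num_hidden=10)
+net = mx.sym.SoftmaxOutput(data=net, name='softmax')
+train_iter = mx.io.NDArrayIter(mx.nd.ones((100, 20)), mx.nd.zeros((100,)), batch_size=10)
+
+# Each worker holds a full copy of the model; gradients are synchronized through the kvstore.
+mod = mx.mod.Module(net, context=mx.cpu())
+mod.fit(train_iter, num_epoch=1, kvstore=kv)
+```
+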
+When a model is so large that it doesn't fit into device memory, a second approach called *model parallelism* is useful.
+Here, different devices are assigned the task of learning different parts of the model.
+Currently, MXNet supports model parallelism on a single machine only. Refer to [Training with multiple GPUs using model parallelism](https://mxnet.incubator.apache.org/versions/master/how_to/model_parallel_lstm.html) for more on this.
+
+## How does distributed training work?
+The architecture of distributed training in MXNet is as follows:
 
 Review comment:
   nitpick: the word "architecture" is a bit off here, since what follows reads more like the concepts involved in distributed training. Do you think "The concepts involved in distributed training in MXNet are as follows" would be more appropriate?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services