Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/19 09:41:02 UTC

[GitHub] rahul003 commented on a change in pull request #9135: tutorial for distributed training

rahul003 commented on a change in pull request #9135: tutorial for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/9135#discussion_r157704685
 
 

 ##########
 File path: docs/faq/multi_devices.md
 ##########
 @@ -82,124 +82,7 @@ Note that this option may result in higher GPU memory usage.
 
 When using a large number of GPUs, e.g. >=4, we suggest using `device` for better performance.
 
-## Distributed Training with Multiple Machines
+## Multiple devices across machines
 
-`KVStore` also supports a number of options for running on multiple machines.
-
-- `dist_sync` behaves similarly to `local` but exhibits one major difference.
-  With `dist_sync`, `batch-size` now means the batch size used on each machine.
-  So if there are *n* machines and we use batch size *b*,
-  then `dist_sync` behaves like `local` with batch size *n\*b*.
-- `dist_device_sync` is similar to `dist_sync`. The difference between them is that
-  `dist_device_sync` aggregates gradients and updates weights on GPUs,
-  while `dist_sync` does so in CPU memory.
-- `dist_async` performs asynchronous updates.
-  The weight is updated whenever gradients are received from any machine.
-  The update is atomic, i.e., no two updates happen on the same weight at the same time.
-  However, the order is not guaranteed; a short usage sketch of these modes follows below.
-
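-As a rough illustration of how this choice looks in Python (the example
-training scripts below simply pass a `--kv-store` flag instead), here is a
-minimal sketch; it assumes the process is started under `launch.py` so the
-scheduler environment variables are already set:
-
-```python
-import mxnet as mx
-
-# Pick one of the modes described above.
-kv = mx.kvstore.create('dist_sync')  # or 'dist_device_sync' / 'dist_async'
-
-# Each worker knows its rank and the total worker count, which is useful for
-# giving every machine a different shard of the data. With 2 workers and a
-# per-machine batch size of 64, `dist_sync` behaves like `local` with a
-# global batch size of 128.
-print('worker', kv.rank, 'of', kv.num_workers)
-```
-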
-### How to Launch a Job
-
-> To use distributed training, we need to compile with `USE_DIST_KVSTORE=1`
-> (see [MXNet installation guide](http://mxnet.io/get_started/install.html) for more options).
-
-Launching a distributed job is a bit different from running on a single
-machine. MXNet provides
-[tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py) to
-start a job by using `ssh`, `mpi`, `sge`, or `yarn`.
-
-An easy way to set up a cluster of EC2 instances for distributed deep learning
-is to use an [AWS CloudFormation template](https://github.com/awslabs/deeplearning-cfn).
-If you do not already have a cluster, check that repository before you continue.
-
-Assume we are in the directory `mxnet/example/image-classification`
-and want to train LeNet to classify MNIST images, as demonstrated in
-[train_mnist.py](https://github.com/dmlc/mxnet/blob/master/example/image-classification/train_mnist.py).
-
-On a single machine, we can run:
-
-```bash
-python train_mnist.py --network lenet
-```
-
-Now, say we are given two ssh-able machines with _MXNet_ installed on both.
-We want to train LeNet on these two machines.
-First, we save the IPs (or hostnames) of the two machines in a file named `hosts`, e.g.
-
-```bash
-$ cat hosts
-172.30.0.172
-172.30.0.171
-```
-
-Next, if the mxnet folder is accessible from both machines, e.g. on a
-[network filesystem](https://help.ubuntu.com/lts/serverguide/network-file-system.html),
-then we can run:
-
-```bash
-python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_sync
-```
-
-Note that here we
-
-- use `launch.py` to submit the job.
-- provide a launcher: `ssh` if all machines are ssh-able, `mpi` if `mpirun` is
-  available, `sge` for Sun Grid Engine, and `yarn` for Apache Yarn.
-- use `-n` to set the number of worker nodes to run on.
-- use `-H` to pass the host file, which is required by `ssh` and `mpi`.
-- use `--kv-store` to select either `dist_sync` or `dist_async`.
-
-
-### Synchronize Directory
-
-Now consider the case where the mxnet folder is not accessible from the other machines.
-We can first copy the `MXNet` library into the current directory:
-```bash
-cp -r ../../python/mxnet .
-cp -r ../../lib/libmxnet.so mxnet
-```
-
-then ask `launch.py` to synchronize the current directory to the
-`/tmp/mxnet` directory on all machines with `--sync-dst-dir`:
-
-```bash
-python ../../tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet \
-   python train_mnist.py --network lenet --kv-store dist_sync
-```
-
-
-### Gradient compression
-
-If your model has fully connected components or recurrent neural networks, you may be able to increase training speed by using gradient compression, at the cost of a potentially slight loss of accuracy. Please see [Gradient Compression](https://mxnet.incubator.apache.org/versions/master/faq/gradient_compression.html) for more details on when and how to use it. For the above example, gradient compression can be enabled by running the following:
-
-```bash
-python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet \
-    --kv-store dist_sync --gc-type 2bit
-```
-
-In this example, `gc-type` has been set to `2bit` to enable two-bit gradient compression.
-
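-The `--gc-type` flag is handled by the example scripts; a rough sketch of what
-enabling the same compression looks like on a kv-store directly (the
-`threshold` value here is illustrative, not a recommendation):
-
-```python
-import mxnet as mx
-
-kv = mx.kvstore.create('dist_sync')
-
-# Quantize gradients to 2 bits before they are sent to the parameter servers.
-kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})
-```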
-
-### Use a Particular Network Interface
-
-_MXNet_ often chooses the first available network interface.
-But for machines that have multiple interfaces,
-we can specify which network interface to use for data
-communication with the environment variable `DMLC_INTERFACE`.
-For example, to use the interface `eth0`, we can run:
-
-```
-export DMLC_INTERFACE=eth0; python ../../tools/launch.py ...
-```
-
-### Debug Connection
-
-Set `PS_VERBOSE=1` to see the debug logging, e.g.
-```
-export PS_VERBOSE=1; python ../../tools/launch.py ...
-```
-
-### More
-
-- See more launch options with `python ../../tools/launch.py -h`
-- See more options for [ps-lite](http://ps-lite.readthedocs.org/en/latest/how_to.html)
+Refer to [Distributed training](https://mxnet.incubator.apache.org/versions/master/how_to/distributed_training.html) for details on training with multiple devices across machines.
 
 Review comment:
  This versioning is pretty weird right now. I went through the codebase and saw that some links use the master version while others use `mxnet.incubator.apache.org/how_to...`, which redirects to the latest released version. Yes, we need to fix this. I decided to use master for now, since some links can break if they point to an older version of the site.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services