Posted to commits@mxnet.apache.org by te...@apache.org on 2020/04/05 01:12:46 UTC

[incubator-mxnet] 01/01: Add instructions on distributed MXNet with Horovod on Kubernetes

This is an automated email from the ASF dual-hosted git repository.

terrytangyuan pushed a commit to branch terrytangyuan-patch-1
in repository https://gitbox.apache.org/repos/asf/incubator-mxnet.git

commit 7d0d23ec774c12dd2b5ceb3f10707e16c2603536
Author: Yuan Tang <te...@gmail.com>
AuthorDate: Sat Apr 4 21:11:59 2020 -0400

    Add instructions on distributed MXNet with Horovod on Kubernetes
---
 example/distributed_training-horovod/README.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/example/distributed_training-horovod/README.md b/example/distributed_training-horovod/README.md
index 961f7f6..b2b4ad7 100644
--- a/example/distributed_training-horovod/README.md
+++ b/example/distributed_training-horovod/README.md
@@ -30,7 +30,8 @@ to communicate parameters between workers. There is no dedicated server and the
 between workers does not depend on the number of workers. Therefore, it scales well in the case where 
 there are a large number of workers and network bandwidth is the bottleneck.
 
-# Install
+# Setup
+
 ## Install MXNet
 ```bash
 $ pip install mxnet
@@ -53,6 +54,10 @@ Steps to install Open MPI are listed [here](https://www.open-mpi.org/faq/?catego
 **Note**: Open MPI 3.1.3 has an issue that may cause hangs.  It is recommended
 to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
 
+## On Kubernetes
+
+Distributed MXNet jobs with Horovod can be submitted to a Kubernetes cluster via [Kubeflow MPI Operator](https://github.com/kubeflow/mpi-operator). Please refer to [this example](https://github.com/kubeflow/mpi-operator/tree/master/examples/mxnet) for details, including the Dockerfile with all the dependencies mentioned in previous sections, distributed training Python script based on Horovod, and the YAML configuration file that can be used for submitting a job on a Kubernetes cluster.
+
 # Usage
 
 To run MXNet with Horovod, make the following additions to your training script: