Posted to issues@mxnet.apache.org by "zhouhai ye (JIRA)" <ji...@apache.org> on 2018/04/28 01:54:00 UTC

[jira] [Created] (MXNET-366) Extend MXNet Distributed Training by MPI AllReduce

zhouhai ye created MXNET-366:
--------------------------------

             Summary: Extend MXNet Distributed Training by MPI AllReduce
                 Key: MXNET-366
                 URL: https://issues.apache.org/jira/browse/MXNET-366
             Project: Apache MXNet
          Issue Type: New Feature
            Reporter: zhouhai ye
         Attachments: performance-allreduce.png, resnet-50.png

We add a new type of kvstore (dist_sync_mpi) that extends MXNet distributed training with MPI AllReduce. Since this kvstore has no parameter server, we replace the original kvstore APIs push and pull with a single API, pushpull. Refer to the API Spec section of the design doc for details.

Our design doc: [https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094]
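For illustration, a minimal usage sketch of the proposed pushpull call. The exact signature (key, gradient, out=result) is an assumption modeled on the existing kvstore push/pull APIs; the API Spec in the design doc is authoritative.

    import mxnet as mx

    # Proposed kvstore type: AllReduce over MPI, no parameter server.
    kv = mx.kv.create('dist_sync_mpi')

    grad = mx.nd.ones((2, 3))    # local gradient on this worker
    out = mx.nd.zeros((2, 3))    # buffer for the reduced result

    # One fused call replaces push(key, grad) followed by pull(key, out):
    # gradients are summed across all workers and the result lands in out.
    kv.pushpull(0, grad, out=out)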


The attached images contain the performance and accuracy results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org