Posted to dev@mxnet.apache.org by Lin Yuan <ap...@gmail.com> on 2018/11/02 22:51:19 UTC

Re: Horovod-MXNet Integration

Hi Mu,

Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on
releasing the MXNet-Horovod integration in production. We have made some
changes on both the MXNet and Horovod sides. The changes on the MXNet side
have mostly been merged, and we are working to merge the code into the Horovod repo. We
will send a design doc to you for review again next week.

Thanks for your feedback,

Lin

On Wed, Oct 31, 2018 at 12:03 PM Mu Li <mu...@gmail.com> wrote:

> Thanks for your contribution, Carl.
>
> I remember I left a comment on the proposal, but today I found it had
> disappeared. My suggestion is to try our best not to change the existing API.
> The reason is that we would otherwise need to change all frontend trainers
> that use the existing kvstore APIs, which may confuse users.
>
> The current proposal wants to add the following 4 APIs to kvstore:
>
>    - kv.pushpull
>    - kv.broadcast
>    - kv.local_rank
>    - kv.num_local_workers
>
> Pushpull can be done with a sequential push and pull: you can do nothing in
> push and put all of the pushpull work into pull. Broadcast can be implemented
> by pull.
>
> What are local workers? GPUs in a single machine? If so, we can query that
> directly.
>
>
> On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <ca...@gmail.com> wrote:
>
> > Hi,
> >
> > Currently, distributed training in MXNet can only be done using the
> > parameter server. Horovod is an open-source distributed training framework
> > that has shown a 2x speedup over TensorFlow with parameter servers. We
> > propose to add Horovod support to MXNet. This will help our users achieve
> > the goal of linear scalability to 256 GPUs and beyond. The design proposal
> > is on the cwiki:
> >
> > https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
> >
> > Please feel free to let me know if you have any suggestions or feedback.
> >
> > Regards,
> > Carl
> >
>
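
As a rough illustration of Mu's point above (that the proposed pushpull and
broadcast can be expressed with the existing kvstore push and pull, and that
the local GPU count can be queried directly), the sketch below uses only
existing mxnet.kvstore calls. The key names, shapes, and the num_gpus query
are illustrative assumptions, not part of the proposal.

    import mxnet as mx

    kv = mx.kv.create('dist_sync')           # existing kvstore, no new API needed

    # "broadcast": initialize a parameter once, then every worker pulls it.
    weight = mx.nd.random.uniform(shape=(1024,))
    kv.init('weight', weight)
    kv.pull('weight', out=weight)            # all workers end up with the same values

    # "pushpull": a sequential push then pull on the same key; the servers
    # aggregate what the workers push, and the pull returns the aggregated result.
    grad = mx.nd.ones((1024,))
    kv.init('grad_0', mx.nd.zeros((1024,)))
    kv.push('grad_0', grad)
    kv.pull('grad_0', out=grad)              # aggregated gradient

    # Local worker count: for GPUs on one machine this can be queried directly
    # (in recent MXNet versions) instead of adding kv.num_local_workers.
    num_local_gpus = mx.context.num_gpus()

The open question in the thread is whether that indirection is acceptable or
whether a fused kv.pushpull is worth the API change.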

Re: Horovod-MXNet Integration

Posted by Aaron Markham <aa...@gmail.com>.
Congrats on the Horovod integration, everyone. That's really great to hear.

On Wed, Jan 30, 2019 at 10:08 AM Lin Yuan <ap...@gmail.com> wrote:
>
> Hi Yuan,
>
> Thanks for your interest. We have just added MXNet support in Horovod and are
> working on performance tuning and adding more examples. We are definitely
> interested in further extending its support to work with Kubeflow.
>
> Let's set up some time to have a more detailed discussion.
>
> Best,
>
> Lin
>
> On Wed, Jan 30, 2019 at 7:42 AM Yuan Tang <te...@gmail.com> wrote:
>
> > Hi,
> >
> > It's great to see that the MXNet-Horovod integration got merged:
> > https://github.com/uber/horovod/pull/542
> >
> > Is there any future plan for this? I've been working on Kubeflow's
> > MPI-Operator (https://github.com/kubeflow/mpi-operator) lately, and it would
> > be interesting to see an example of using Horovod + MXNet on Kubeflow via
> > the MPI Operator. Feel free to reach out (@terrytangyuan
> > <https://github.com/terrytangyuan>) if you encounter any issues.
> >
> > Best,
> > Yuan
> >
> >
> > On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan <ap...@gmail.com> wrote:
> >
> > > Hi Mu,
> > >
> > > Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on
> > > releasing the MXNet-Horovod integration in production. We have made some
> > > changes on both the MXNet and Horovod sides. The changes on the MXNet side
> > > have mostly been merged, and we are working to merge the code into the Horovod repo. We
> > > will send a design doc to you for review again next week.
> > >
> > > Thanks for your feedback,
> > >
> > > Lin
> > >
> > > On Wed, Oct 31, 2018 at 12:03 PM Mu Li <mu...@gmail.com> wrote:
> > >
> > > > Thanks for your contribution, Carl.
> > > >
> > > > I remember I left a comment on the proposal, but today I found it had
> > > > disappeared. My suggestion is to try our best not to change the existing API.
> > > > The reason is that we would otherwise need to change all frontend trainers
> > > > that use the existing kvstore APIs, which may confuse users.
> > > >
> > > > The current proposal wants to add the following 4 APIs to kvstore:
> > > >
> > > >    - kv.pushpull
> > > >    - kv.broadcast
> > > >    - kv.local_rank
> > > >    - kv.num_local_workers
> > > >
> > > > Pushpull can be done with a sequential push and pull: you can do nothing
> > > > in push and put all of the pushpull work into pull. Broadcast can be
> > > > implemented by pull.
> > > >
> > > > What are local workers? GPUs in a single machine? If so, we can query
> > > > that directly.
> > > >
> > > >
> > > > On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <ca...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Currently, distributed training in MXNet can only be done using the
> > > > > parameter server. Horovod is an open-source distributed training
> > > > > framework that has shown a 2x speedup over TensorFlow with parameter
> > > > > servers. We propose to add Horovod support to MXNet. This will help our
> > > > > users achieve the goal of linear scalability to 256 GPUs and beyond. The
> > > > > design proposal is on the cwiki:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
> > > > >
> > > > > Please feel free to let me know if you have any suggestions or feedback.
> > > > >
> > > > > Regards,
> > > > > Carl
> > > > >
> > > >
> > >
> >

Re: Horovod-MXNet Integration

Posted by Lin Yuan <ap...@gmail.com>.
Hi Yuan,

Thanks for your interest. We have just added MXNet support in Horovod and are
working on performance tuning and adding more examples. We are definitely
interested in further extending its support to work with Kubeflow.
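
A minimal sketch of what an MXNet training script looks like with the merged
horovod.mxnet module is below; the model, data, and hyperparameters are
placeholders rather than one of the tuned examples.

    import mxnet as mx
    from mxnet import autograd, gluon
    import horovod.mxnet as hvd

    hvd.init()                                    # one process per GPU
    ctx = mx.gpu(hvd.local_rank())                # pin each process to its GPU

    net = gluon.nn.Dense(10, in_units=100)        # placeholder model
    net.initialize(ctx=ctx)

    # Scale the learning rate by the number of workers, wrap the optimizer so
    # gradients are averaged across workers, and broadcast the initial
    # parameters so every worker starts from the same state.
    opt = mx.optimizer.SGD(learning_rate=0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    hvd.broadcast_parameters(net.collect_params(), root_rank=0)

    # kvstore=None: Horovod handles gradient aggregation, not a kvstore.
    trainer = gluon.Trainer(net.collect_params(), opt, kvstore=None)
    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

    data = mx.nd.random.uniform(shape=(32, 100), ctx=ctx)   # placeholder batch
    label = mx.nd.zeros((32,), ctx=ctx)
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(32)                              # allreduce happens in the optimizer

It is launched in the usual Horovod way, one MPI process per GPU, which is also
the launch pattern the Kubeflow MPI Operator manages.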

Let's set up some time to have a more detailed discussion.

Best,

Lin

On Wed, Jan 30, 2019 at 7:42 AM Yuan Tang <te...@gmail.com> wrote:

> Hi,
>
> It's great to see that the MXNet-Horovod integration got merged:
> https://github.com/uber/horovod/pull/542
>
> Is there any future plan for this? I've been working on Kubeflow's
> MPI-Operator (https://github.com/kubeflow/mpi-operator) lately, and it would
> be interesting to see an example of using Horovod + MXNet on Kubeflow via
> the MPI Operator. Feel free to reach out (@terrytangyuan
> <https://github.com/terrytangyuan>) if you encounter any issues.
>
> Best,
> Yuan
>
>
> On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan <ap...@gmail.com> wrote:
>
> > Hi Mu,
> >
> > Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on
> > releasing the MXNet-Horovod integration in production. We have made some
> > changes on both the MXNet and Horovod sides. The changes on the MXNet side
> > have mostly been merged, and we are working to merge the code into the Horovod repo. We
> > will send a design doc to you for review again next week.
> >
> > Thanks for your feedback,
> >
> > Lin
> >
> > On Wed, Oct 31, 2018 at 12:03 PM Mu Li <mu...@gmail.com> wrote:
> >
> > > Thanks for your contribution, Carl.
> > >
> > > I remember I left a comment on the proposal, but today I found it had
> > > disappeared. My suggestion is to try our best not to change the existing API.
> > > The reason is that we would otherwise need to change all frontend trainers
> > > that use the existing kvstore APIs, which may confuse users.
> > >
> > > The current proposal wants to add the following 4 APIs to kvstore:
> > >
> > >    - kv.pushpull
> > >    - kv.broadcast
> > >    - kv.local_rank
> > >    - kv.num_local_workers
> > >
> > > Pushpull can be done with a sequential push and pull: you can do nothing
> > > in push and put all of the pushpull work into pull. Broadcast can be
> > > implemented by pull.
> > >
> > > What are local workers? GPUs in a single machine? If so, we can query
> > > that directly.
> > >
> > >
> > > On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <ca...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Currently, distributed training in MXNet can only be done using the
> > > > parameter server. Horovod is an open-source distributed training
> > > > framework that has shown a 2x speedup over TensorFlow with parameter
> > > > servers. We propose to add Horovod support to MXNet. This will help our
> > > > users achieve the goal of linear scalability to 256 GPUs and beyond. The
> > > > design proposal is on the cwiki:
> > > >
> > > > https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
> > > >
> > > > Please feel free to let me know if you have any suggestions or feedback.
> > > >
> > > > Regards,
> > > > Carl
> > > >
> > >
> >
>

Re: Horovod-MXNet Integration

Posted by Yuan Tang <te...@gmail.com>.
Hi,

It's great to see that the MXNet-Horovod integration got merged:
https://github.com/uber/horovod/pull/542

Is there any future plan for this? I've been working on Kubeflow's
MPI-Operator (https://github.com/kubeflow/mpi-operator) lately, and it would
be interesting to see an example of using Horovod + MXNet on Kubeflow via
the MPI Operator. Feel free to reach out (@terrytangyuan
<https://github.com/terrytangyuan>) if you encounter any issues.

Best,
Yuan


On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan <ap...@gmail.com> wrote:

> Hi Mu,
>
> Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on
> releasing the MXNet-Horovod integration in production. We have made some
> changes on both the MXNet and Horovod sides. The changes on the MXNet side
> have mostly been merged, and we are working to merge the code into the Horovod repo. We
> will send a design doc to you for review again next week.
>
> Thanks for your feedback,
>
> Lin
>
> On Wed, Oct 31, 2018 at 12:03 PM Mu Li <mu...@gmail.com> wrote:
>
> > Thanks for your contribution, Carl.
> >
> > I remember I left a comment on the proposal, but today I found it had
> > disappeared. My suggestion is to try our best not to change the existing API.
> > The reason is that we would otherwise need to change all frontend trainers
> > that use the existing kvstore APIs, which may confuse users.
> >
> > The current proposal wants to add the following 4 APIs to kvstore:
> >
> >    - kv.pushpull
> >    - kv.broadcast
> >    - kv.local_rank
> >    - kv.num_local_workers
> >
> > Pushpull can be done with a sequential push and pull: you can do nothing in
> > push and put all of the pushpull work into pull. Broadcast can be implemented
> > by pull.
> >
> > What are local workers? GPUs in a single machine? If so, we can query that
> > directly.
> >
> >
> > On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <ca...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Currently, distributed training in MXNet can only be done using the
> > > parameter server. Horovod is an open-source distributed training framework
> > > that has shown a 2x speedup over TensorFlow with parameter servers. We
> > > propose to add Horovod support to MXNet. This will help our users achieve
> > > the goal of linear scalability to 256 GPUs and beyond. The design proposal
> > > is on the cwiki:
> > >
> > > https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
> > >
> > > Please feel free to let me know if you have any suggestions or feedback.
> > >
> > > Regards,
> > > Carl
> > >
> >
>