Posted to dev@aurora.apache.org by Mangirish Wagle <va...@gmail.com> on 2016/10/14 06:12:45 UTC

Need inputs on scheduling

Hello Aurora Devs,

I am contributing to Apache Airavata <http://airavata.apache.org/> and
currently working on extending the support for the science gateways to run
MPI jobs on cloud based Mesos clusters.

Is there a way I can achieve this using Apache Aurora? I would really
appreciate if you could share info on any work already being done to
achieve scheduling MPI jobs on Mesos.

Thank you.

Best Regards,
Mangirish Wagle
Graduate Student, Indiana University Bloomington

Re: Need inputs on scheduling

Posted by Mangirish Wagle <va...@gmail.com>.
Hi Stephan,

Thank you very much for those insights. So if I understand it correctly,
the idea here is that the MPI job would be distributed across multiple
Aurora job instances, instead of multiple machines. Also all the MPI jobs
should be scheduled together as one entity (gang scheduling).

One of the Mesos developers pointed me to a gang scheduler
implementation: https://github.com/nqn/gasc
As I understand it, this gang-schedules an MPI job directly on Mesos. I
need to understand how advantageous it would be to use Aurora as a backend
for gang scheduling instead of bare Mesos. One advantage is that Aurora is
a tested, robust, and fault-tolerant framework on Mesos, whereas the
latter approach would call for implementing robustness and fault tolerance
ourselves.

Please let me know if you have any more thoughts.

Thanks and Regards,
Mangirish Wagle

On Sun, Oct 16, 2016 at 1:21 PM, Stephan Erb <se...@apache.org> wrote:

> I have used MPI briefly a couple of years ago, and from what I
> remember:
>
> MPI tends to require so-called gang scheduling where all instances of a
> job are scheduled simultaneously. Due to lacking inherent fault
> tolerance of MPI, it is common to abort the entire job (i.e. all
> instances) if a single instance fails. Furthermore, native MPI/HPC
> schedulers tend to support long queues with various fairness mechanisms
> in order to make the gang scheduling efficient.
>
> In contrast, Aurora makes the assumption that individual instances of a
> job can be scheduled and fail independently. This implies that you
> would need some external scaffolding to ensure proper gang scheduling.
> (Disclaimer: I have no idea how difficult this would be)
>
> Aurora is battle-tested. Using it as a backend of HPC/MPI scheduler
> could therefore be worthwhile if you manage to make the scaffolding
> work. In particular, because writing a scalable and fault-tolerant
> Mesos framework can be quite difficult.
>
> Best Regards,
> Stephan

Re: Need inputs on scheduling

Posted by Stephan Erb <se...@apache.org>.
I have used MPI briefly a couple of years ago, and from what I
remember:

MPI tends to require so-called gang scheduling where all instances of a
job are scheduled simultaneously. Due to lacking inherent fault
tolerance of MPI, it is common to abort the entire job (i.e. all
instances) if a single instance fails. Furthermore, native MPI/HPC
schedulers tend to support long queues with various fairness mechanisms
in order to make the gang scheduling efficient.

In contrast, Aurora makes the assumption that individual instances of a
job can be scheduled and fail independently. This implies that you
would need some external scaffolding to ensure proper gang scheduling.
(Disclaimer: I have no idea how difficult this would be)

Aurora is battle-tested. Using it as a backend of HPC/MPI scheduler
could therefore be worthwhile if you manage to make the scaffolding
work. In particular, because writing a scalable and fault-tolerant
Mesos framework can be quite difficult.
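
A minimal sketch of the kind of external scaffolding described above,
written as plain Python with no Aurora or Mesos calls. The names GangJob,
try_start, and report_failure are hypothetical and only illustrate the
two rules: start all instances together or none at all, and abort every
instance as soon as one fails.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FAILED = "failed"
    KILLED = "killed"


@dataclass
class GangJob:
    """Hypothetical tracker for all instances (MPI ranks) of one job."""
    name: str
    size: int
    states: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every instance starts out waiting for placement.
        self.states = {i: State.PENDING for i in range(self.size)}

    def try_start(self, free_slots: int) -> bool:
        """All-or-nothing admission: start only if every instance fits."""
        if free_slots < self.size:
            return False  # leave the whole gang pending
        for i in self.states:
            self.states[i] = State.RUNNING
        return True

    def report_failure(self, instance: int) -> None:
        """A single failed instance aborts the rest of the gang."""
        self.states[instance] = State.FAILED
        for i, state in self.states.items():
            if state is State.RUNNING:
                self.states[i] = State.KILLED


job = GangJob("mpi-demo", size=4)
job.try_start(free_slots=3)   # not enough room, nothing starts
job.try_start(free_slots=4)   # all four instances start together
job.report_failure(2)         # instance 2 fails, the other three are killed
```

In a real setup the scaffolding would have to translate these decisions
into scheduler operations (creating and killing the job); how to do that
with Aurora is not covered here.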

Best Regards,
Stephan


On Sat, 2016-10-15 at 12:47 -0400, Mangirish Wagle wrote:
> Hi Santhosh,
> 
> Thanks for your response and suggestion. Mesos-hydra is not being used and
> supported by the community anymore, from what I heard from Mesos
> developers. But certainly it may be a potential reference to build up upon.
>
> My most preferred option would be to use any existing schedulers like
> Apache Aurora to run MPI. If you have any insights on that, that would be
> really helpful.
> 
> Regards,
> Mangirish
Re: Need inputs on scheduling

Posted by Mangirish Wagle <va...@gmail.com>.
Hi Santhosh,

Thanks for your response and suggestion. Mesos-hydra is not being used and
supported by the community anymore, from what I heard from Mesos
developers. But certainly it may be a potential reference to build up upon.

My most preferred option would be to use any existing schedulers like
Apache Aurora to run MPI. If you have any insights on that, that would be
really helpful.

Regards,
Mangirish

On Sat, Oct 15, 2016 at 11:07 AM, Santhosh Kumar Shanmugham <
sshanmugham@twitter.com.invalid> wrote:

> Have you checked out https://github.com/mesosphere/mesos-hydra?

Re: Need inputs on scheduling

Posted by Santhosh Kumar Shanmugham <ss...@twitter.com.INVALID>.
Have you checked out https://github.com/mesosphere/mesos-hydra?

On Oct 14, 2016 6:08 PM, "Mangirish Wagle" <va...@gmail.com> wrote:

> Thanks for your response Zameer. I shall check out Apache Aurora and update
> if it served the purpose.

Re: Need inputs on scheduling

Posted by Mangirish Wagle <va...@gmail.com>.
Thanks for your response Zameer. I shall check out Apache Aurora and update
if it served the purpose.

On Fri, Oct 14, 2016 at 2:01 PM, Zameer Manji <zm...@apache.org> wrote:

> Hey,
>
> I am not an expert on MPI jobs, but it seems possible to run them on
> Aurora. Aurora is a pretty flexible scheduler that lets you run arbitrary
> binaries or container images. Aurora is designed for long running services
> and assuming that you want to launch workers that are long running, it
> could solve your problem.

Re: Need inputs on scheduling

Posted by Zameer Manji <zm...@apache.org>.
Hey,

I am not an expert on MPI jobs, but it seems possible to run them on
Aurora. Aurora is a pretty flexible scheduler that lets you run arbitrary
binaries or container images. Aurora is designed for long running services
and assuming that you want to launch workers that are long running, it
could solve your problem.

On Thu, Oct 13, 2016 at 11:12 PM, Mangirish Wagle <va...@gmail.com>
wrote:

> Hello Aurora Devs,
>
> I am contributing to Apache Airavata <http://airavata.apache.org/> and
> currently working on extending the support for the science gateways to run
> MPI jobs on cloud based Mesos clusters.
>
> Is there a way I can achieve this using Apache Aurora? I would really
> appreciate if you could share info on any work already being done to
> achieve scheduling MPI jobs on Mesos.
>
> Thank you.
>
> Best Regards,
> Mangirish Wagle
> Graduate Student, Indiana University Bloomington
>
> --
> Zameer Manji
>