Posted to user@mesos.apache.org by Adam Sylvester <op...@gmail.com> on 2017/12/03 17:13:01 UTC

Multi-machine jobs

I have a use case where my Scheduler gets an externally-generated request
to produce an image.  This is a CPU-intensive task that I can divide into,
say, 20 largely independent jobs, and I have an application which takes
the input filename and its slot number (out of the 20) and produces
1/20th of the output image.  Each job runs on its own machine, using all
the CPUs and memory on that machine.  The final output image isn't finished
until all 20 jobs are complete, so I don't want to send an external 'job
complete' message until all 20 jobs finish.

I can do this in Mesos by accepting 20 resource offers and launching tasks
on them, where each task says it needs all resources on the machine, then
doing bookkeeping on the Scheduler as tasks complete to keep track of when
all 20 finish, at which point I can send my external job complete message.

This is all doable, but there are some obvious complications (for
example, if any of the 20 jobs fails, I want to kill the other jobs and
fail the whole group, but I have to keep track of all that myself).
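For what it's worth, the bookkeeping described above is small enough to sketch.  This is just an illustration, not the Mesos scheduler API — `kill_task` here is a hypothetical hook into whatever framework launches the tasks:

```python
# Sketch of group bookkeeping for N machine-sized tasks: the external
# 'job complete' message is sent only once every slot succeeds, and a
# single failure kills the remaining tasks and fails the whole group.

NUM_SLOTS = 20

class ImageJob:
    def __init__(self, num_slots=NUM_SLOTS):
        self.pending = set(range(num_slots))  # slots still running
        self.failed = False
        self.complete = False

    def on_task_finished(self, slot):
        if self.failed:
            return  # ignore late updates after the group has failed
        self.pending.discard(slot)
        if not self.pending:
            self.complete = True  # here: send the external 'job complete'

    def on_task_failed(self, slot, kill_task=lambda s: None):
        if self.failed:
            return
        self.failed = True
        self.pending.discard(slot)
        for other in self.pending:  # fail fast: kill every other slot
            kill_task(other)
        self.pending.clear()
```

In a real Mesos scheduler these two handlers would be driven from the statusUpdate callback (TASK_FINISHED / TASK_FAILED), but the grouping logic itself is framework-agnostic.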

AWS Batch has Array Jobs which would give me the kind of functionality I
want (http://docs.aws.amazon.com/batch/latest/userguide/array_jobs.html).
I'm wondering if there's any way to do this - specifically running a single
logical task across multiple machines - using either Mesos or an additional
framework that lives on top of Mesos.

Thanks.
-Adam

Re: Multi-machine jobs

Posted by Mohit Jaggi <mo...@uber.com>.
MapReduce or Spark can work.
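For the 20-slot image case, this amounts to a map-then-join: fan the slots out as a parallel map and treat gathering the results as the join point.  A minimal local sketch with Python's concurrent.futures (`process_slot` is a stand-in for the real worker; on Spark the same shape would be `sc.parallelize(range(20), 20).map(process_slot).collect()`):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SLOTS = 20

def process_slot(slot):
    # Stand-in for the real worker: takes the input filename and its
    # slot number and produces 1/20th of the output image.
    return ("input.img", slot)

def render_image(num_slots=NUM_SLOTS):
    # Fan out one task per slot; collecting the results is the join
    # point.  If any slot raises, the exception propagates here and
    # the whole job fails -- the fail-all semantics asked for above.
    with ThreadPoolExecutor(max_workers=num_slots) as pool:
        return list(pool.map(process_slot, range(num_slots)))
```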
