
Posted to dev@flink.apache.org by Maximilian Michels <mx...@apache.org> on 2015/05/13 10:34:51 UTC

Re: Flink's multi-user support

I think we can agree that real multi-user support in Flink (standalone) is
neither desirable, because sophisticated solutions already exist (YARN or
Mesos), nor feasible, because it is a lot of work to get right.

As things stand, resource sharing between two users who submit jobs at the
same time is not handled properly. However, this discussion showed that
submitting multiple jobs to a single Flink cluster should still be supported.
This could be realized with a simple queuing system in which jobs are
executed one after another.
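
A minimal sketch of such a queue, assuming nothing about Flink's internals:
the class name SequentialJobQueue and the job names are made up for
illustration, and a job is just a Callable. A single worker thread with a
FIFO queue is enough to get "one job at a time, in submission order".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not part of Flink: runs submitted jobs one at a time.
final class SequentialJobQueue implements AutoCloseable {
    // One worker thread plus the executor's internal FIFO queue means jobs
    // never overlap and start in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Enqueues a job; the Future completes once the job has actually run.
    <T> Future<T> submit(Callable<T> job) {
        return worker.submit(job);
    }

    // Stops accepting new jobs; jobs already queued still run to completion.
    @Override
    public void close() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        List<String> executionOrder = Collections.synchronizedList(new ArrayList<>());
        try (SequentialJobQueue queue = new SequentialJobQueue()) {
            // Two (made-up) users submit at nearly the same time.
            queue.submit(() -> executionOrder.add("alice-wordcount"));
            queue.submit(() -> executionOrder.add("bob-pagerank"));
            Future<Boolean> last = queue.submit(() -> executionOrder.add("alice-kmeans"));
            last.get(); // the queue is sequential, so earlier jobs are done too
        }
        // Prints the jobs in submission order.
        System.out.println(executionOrder);
    }
}
```

This deliberately has no priorities or fairness policies; those are the
parts that YARN or Mesos already provide.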

Once resuming jobs from intermediate results is supported (which should be
soon), multiple clients should still be able to refer to past jobs. The job
manager would simply hold a list of old ExecutionGraphs for each user
session. When the user ends the session or a timeout occurs, the
corresponding graphs are archived. This requires some form of session
management.
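
A rough sketch of that bookkeeping, under stated assumptions: SessionRegistry
and its methods are hypothetical names, an execution graph is stood in for by
a plain String, and the "archive" is just a second in-memory list where a real
implementation would move graphs out of the job manager's working set. The
current time is passed in explicitly so the timeout logic is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Flink code: per-session graphs with timeout-based archiving.
final class SessionRegistry {
    private static final class Session {
        final List<String> graphs = new ArrayList<>();
        long lastAccessMillis;
    }

    private final Map<String, Session> sessions = new HashMap<>();
    private final List<String> archive = new ArrayList<>();
    private final long timeoutMillis;

    SessionRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records a finished job's graph under its session and refreshes the idle timer.
    synchronized void addGraph(String sessionId, String graph, long nowMillis) {
        Session session = sessions.computeIfAbsent(sessionId, id -> new Session());
        session.graphs.add(graph);
        session.lastAccessMillis = nowMillis;
    }

    // Graphs that clients of this session can still resume from.
    synchronized List<String> liveGraphs(String sessionId) {
        Session session = sessions.get(sessionId);
        return session == null ? new ArrayList<>() : new ArrayList<>(session.graphs);
    }

    // The user ends the session explicitly: its graphs move to the archive.
    synchronized void endSession(String sessionId) {
        Session session = sessions.remove(sessionId);
        if (session != null) {
            archive.addAll(session.graphs);
        }
    }

    // Archives every session that has been idle for longer than the timeout.
    synchronized void expireIdleSessions(long nowMillis) {
        Iterator<Map.Entry<String, Session>> it = sessions.entrySet().iterator();
        while (it.hasNext()) {
            Session session = it.next().getValue();
            if (nowMillis - session.lastAccessMillis > timeoutMillis) {
                archive.addAll(session.graphs);
                it.remove();
            }
        }
    }

    synchronized List<String> archivedGraphs() {
        return new ArrayList<>(archive);
    }
}
```

A periodic task in the job manager could call expireIdleSessions with the
current time; how often, and what the timeout should be, is left open here.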

tl;dr I propose to drop the multi-user support that we have now. Instead,
let's have a one-job-at-a-time usage model with a queuing system and,
eventually, session management to deal with resuming from already
materialized results.

What do you think?

On Thu, Apr 30, 2015 at 11:09 AM, Flavio Pompermaier <po...@okkam.it>
wrote:

> There was an attempt to build such a queue during the Dopa project when
> Flink was still Stratosphere.
> Probably it could be a good idea to collect the good and bad things learned
> from it to start designing the new scheduler :)
>
> On Thu, Apr 30, 2015 at 10:08 AM, Stephan Ewen <se...@apache.org> wrote:
>
> > Most components are written multi-job aware.
> >
> > The only thing that is not in there right now is scheduling policies for
> > fair resource sharing. This is important in shared clusters.
> >
> > Since YARN implements all those things (various job queues with different
> > priorities/policies etc), I suggest to not try and re-build it in Flink and
> > simply declare a JobManager a "single-job-at-a-time" manager. You can still
> > run an interactive session with many jobs one after another.
> >
> >
> > On Wed, Apr 29, 2015 at 7:07 PM, Maximilian Michels <mx...@apache.org>
> > wrote:
> >
> > > >
> > > > However, dropping it completely instead of improving it would make Flink
> > > > setups on dedicated clusters quite useless, right?
> > > >
> > >
> > > Not really, because you could also use YARN on dedicated clusters for
> > > proper multi-user support.
> > >
> > > On Wed, Apr 29, 2015 at 5:51 PM, Fabian Hueske <fh...@gmail.com> wrote:
> > >
> > > > I agree that Flink's multi-user support is not very good at the moment.
> > > > However, dropping it completely instead of improving it would make Flink
> > > > setups on dedicated clusters quite useless, right?
> > > >
> > > >
> > > > 2015-04-29 17:33 GMT+02:00 Maximilian Michels <mx...@apache.org>:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Currently Flink accepts jobs from multiple clients and executes them
> > > > > concurrently if the resource limits are not exceeded. However, the
> > > > > multi-user support is very poor. We don't support queuing of jobs and
> > > > > concurrent jobs have to share resources in a nice way. Otherwise, jobs
> > > > > will fail.
> > > > >
> > > > > Using YARN, we circumvent these problems because it provides a proper
> > > > > user and session management. I'm wondering now, should we get rid of the
> > > > > pseudo multi-user mode and just support one user per Flink cluster
> > > > > instance?
> > > > >
> > > > > Best,
> > > > > Max
> > > > >
> > > > > PS:
> > > > > This question came up when I was working on a pull request to support
> > > > > backtracking intermediate results. I need to hold a copy of the full
> > > > > previous execution graph to resume from old results. With multiple users,
> > > > > we have to build in some kind of session management to archive old
> > > > > execution graphs. Otherwise, they will consume too much memory in the job
> > > > > manager.
> > > > >
> > > >
> > >
> >
>

Re: Flink's multi-user support

Posted by Maximilian Michels <mx...@apache.org>.

Yes, it should be possible to implement both independently.

On Wed, May 13, 2015 at 11:41 AM, Stephan Ewen <se...@apache.org> wrote:

> On first thought, the sessions and the multi-job vs. job queue question are
> almost two separate issues.
>
> Can you add the sessions without removing the concurrent jobs we currently
> have?

Re: Flink's multi-user support

Posted by Stephan Ewen <se...@apache.org>.

On first thought, the sessions and the multi-job vs. job queue question are
almost two separate issues.

Can you add the sessions without removing the concurrent jobs we currently
have?
