Posted to dev@flink.apache.org by Stephan Ewen <se...@apache.org> on 2015/05/21 15:12:24 UTC

[DISCUSS] Dedicated streaming mode

Hi all!

A while back, we discussed introducing a dedicated streaming mode for
Flink. I would like to have a go at this and implement the changes, but
discuss them first.


Here is a brief summary of why we wanted to introduce the dedicated
streaming mode: even though both batch and streaming are executed by the
same execution engine, a streaming setup of Flink differs a bit from a
batch setup:

1) The streaming cluster starts an additional service to store the
distributed state snapshots.

2) Streaming mode uses memory a bit differently, so we should configure the
memory manager differently. This difference may eventually go away.



Concretely, to implement this, I was thinking about introducing the
following externally visible changes:

 - Additional scripts "start-streaming-cluster.sh" and
   "start-streaming-local.sh"

 - An execution mode parameter for the TaskManager ("batch / streaming")

 - An execution mode parameter for the JobManager ("batch / streaming")

 - All local executors and mini clusters need a flag that specifies whether
   they will start a streaming cluster or a pure batch cluster.
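
As a rough illustration of the execution mode parameter and the
mini-cluster flag listed above, the following Python sketch models the
idea. All names in it (ExecutionMode, parse_execution_mode,
MiniClusterConfig and its properties) are hypothetical and not taken from
Flink's code base. The per-mode settings simply follow points 1) and 2)
above, with lazy allocation for streaming assumed as discussed elsewhere in
this thread.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionMode(Enum):
    """Hypothetical cluster startup modes, mirroring "batch / streaming"."""
    BATCH = "batch"
    STREAMING = "streaming"


def parse_execution_mode(value: str) -> ExecutionMode:
    """Parse a mode string, e.g. one handed over by a startup script."""
    try:
        return ExecutionMode(value.strip().lower())
    except ValueError:
        raise ValueError(f"unknown execution mode: {value!r}") from None


@dataclass(frozen=True)
class MiniClusterConfig:
    """Hypothetical settings a local executor or mini cluster derives from the mode."""
    mode: ExecutionMode

    @property
    def start_snapshot_service(self) -> bool:
        # Point 1): only the streaming setup starts the state snapshot service.
        return self.mode is ExecutionMode.STREAMING

    @property
    def preallocate_managed_memory(self) -> bool:
        # Point 2): the memory manager is configured differently per mode.
        # Assumption: batch pre-allocates, streaming allocates lazily.
        return self.mode is ExecutionMode.BATCH
```

A startup script would then only need to pass one string, for example
MiniClusterConfig(parse_execution_mode("streaming")), and every component
could derive its behavior from that single value.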


Does anything else come to mind?


Greetings,
Stephan

Re: [DISCUSS] Dedicated streaming mode

Posted by Maximilian Michels <mx...@apache.org>.
Hi Henry!

I think the idea was to have a dedicated streaming mode as long as the
default cluster mode does not support batch and streaming equally well.
Once the dedicated streaming mode has reached that level, it will become
the default cluster mode. I share your doubts about whether it is a good
idea to advertise the streaming mode. It might lead people to think that a
Flink cluster can only run in one of the two modes.

Best,
Max

On Tue, May 26, 2015 at 8:53 PM, Henry Saputra <he...@gmail.com>
wrote:

> Ah yes, technically the streaming mode could run batch jobs as well in
> Flink.
> I am thinking that it could cause confusion for users, since most
> systems that do batch and streaming (well, pretty much Spark ^_^) do
> not differentiate the deployment topologies for the cluster to support
> different modes of applications.
>
> - Henry

Re: [DISCUSS] Dedicated streaming mode

Posted by Henry Saputra <he...@gmail.com>.
Ah yes, technically the streaming mode could run batch jobs as well in Flink.
I am thinking that it could cause confusion for users, since most
systems that do batch and streaming (well, pretty much Spark ^_^) do
not differentiate the deployment topologies for the cluster to support
different modes of applications.

- Henry

On Tue, May 26, 2015 at 11:44 AM, Stephan Ewen <se...@apache.org> wrote:
> The streaming mode runs batch jobs as well :-)
>
> There should be slightly reduced predictability in the memory management in
> the streaming mode, but otherwise there should not be a problem.
>
> So if you want to run mixed workloads, you start the streaming mode.
>
>
> (Note: Currently, the batch mode runs streaming jobs as well, but gives
> them very little memory. I am thinking of prohibiting that (separate
> discussion), to prevent people from not noticing that and running a highly
> sub-optimal Flink setup.)

Re: [DISCUSS] Dedicated streaming mode

Posted by Stephan Ewen <se...@apache.org>.
The streaming mode runs batch jobs as well :-)

There should be slightly reduced predictability in the memory management in
the streaming mode, but otherwise there should not be a problem.

So if you want to run mixed workloads, you start the streaming mode.


(Note: Currently, the batch mode runs streaming jobs as well, but gives
them very little memory. I am thinking of prohibiting that (separate
discussion), so that people do not unknowingly run a highly
sub-optimal Flink setup.)
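
To make the memory trade-off concrete, here is a toy Python sketch of a
segment pool that either pre-allocates all of its segments at startup (the
behavior assumed for batch mode) or allocates them lazily (as proposed for
streaming mode in this thread). It is a simplified model only: SegmentPool,
its parameters, and the segment sizes are invented for illustration and do
not reflect Flink's MemoryManager API.

```python
class SegmentPool:
    """Toy pool of fixed-size memory segments (hypothetical, not Flink's MemoryManager)."""

    def __init__(self, num_segments: int, segment_size: int, lazy: bool):
        self.num_segments = num_segments
        self.segment_size = segment_size
        self.lazy = lazy
        self.handed_out = 0
        # Eager mode reserves every segment at startup, so the memory is
        # guaranteed to be available later. Lazy mode reserves nothing up front.
        self.free = [] if lazy else [bytearray(segment_size) for _ in range(num_segments)]

    def reserved_bytes(self) -> int:
        """Bytes currently held by the pool itself (free segments only)."""
        return len(self.free) * self.segment_size

    def allocate(self) -> bytearray:
        """Hand out one segment, reusing a free one if the pool holds any."""
        if self.handed_out >= self.num_segments:
            raise MemoryError("pool budget exhausted")
        self.handed_out += 1
        if self.free:
            return self.free.pop()
        # Lazy path: allocate on demand. This can fail at runtime if user
        # code has already consumed the heap, hence the reduced predictability.
        return bytearray(self.segment_size)

    def release(self, segment: bytearray) -> None:
        """Return a segment to the pool so it can be reused."""
        self.handed_out -= 1
        self.free.append(segment)
```

With lazy=True, a pure streaming job leaves the whole heap to window
buffers and UDFs, while a batch job only finds out at allocation time
whether the memory it expects is still available.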


On Tue, May 26, 2015 at 8:26 PM, Henry Saputra <he...@gmail.com>
wrote:

> One immediate concern I have is the deployment topology. With
> streaming having its own cluster deployment, ops would have to know, in
> standalone mode, which mode to deploy Flink in: batch or streaming. So, if
> the use case is to support both batch and streaming, would that mean
> the deployment needs two separate clusters to support the different
> applications that run on Flink?
>
> I think this would be ok if Flink is deployed in YARN or other
> resource management platforms like Mesos or Apache Myriad. Maybe
> someone, like Robert, could confirm this is the case.
>
> - Henry

Re: [DISCUSS] Dedicated streaming mode

Posted by Henry Saputra <he...@gmail.com>.
One immediate concern I have is the deployment topology. With
streaming having its own cluster deployment, ops would have to know, in
standalone mode, which mode to deploy Flink in: batch or streaming. So, if
the use case is to support both batch and streaming, would that mean
the deployment needs two separate clusters to support the different
applications that run on Flink?

I think this would be ok if Flink is deployed in YARN or other
resource management platforms like Mesos or Apache Myriad. Maybe
someone, like Robert, could confirm this is the case.

- Henry

On Tue, May 26, 2015 at 1:51 AM, Maximilian Michels <mx...@apache.org> wrote:
> +1 great changes coming up! I like the idea that, ultimately, Flink will
> handle streaming and batch programs equally well independently of the
> chosen cluster startup mode.
>
> What is the time frame for these changes?
>
> On Tue, May 26, 2015 at 7:34 AM, Henry Saputra <he...@gmail.com>
> wrote:
>
>> Thanks Aljoscha and Stephan, this helps
>>
>> - Henry
>>
>> On Fri, May 22, 2015 at 4:37 AM, Stephan Ewen <se...@apache.org> wrote:
>> > Aljoscha is right. There are plans to migrate the streaming state to the
>> > MemoryManager as well, but streaming state is not managed at this point.
>> >
>> > What is managed in streaming jobs is the data buffered and cached in the
>> > network stack. But that is a different memory pool than the memory
>> > manager. We keep those pools separate because the network stack is
>> > currently more advanced in terms of dynamically rebalancing memory,
>> > compared to the memory manager.
>> >
>> > On Fri, May 22, 2015 at 12:25 PM, Aljoscha Krettek <al...@apache.org>
>> > wrote:
>> >
>> >> Hi,
>> >> streaming currently does not use any memory manager. All state is kept
>> >> in Java Objects on the Java Heap, for example an ArrayList<> for the
>> >> window buffer.
>> >>
>> >> On Thu, May 21, 2015 at 11:56 PM, Henry Saputra <henry.saputra@gmail.com>
>> >> wrote:
>> >> > Hi Stephan, Gyula, Paris,
>> >> >
>> >> > How does streaming currently differ in terms of memory management?
>> >> > Currently we only have one MemoryManager, which is used by both modes, I
>> >> > believe.
>> >> >
>> >> > - Henry
>> >> >
>> >> > On Thu, May 21, 2015 at 12:34 PM, Stephan Ewen <se...@apache.org>
>> >> > wrote:
>> >> >> I discussed a bit via Skype with Gyula and Paris.
>> >> >>
>> >> >>
>> >> >> We thought about the following way to do it:
>> >> >>
>> >> >>  - We add a dedicated streaming mode for now. The streaming mode
>> >> >>    supersedes the batch mode, so it can run both types of programs.
>> >> >>
>> >> >>  - The streaming mode sets the memory manager to "lazy allocation".
>> >> >>     -> So long as it runs pure streaming jobs, the full heap will be
>> >> >>        available to window buffers and UDFs.
>> >> >>     -> Batch programs can still run, so mixed workloads are not
>> >> >>        prevented. Batch programs are a bit less robust there, because
>> >> >>        the memory manager does not pre-allocate memory. UDFs can eat
>> >> >>        into Flink's memory portion.
>> >> >>
>> >> >>  - The streaming mode starts the necessary configured
>> >> >>    components/services for state backups
>> >> >>
>> >> >>
>> >> >>
>> >> >> Over the next versions, we want to bring these things together:
>> >> >>   - use the managed memory for window buffers
>> >> >>   - on-demand starting of the state backend
>> >> >>
>> >> >> Then, we deprecate the streaming mode, let both modes start the
>> >> >> cluster in the same way.
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, May 21, 2015 at 4:01 PM, Aljoscha Krettek <aljoscha@apache.org>
>> >> >> wrote:
>> >> >>
>> >> >>> Would it not be possible to start the snapshot service once the user
>> >> >>> starts the first streaming job? About 2): with checkpointing coming
>> >> >>> up, would it not make sense to shift to managed memory sooner rather
>> >> >>> than later? Then this point would become moot.
>> >> >>>
>> >> >>> On Thu, May 21, 2015 at 3:47 PM, Matthias J. Sax
>> >> >>> <mj...@informatik.hu-berlin.de> wrote:
>> >> >>> > What would be the consequences for "mixed" programs? (If there is
>> >> >>> > any plan to support those?)
>> >> >>> >
>> >> >>> > Would it be necessary to have a third mode? Or would those
>> >> >>> > programs simply run in streaming mode?
>> >> >>> >
>> >> >>> > -Matthias

Re: [DISCUSS] Dedicated streaming mode

Posted by Maximilian Michels <mx...@apache.org>.
+1 great changes coming up! I like the idea that, ultimately, Flink will
handle streaming and batch programs equally well independently of the
chosen cluster startup mode.

What is the time frame for these changes?

On Tue, May 26, 2015 at 7:34 AM, Henry Saputra <he...@gmail.com>
wrote:

> Thanks Aljoscha and Stephan, this helps
>
> - Henry

Re: [DISCUSS] Dedicated streaming mode

Posted by Henry Saputra <he...@gmail.com>.
Thanks Aljoscha and Stephan, this helps

- Henry

On Fri, May 22, 2015 at 4:37 AM, Stephan Ewen <se...@apache.org> wrote:
> Aljoscha is right. There are plans to migrate the streaming state to the
> MemoryManager as well, but streaming state is not managed at this point.
>
> What is managed in streaming jobs is the data buffered and cached in the
> network stack. But that is a different memory pool than the memory manager.
> We keep those pools separate because the network stack is currently more
> advanced in terms of dynamically rebalancing memory, compared to the memory
> manager.

Re: [DISCUSS] Dedicated streaming mode

Posted by Stephan Ewen <se...@apache.org>.
Aljoscha is right. There are plans to migrate the streaming state to the
MemoryManager as well, but streaming state is not managed at this point.

What is managed in streaming jobs is the data buffered and cached in the
network stack. But that is a different memory pool than the memory manager.
We keep those pools separate because the network stack is currently more
advanced in terms of dynamically rebalancing memory, compared to the memory
manager.

On Fri, May 22, 2015 at 12:25 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
> streaming currently does not use any memory manager. All state is kept
> in Java Objects on the Java Heap, for example an ArrayList<> for the
> window buffer.

Re: [DISCUSS] Dedicated streaming mode

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
streaming currently does not use any memory manager. All state is kept
in Java Objects on the Java Heap, for example an ArrayList<> for the
window buffer.
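
As a rough illustration of what "state on the Java heap" means here, the
sketch below keeps a count-based window in a plain ArrayList. The class name
HeapWindowBuffer and its methods are made up for this example and are not
Flink classes. The only point is that every buffered element is an ordinary
object that the garbage collector accounts for, not a memory manager.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical count-based window buffer; not part of Flink. */
final class HeapWindowBuffer<T> {
    private final int windowSize;
    // All buffered elements live as regular objects on the JVM heap.
    private final List<T> buffer = new ArrayList<>();

    HeapWindowBuffer(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /**
     * Adds an element. Returns the completed window once windowSize
     * elements have been collected, otherwise null.
     */
    List<T> add(T element) {
        buffer.add(element);
        if (buffer.size() < windowSize) {
            return null;
        }
        List<T> window = new ArrayList<>(buffer);
        buffer.clear();
        return window;
    }

    /** Number of elements currently held on the heap. */
    int bufferedElements() {
        return buffer.size();
    }
}
```

Nothing in this buffer limits how much heap a large window can take, which
is why moving such state under managed memory comes up in this thread.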

On Thu, May 21, 2015 at 11:56 PM, Henry Saputra <he...@gmail.com> wrote:
> Hi Stephan, Gyula, Paris,
>
> How does streaming currently differ in terms of memory management?
> Currently we only have one MemoryManager, which is used by both modes, I
> believe.
>
> - Henry

Re: [DISCUSS] Dedicated streaming mode

Posted by Henry Saputra <he...@gmail.com>.
Hi Stephan, Gyula, Paris,

How does streaming currently differ in terms of memory management?
Currently we only have one MemoryManager, which is used by both modes, I
believe.

- Henry

On Thu, May 21, 2015 at 12:34 PM, Stephan Ewen <se...@apache.org> wrote:
> I discussed a bit via Skype with Gyula and Paris.
>
>
> We thought about the following way to do it:
>
>  - We add a dedicated streaming mode for now. The streaming mode supersedes
> the batch mode, so it can run both types of programs.
>
>  - The streaming mode sets the memory manager to "lazy allocation".
>     -> So long as it runs pure streaming jobs, the full heap will be
> available to window buffers and UDFs.
>     -> Batch programs can still run, so mixed workloads are not prevented.
> Batch programs are a bit less robust there, because the memory manager does
> not pre-allocate memory. UDFs can eat into Flink's memory portion.
>
>  - The streaming mode starts the necessary configured components/services
> for state backups
>
>
>
> Over the next versions, we want to bring these things together:
>   - use the managed memory for window buffers
>   - on-demand starting of the state backend
>
> Then, we deprecate the streaming mode and let both modes start the cluster in
> the same way.

Re: [DISCUSS] Dedicated streaming mode

Posted by Gyula Fóra <gy...@apache.org>.
Huge +1 from my side :)

Sorry for the late response.

On Thu, May 21, 2015 at 9:54 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> This sounds very reasonable.

Re: [DISCUSS] Dedicated streaming mode

Posted by Aljoscha Krettek <al...@apache.org>.
This sounds very reasonable.
On May 21, 2015 9:34 PM, "Stephan Ewen" <se...@apache.org> wrote:

> I discussed a bit via Skype with Gyula and Paris.
>
>
> We thought about the following way to do it:
>
>  - We add a dedicated streaming mode for now. The streaming mode supersedes
> the batch mode, so it can run both types of programs.
>
>  - The streaming mode sets the memory manager to "lazy allocation".
>     -> So long as it runs pure streaming jobs, the full heap will be
> available to window buffers and UDFs.
>     -> Batch programs can still run, so mixed workloads are not prevented.
> Batch programs are a bit less robust there, because the memory manager does
> not pre-allocate memory. UDFs can eat into Flink's memory portion.
>
>  - The streaming mode starts the necessary configured components/services
> for state backups
>
>
>
> Over the next versions, we want to bring these things together:
>   - use the managed memory for window buffers
>   - on-demand starting of the state backend
>
> Then, we deprecate the streaming mode and let both modes start the cluster in
> the same way.

Re: [DISCUSS] Dedicated streaming mode

Posted by Stephan Ewen <se...@apache.org>.
I discussed a bit via Skype with Gyula and Paris.


We thought about the following way to do it:

 - We add a dedicated streaming mode for now. The streaming mode supersedes
the batch mode, so it can run both types of programs.

 - The streaming mode sets the memory manager to "lazy allocation".
    -> So long as it runs pure streaming jobs, the full heap will be
available to window buffers and UDFs.
    -> Batch programs can still run, so mixed workloads are not prevented.
Batch programs are a bit less robust there, because the memory manager does
not pre-allocate memory. UDFs can eat into Flink's memory portion.

 - The streaming mode starts the necessary configured components/services
for state backups



Over the next versions, we want to bring these things together:
  - use the managed memory for window buffers
  - on-demand starting of the state backend

Then, we deprecate the streaming mode and let both modes start the cluster in
the same way.
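
To make the difference between pre-allocation and "lazy allocation" concrete,
here is a minimal sketch. It assumes a made-up ClusterMode enum (mirroring
the proposed "batch / streaming" parameter) and a toy SegmentPool. Neither is
Flink's actual MemoryManager or configuration API; they only illustrate the
trade-off described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical cluster modes, mirroring the "batch / streaming" parameter. */
enum ClusterMode { BATCH, STREAMING }

/** Toy segment pool; a stand-in for the idea, not Flink's MemoryManager. */
final class SegmentPool {
    private final int segmentSize;
    private final int maxSegments;
    private final Deque<byte[]> free = new ArrayDeque<>();
    private int created = 0; // segments allocated so far

    SegmentPool(ClusterMode mode, int maxSegments, int segmentSize) {
        this.maxSegments = maxSegments;
        this.segmentSize = segmentSize;
        if (mode == ClusterMode.BATCH) {
            // Pre-allocation: reserve the whole budget up front.
            for (int i = 0; i < maxSegments; i++) {
                free.push(new byte[segmentSize]);
                created++;
            }
        }
        // STREAMING: lazy allocation, nothing is reserved yet.
    }

    /** Hands out a segment, allocating on demand in lazy mode. */
    byte[] request() {
        if (!free.isEmpty()) {
            return free.pop();
        }
        if (created < maxSegments) {
            created++;
            return new byte[segmentSize];
        }
        throw new IllegalStateException("memory budget exhausted");
    }

    /** Returns a segment to the pool for reuse. */
    void release(byte[] segment) {
        free.push(segment);
    }

    int createdSegments() {
        return created;
    }
}
```

In the lazy case the heap that the pool has not claimed stays available to
window buffers and UDFs. The flip side is that a batch job arriving later may
find that heap already in use, which is the reduced robustness mentioned
above.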





On Thu, May 21, 2015 at 4:01 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> Would it not be possible to start the snapshot service once the user
> starts the first streaming job? About 2): with checkpointing coming up,
> would it not make sense to shift to managed memory sooner rather than
> later? Then this point would become moot.

Re: [DISCUSS] Dedicated streaming mode

Posted by Aljoscha Krettek <al...@apache.org>.
Would it not be possible to start the snapshot service once the user
starts the first streaming job? About 2): with checkpointing coming up,
would it not make sense to shift to managed memory sooner rather than
later? Then this point would become moot.
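
A rough sketch of what starting the snapshot service on demand could look
like. Everything here is hypothetical (SketchCluster, JobType and the method
names are made up for illustration and are not existing Flink classes); it
only shows the "start on first streaming job" idea.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, NOT Flink code: all names are assumptions.
final class SketchCluster {
    enum JobType { BATCH, STREAMING }

    private final AtomicBoolean snapshotServiceStarted = new AtomicBoolean(false);
    private int serviceStarts = 0;

    void submit(JobType type) {
        // compareAndSet flips the flag exactly once, so the service is
        // started by the first streaming job only, even if several
        // streaming jobs are submitted concurrently.
        if (type == JobType.STREAMING
                && snapshotServiceStarted.compareAndSet(false, true)) {
            startSnapshotService();
        }
        // ... schedule the job as usual ...
    }

    private void startSnapshotService() {
        // Placeholder for bringing up the state snapshot store.
        serviceStarts++;
    }

    boolean snapshotServiceRunning() {
        return snapshotServiceStarted.get();
    }

    int snapshotServiceStarts() {
        return serviceStarts;
    }
}
```

A cluster that only ever sees batch jobs would never start the service, so
there would be no need to choose a mode at cluster startup for this point.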

On Thu, May 21, 2015 at 3:47 PM, Matthias J. Sax
<mj...@informatik.hu-berlin.de> wrote:
> What would be the consequences on "mixed" programs? (If there is any
> plan to support those?)
>
> Would it be necessary to have a third mode? Or would those programs
> simple run in streaming mode?
>
> -Matthias
>
> On 05/21/2015 03:12 PM, Stephan Ewen wrote:
>> Hi all!
>>
>> We discussed a while back about introducing a dedicated streaming mode for
>> Flink. I would like to take a go at this and implement the changes, but
>> discuss them before.
>>
>>
>> Here is a brief summary why we wanted to introduce the dedicated streaming
>> mode:
>> Even though both batch and streaming are executed by the same execution
>> engine,
>> a streaming setup of Flink varies a bit from a batch setup:
>>
>> 1) The streaming cluster starts an additional service to store the
>> distributed state snapshots.
>>
>> 2) Streaming mode uses memory a bit different, so we should configure the
>> memory manager differently. This difference may eventually go away.
>>
>>
>>
>> Concretely, to implement this, I was thinking about introducing the
>> following externally visible changes
>>
>>  - Additional scripts "start-streaming-cluster.sh" and
>> "start-streaming-local.sh"
>>
>>  - An execution mode parameter for the TaskManager ("batch / streaming")
>>
>>  - An execution mode parameter for the JobManager TaskManager ("batch /
>> streaming")
>>
>>  - All local executors and mini clusters need a flag that specifies whether
>> they will start
>>    a streaming cluster, or a pure batch cluster.
>>
>>
>> Anything else that comes to your minds?
>>
>>
>> Greetings,
>> Stephan
>>
>

Re: [DISCUSS] Dedicated streaming mode

Posted by "Matthias J. Sax" <mj...@informatik.hu-berlin.de>.
What would be the consequences for "mixed" programs (if there is any
plan to support those)?

Would it be necessary to have a third mode? Or would those programs
simply run in streaming mode?

-Matthias

On 05/21/2015 03:12 PM, Stephan Ewen wrote:
> Hi all!
> 
> We discussed a while back about introducing a dedicated streaming mode for
> Flink. I would like to take a go at this and implement the changes, but
> discuss them before.
> 
> 
> Here is a brief summary why we wanted to introduce the dedicated streaming
> mode:
> Even though both batch and streaming are executed by the same execution
> engine,
> a streaming setup of Flink varies a bit from a batch setup:
> 
> 1) The streaming cluster starts an additional service to store the
> distributed state snapshots.
> 
> 2) Streaming mode uses memory a bit different, so we should configure the
> memory manager differently. This difference may eventually go away.
> 
> 
> 
> Concretely, to implement this, I was thinking about introducing the
> following externally visible changes
> 
>  - Additional scripts "start-streaming-cluster.sh" and
> "start-streaming-local.sh"
> 
>  - An execution mode parameter for the TaskManager ("batch / streaming")
> 
>  - An execution mode parameter for the JobManager TaskManager ("batch /
> streaming")
> 
>  - All local executors and mini clusters need a flag that specifies whether
> they will start
>    a streaming cluster, or a pure batch cluster.
> 
> 
> Anything else that comes to your minds?
> 
> 
> Greetings,
> Stephan
>