Posted to dev@ignite.apache.org by Suminda Dharmasena <si...@sakrio.com> on 2015/04/26 15:06:28 UTC

Apache Flink and Spark Integration and Acceleration

Is it possible to consider deeper integration with Flink and Spark?

Re: [DISCUSS] Flink and Ignite integration

Posted by Fabian Hueske <fh...@gmail.com>.
That's a good question... We are still in the design phase for this feature.

Initially I would have said that replicated in-memory is what we want.
However, Flink is aiming to support long-running stream analytics (weeks,
months, ...) and it would be bad if state collected over such a long time
were lost. So some kind of disk persistence would be good for certain
use cases.



2015-04-29 1:28 GMT+02:00 Dmitriy Setrakyan <ds...@apache.org>:

> On Tue, Apr 28, 2015 at 5:55 PM, Fabian Hueske <fh...@gmail.com> wrote:
>
> > Thanks Cos for starting this discussion, hi to the Ignite community!
> >
> > The probably easiest and most straightforward integration of Flink and
> > Ignite would be to go through Ignite's IGFS. Flink can be easily extended
> > to support additional filesystems.
> >
> > However, the Flink community is currently also looking for a solution to
> > checkpoint operator state of running stream processing programs. Flink
> > processes data streams in real time, similar to Storm, i.e., it schedules
> > all operators of a streaming program and data is continuously flowing
> from
> > operator to operator. Instead of acknowledging each individual record,
> > Flink injects stream offset markers into the stream at regular intervals.
> > Whenever an operator receives such a marker it checkpoints its current
> > state (currently to the master with some limitations). In case of a
> > failure, the stream is replayed (using a replayable source such as Kafka)
> > from the last checkpoint that was not received by all sink operators and
> > all operator states are reset to that checkpoint.
> > We had already looked at Ignite and were wondering whether Ignite could
> be
> > used to reliably persist the state of streaming operators.
> >
>
> Fabian, do you need these checkpoints stored in memory (with optional
> redundant copies, of course) or on disk? I think in-memory makes a lot more
> sense from a performance standpoint, and can easily be done in Ignite.
>
>
> >
> > The other points I mentioned on Twitter are just rough ideas at the
> moment.
> >
> > Cheers, Fabian
> >
> > 2015-04-29 0:23 GMT+02:00 Dmitriy Setrakyan <ds...@apache.org>:
> >
> > > Thanks Cos.
> > >
> > > Hello Flink Community.
> > >
> > > From Ignite standpoint we definitely would be interested in providing
> > Flink
> > > processing API on top of Ignite Data Grid or IGFS. It would be
> > interesting
> > > to hear what steps would be required for such integration or if there
> are
> > > other integration points.
> > >
> > > D.
> > >
> > > On Tue, Apr 28, 2015 at 2:57 PM, Konstantin Boudnik <co...@apache.org>
> > > wrote:
> > >
> > > > Following the lively exchange on Twitter (sic!) I would like to bring
> > > > together
> > > > Ignite and Flink communities to discuss the benefits of the
> integration
> > > and
> > > > see where we can start it.
> > > >
> > > > We have this recently opened ticket
> > > >   https://issues.apache.org/jira/browse/IGNITE-813
> > > >
> > > > and Fabian has listed the following points:
> > > >
> > > >  1) data store
> > > >  2) parameter server for ML models
> > > >  3) Checkpointing streaming op state
> > > >  4) continuously updating views from streams
> > > >
> > > > I'd add
> > > >  5) using Ignite IGFS to speed up Flink's access to HDFS data.
> > > >
> > > > I see a lot of interesting correlations between the two projects and
> wonder
> > > if
> > > > Flink guys can step up with a few thoughts on where Flink can benefit
> > the
> > > > most
> > > > from Ignite's in-memory fabric architecture? Perhaps, it can be used
> as
> > > > in-memory storage where the other components of the stack can quickly
> > > > access
> > > > and work w/ the data w/o a need to dump it back to slow storage?
> > > >
> > > > Thoughts?
> > > >   Cos
> > > >
> > >
> >
>

Re: [DISCUSS] Flink and Ignite integration

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Hi Stephan,

Your suggestions are very interesting. I think we should pick a couple of
paths we can tackle with minimal effort and start there. I am happy to help
in getting this effort started (maybe we should have a Skype discussion?)

My comments are below...

D.

On Wed, Apr 29, 2015 at 3:35 AM, Stephan Ewen <se...@apache.org> wrote:

> Hi everyone!
>
> First of all, hello to the Ignite community and happy to hear that you are
> interested in collaborating!
>
> Building on what Fabian wrote, here is a list of efforts that we ourselves
> have started, or that would be useful.
>
> Let us know what you think!
>
> Stephan
>
>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as a FileSystem
>
> -------------------------------------------------------------------------------------------------------
>
> That should be the simplest addition. Flink integrates the FileSystem
> classes from Hadoop. If there is an Ignite version of that FileSystem
> class, you should
> be able to register it in a Hadoop config, point to the config in the Flink
> config and it should work out of the box.
>
> If Ignite does not yet have that FileSystem, it is easy to implement a
> Flink Filesystem.
>


Ignite implements Hadoop File System API. More info here:
http://apacheignite.readme.io/v1.0/docs/file-system
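In that case the Flink-side wiring would presumably be just a Hadoop configuration entry. A sketch of the relevant core-site.xml fragment, with property and class names taken from the Ignite 1.x Hadoop accelerator docs (treat them as assumptions to verify against the page linked above):

```xml
<!-- core-site.xml fragment: register IGFS as a Hadoop FileSystem.
     Property/class names are assumptions based on the Ignite 1.x docs. -->
<configuration>
  <property>
    <name>fs.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem</value>
  </property>
</configuration>
```

Once this config is referenced from the Flink configuration, Flink jobs could address paths with the igfs:// scheme like any other Hadoop filesystem.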



>
>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as a parameter server
>
> -------------------------------------------------------------------------------------------------------
>
> This is one approach that a contributor has started with. The goal is to
> store a large set of model parameters in a distributed fashion,
> such that all Flink TaskManagers can access them and update them
> (asynchronously).
>
> The core requirements here are:
>  - Fast put performance. Often, no consistency is needed, put operations
> may simply overwrite each other, some of them can even be tolerated to get
> lost
>  - Fast get performance, heavy local caching.
>


Sounds like a very straightforward integration with IgniteCache:
http://apacheignite.readme.io/v1.0/docs/jcache
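To make the access pattern concrete: a minimal sketch of the parameter-server semantics Stephan lists (relaxed-consistency puts, fast gets), with a plain ConcurrentHashMap standing in for the distributed IgniteCache. This illustrates the semantics only; it is not Ignite API usage.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Parameter-server semantics sketch: last-write-wins puts, fast gets.
 *  A ConcurrentHashMap stands in for a distributed IgniteCache. */
public class ParameterStoreSketch {
    private final Map<String, double[]> params = new ConcurrentHashMap<>();

    /** Relaxed put: concurrent updates may simply overwrite each other. */
    public void push(String key, double[] value) {
        params.put(key, value); // no locking, no versioning: last write wins
    }

    /** Fast get: a real deployment would add heavy local caching here. */
    public double[] pull(String key) {
        return params.getOrDefault(key, new double[0]);
    }

    public static void main(String[] args) {
        ParameterStoreSketch store = new ParameterStoreSketch();
        store.push("w1", new double[]{0.1, 0.2});
        store.push("w1", new double[]{0.3, 0.4}); // overwrite; tolerated by design
        System.out.println(store.pull("w1")[0]); // prints 0.3
    }
}
```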



>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as distributed backup for Streaming Operator State
>
> -------------------------------------------------------------------------------------------------------
>
> Flink periodically checkpoints the state of streaming operators. We are
> looking to have different backends to
> store the state to, and Ignite could be one of them.
>
> This would write periodically (say every 5 seconds) a chunk of binary data
> (possibly 100s of MB on every node) into Ignite.
>


Again, I think you can utilize either partitioned or replicated caches from
Ignite here.
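For reference, the barrier-driven checkpoint/restore cycle Fabian describes can be sketched independently of either system. Here a HashMap keyed by checkpoint id stands in for a partitioned or replicated Ignite cache; this is illustrative only, not Flink or Ignite API.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of barrier-style checkpointing for a counting operator.
 *  A HashMap keyed by checkpoint id stands in for an Ignite cache. */
public class CheckpointSketch {
    private final Map<Long, Long> backend = new HashMap<>(); // checkpointId -> state
    private long count = 0; // operator state: records seen so far

    public void onRecord() { count++; }

    /** Called when a stream offset marker (barrier) arrives: snapshot state. */
    public void onBarrier(long checkpointId) { backend.put(checkpointId, count); }

    /** On failure: reset state to a completed checkpoint, then replay the stream. */
    public void restore(long checkpointId) { count = backend.get(checkpointId); }

    public long state() { return count; }

    public static void main(String[] args) {
        CheckpointSketch op = new CheckpointSketch();
        op.onRecord(); op.onRecord();
        op.onBarrier(1);          // snapshot: state = 2
        op.onRecord();            // state = 3, not yet checkpointed
        op.restore(1);            // failure: roll back to checkpoint 1
        System.out.println(op.state()); // prints 2
    }
}
```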



>
>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as primary store for Streaming Operator State
>
> -------------------------------------------------------------------------------------------------------
>
> If we want to directly store the state of streaming computation in Ignite
> (rather than storing it in Flink and backing it
> up to Ignite), we have the following requirements:
>
>   - Massive put and get performance, up to millions per second per machine.
>   - No synchronous replication needed, replication can happen
> asynchronously in the background
>   - At certain points, Flink will request to get a signal once everything
> is properly replicated
>


This sounds like a good use case for Ignite Streaming, which loads large,
continuous amounts of streamed data into Ignite caches. Ignite has an
abstraction called "IgniteDataStreamer" that would satisfy your
requirements. It does everything asynchronously and can provide
notifications if needed. More info here:
http://apacheignite.readme.io/v1.0/docs/data-streamers
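As a rough illustration of those semantics (asynchronous, buffered puts plus an explicit completion signal), independent of the actual IgniteDataStreamer API:

```java
import java.util.Map;
import java.util.concurrent.*;

/** Sketch of streamer-style loading: puts are queued and applied
 *  asynchronously; flush() blocks until everything queued so far is applied,
 *  which corresponds to the "signal once everything is properly replicated".
 *  A ConcurrentHashMap stands in for an Ignite cache. */
public class StreamerSketch implements AutoCloseable {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Asynchronous put: returns immediately, the write happens in the background. */
    public Future<?> addData(String key, byte[] value) {
        return worker.submit(() -> cache.put(key, value));
    }

    /** Blocks until all previously submitted puts are applied (ordering fence). */
    public void flush() throws InterruptedException, ExecutionException {
        worker.submit(() -> { }).get(); // runs only after all queued puts finish
    }

    public int size() { return cache.size(); }

    @Override public void close() { worker.shutdown(); }
}
```

Usage would look like: submit many addData() calls, then call flush() at the point where the caller needs confirmation that the batch is durable.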



>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as distributed backup for intermediate results
>
> -------------------------------------------------------------------------------------------------------
>
> Flink may cache intermediate results for recovering or resuming computation
> at a certain point in the program. This would be similar to backing up
> streaming state. Once in a while, a giant put
> operation with GBs of binary data.
>
>

I think Ignite File System (IGFS) would be a perfect candidate for this. If
this approach does not work, then you can think about using Ignite caches
directly, but it may get a bit tricky if you plan to store objects of 1GB
each.
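On the "1GB objects" concern: a common workaround when using a key-value cache directly is to split a large blob into fixed-size chunks keyed by blob id and chunk index. A minimal sketch of that idea (illustrative names, plain HashMap standing in for the cache, not an Ignite API):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Sketch: store a large blob as fixed-size chunks keyed "blobId/chunkIdx",
 *  so no single cache entry has to hold GBs of data. */
public class ChunkedBlobStore {
    private final Map<String, byte[]> cache = new HashMap<>();
    private final int chunkSize;

    public ChunkedBlobStore(int chunkSize) { this.chunkSize = chunkSize; }

    /** Splits the blob into chunks and stores each; returns the chunk count. */
    public int put(String blobId, byte[] blob) {
        int chunks = (blob.length + chunkSize - 1) / chunkSize;
        for (int i = 0; i < chunks; i++) {
            int from = i * chunkSize, to = Math.min(from + chunkSize, blob.length);
            cache.put(blobId + "/" + i, Arrays.copyOfRange(blob, from, to));
        }
        return chunks;
    }

    /** Reassembles the blob from its chunks. */
    public byte[] get(String blobId, int chunks, int length) {
        byte[] out = new byte[length];
        for (int i = 0; i < chunks; i++) {
            byte[] part = cache.get(blobId + "/" + i);
            System.arraycopy(part, 0, out, i * chunkSize, part.length);
        }
        return out;
    }
}
```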



>
>
>
> -------------------------------------------------------------------------------------------------------
> Run Flink Batch Programs on Ignite's compute fabric.
>
> -------------------------------------------------------------------------------------------------------
>
> I think this would be interesting, and we can make this such that programs
> are binary compatible.
> Flink currently has multiple execution backends already: Flink local, Flink
> distributed, Tez, Java Collections.
> It is designed to be layered and pluggable.
>
> You as a programmer define the desired execution backend by choosing the
> corresponding ExecutionEnvironment,
> such as "ExecutionEnvironment.createLocalEnvironment()", or
> "ExecutionEnvironment.createCollectionsEnvironment()".
> If you look at the "execute()" methods, they take the Flink program and
> prepare it for execution in the corresponding backend.
>


Hm... this sounds *very* interesting. If I understand correctly, you are
suggesting that Ignite becomes one of the Flink backends, right? Is there a
basic example online or in the product for it, so I can gauge what it would
take?



>
>
>
>
> -------------------------------------------------------------------------------------------------------
> Run Flink Streaming Programs on Ignite's compute fabric.
>
> -------------------------------------------------------------------------------------------------------
>
> The execution mechanism for streaming programs is changing fast right now.
> I would postpone this for a few
> weeks until we have converged there.
>
>
Sounds good.


>
>
>
>
>
>
> On Wed, Apr 29, 2015 at 1:28 AM, Dmitriy Setrakyan <ds...@apache.org>
> wrote:
>
> > On Tue, Apr 28, 2015 at 5:55 PM, Fabian Hueske <fh...@gmail.com>
> wrote:
> >
> > > Thanks Cos for starting this discussion, hi to the Ignite community!
> > >
> > > The probably easiest and most straightforward integration of Flink and
> > > Ignite would be to go through Ignite's IGFS. Flink can be easily
> extended
> > > to support additional filesystems.
> > >
> > > However, the Flink community is currently also looking for a solution
> to
> > > checkpoint operator state of running stream processing programs. Flink
> > > processes data streams in real time, similar to Storm, i.e., it
> schedules
> > > all operators of a streaming program and data is continuously flowing
> > from
> > > operator to operator. Instead of acknowledging each individual record,
> > > Flink injects stream offset markers into the stream at regular
> intervals.
> > > Whenever an operator receives such a marker it checkpoints its current
> > > state (currently to the master with some limitations). In case of a
> > > failure, the stream is replayed (using a replayable source such as
> Kafka)
> > > from the last checkpoint that was not received by all sink operators
> and
> > > all operator states are reset to that checkpoint.
> > > We had already looked at Ignite and were wondering whether Ignite could
> > be
> > > used to reliably persist the state of streaming operators.
> > >
> >
> > Fabian, do you need these checkpoints stored in memory (with optional
> > redundant copies, of course) or on disk? I think in-memory makes a lot
> more
> > sense from a performance standpoint, and can easily be done in Ignite.
> >
> >
> > >
> > > The other points I mentioned on Twitter are just rough ideas at the
> > moment.
> > >
> > > Cheers, Fabian
> > >
> > > 2015-04-29 0:23 GMT+02:00 Dmitriy Setrakyan <ds...@apache.org>:
> > >
> > > > Thanks Cos.
> > > >
> > > > Hello Flink Community.
> > > >
> > > > From Ignite standpoint we definitely would be interested in providing
> > > Flink
> > > > processing API on top of Ignite Data Grid or IGFS. It would be
> > > interesting
> > > > to hear what steps would be required for such integration or if there
> > are
> > > > other integration points.
> > > >
> > > > D.
> > > >
> > > > On Tue, Apr 28, 2015 at 2:57 PM, Konstantin Boudnik <co...@apache.org>
> > > > wrote:
> > > >
> > > > > Following the lively exchange in Twitter (sic!) I would like to
> bring
> > > > > together
> > > > > Ignite and Flink communities to discuss the benefits of the
> > integration
> > > > and
> > > > > see where we can start it.
> > > > >
> > > > > We have this recently opened ticket
> > > > >   https://issues.apache.org/jira/browse/IGNITE-813
> > > > >
> > > > > and Fabian has listed the following points:
> > > > >
> > > > >  1) data store
> > > > >  2) parameter server for ML models
> > > > >  3) Checkpointing streaming op state
> > > > >  4) continuously updating views from streams
> > > > >
> > > > > I'd add
> > > > >  5) using Ignite IGFS to speed up Flink's access to HDFS data.
> > > > >
> > > > > I see a lot of interesting correlations between two projects and
> > wonder
> > > > if
> > > > > Flink guys can step up with a few thoughts on where Flink can
> benefit
> > > the
> > > > > most
> > > > > from Ignite's in-memory fabric architecture? Perhaps, it can be
> used
> > as
> > > > > in-memory storage where the other components of the stack can
> quickly
> > > > > access
> > > > > and work w/ the data w/o a need to dump it back to slow storage?
> > > > >
> > > > > Thoughts?
> > > > >   Cos
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Flink and Ignite integration

Posted by Stephan Ewen <se...@apache.org>.
Hi everyone!

First of all, hello to the Ignite community and happy to hear that you are
interested in collaborating!

Building on what Fabian wrote, here is a list of efforts that we ourselves
have started, or that would be useful.

Let us know what you think!

Stephan


-------------------------------------------------------------------------------------------------------
Ignite as a FileSystem
-------------------------------------------------------------------------------------------------------

That should be the simplest addition. Flink integrates the FileSystem
classes from Hadoop. If there is an Ignite version of that FileSystem
class, you should
be able to register it in a Hadoop config, point to the config in the Flink
config and it should work out of the box.

If Ignite does not yet have that FileSystem, it is easy to implement a
Flink Filesystem.


-------------------------------------------------------------------------------------------------------
Ignite as a parameter server
-------------------------------------------------------------------------------------------------------

This is one approach that a contributor has started with. The goal is to
store a large set of model parameters in a distributed fashion,
such that all Flink TaskManagers can access them and update them
(asynchronously).

The core requirements here are:
 - Fast put performance. Often, no consistency is needed, put operations
may simply overwrite each other, some of them can even be tolerated to get
lost
 - Fast get performance, heavy local caching.


-------------------------------------------------------------------------------------------------------
Ignite as distributed backup for Streaming Operator State
-------------------------------------------------------------------------------------------------------

Flink periodically checkpoints the state of streaming operators. We are
looking to have different backends to
store the state to, and Ignite could be one of them.

This would write periodically (say every 5 seconds) a chunk of binary data
(possibly 100s of MB on every node) into Ignite.


-------------------------------------------------------------------------------------------------------
Ignite as primary store for Streaming Operator State
-------------------------------------------------------------------------------------------------------

If we want to directly store the state of streaming computation in Ignite
(rather than storing it in Flink and backing it
up to Ignite), we have the following requirements:

  - Massive put and get performance, up to millions per second per machine.
  - No synchronous replication needed, replication can happen
asynchronously in the background
  - At certain points, Flink will request to get a signal once everything
is properly replicated


-------------------------------------------------------------------------------------------------------
Ignite as distributed backup for intermediate results
-------------------------------------------------------------------------------------------------------

Flink may cache intermediate results for recovering or resuming computation
at a certain point in the program. This would be similar to backing up
streaming state. Once in a while, a giant put
operation with GBs of binary data.



-------------------------------------------------------------------------------------------------------
Run Flink Batch Programs on Ignite's compute fabric.
-------------------------------------------------------------------------------------------------------

I think this would be interesting, and we can make this such that programs
are binary compatible.
Flink currently has multiple execution backends already: Flink local, Flink
distributed, Tez, Java Collections.
It is designed layerd and pluggable

You as a programmer define the desired execution backend by chosing the
corresponding ExecutionEnvironment,
such as "ExecutionEnvironemtn.createLocalEnvironement()", or
"ExecutionEnvironemtn.createCollectionsEnvironement()"
If you look at the "execute()" methods, they take the Flink program and
prepares it for execution in the corresponding backend.



-------------------------------------------------------------------------------------------------------
Run Flink Streaming Programs on Ignite's compute fabric.
-------------------------------------------------------------------------------------------------------

The execution mechanism for streaming programs is changing fast right now.
I would postpone this for a few
weeks until we have converged there.







On Wed, Apr 29, 2015 at 1:28 AM, Dmitriy Setrakyan <ds...@apache.org>
wrote:

> On Tue, Apr 28, 2015 at 5:55 PM, Fabian Hueske <fh...@gmail.com> wrote:
>
> > Thanks Cos for starting this discussion, hi to the Ignite community!
> >
> > The probably easiest and most straightforward integration of Flink and
> > Ignite would be to go through Ignite's IGFS. Flink can be easily extended
> > to support additional filesystems.
> >
> > However, the Flink community is currently also looking for a solution to
> > checkpoint operator state of running stream processing programs. Flink
> > processes data streams in real time similar to Storm, i.e., it schedules
> > all operators of a streaming program and data is continuously flowing
> from
> > operator to operator. Instead of acknowledging each individual record,
> > Flink injects stream offset markers into the stream in regular intervals.
> > Whenever, an operator receives such a marker it checkpoints its current
> > state (currently to the master with some limitations). In case of a
> > failure, the stream is replayed (using a replayable source such as Kafka)
> > from the last checkpoint that was not received by all sink operators and
> > all operator states are reset to that checkpoint.
> > We had already looked at Ignite and were wondering whether Ignite could
> be
> > used to reliably persist the state of streaming operator.
> >
>
> Fabian, do you need these checkpoints stored in memory (with optional
> redundant copies, or course) or on disk? I think in-memory makes a lot more
> sense from performance standpoint, and can easily be done in Ignite.
>
>
> >
> > The other points I mentioned on Twitter are just rough ideas at the
> moment.
> >
> > Cheers, Fabian
> >
> > 2015-04-29 0:23 GMT+02:00 Dmitriy Setrakyan <ds...@apache.org>:
> >
> > > Thanks Cos.
> > >
> > > Hello Flink Community.
> > >
> > > From Ignite standpoint we definitely would be interested in providing
> > Flink
> > > processing API on top of Ignite Data Grid or IGFS. It would be
> > interesting
> > > to hear what steps would be required for such integration or if there
> are
> > > other integration points.
> > >
> > > D.
> > >
> > > On Tue, Apr 28, 2015 at 2:57 PM, Konstantin Boudnik <co...@apache.org>
> > > wrote:
> > >
> > > > Following the lively exchange in Twitter (sic!) I would like to bring
> > > > together
> > > > Ignite and Flink communities to discuss the benefits of the
> integration
> > > and
> > > > see where we can start it.
> > > >
> > > > We have this recently opened ticket
> > > >   https://issues.apache.org/jira/browse/IGNITE-813
> > > >
> > > > and Fabian has listed the following points:
> > > >
> > > >  1) data store
> > > >  2) parameter server for ML models
> > > >  3) Checkpointing streaming op state
> > > >  4) continuously updating views from streams
> > > >
> > > > I'd add
> > > >  5) using Ignite IGFS to speed up Flink's access to HDFS data.
> > > >
> > > > I see a lot of interesting correlations between two projects and
> wonder
> > > if
> > > > Flink guys can step up with a few thoughts on where Flink can benefit
> > the
> > > > most
> > > > from Ignite's in-memory fabric architecture? Perhaps, it can be used
> as
> > > > in-memory storage where the other components of the stack can quickly
> > > > access
> > > > and work w/ the data w/o a need to dump it back to slow storage?
> > > >
> > > > Thoughts?
> > > >   Cos
> > > >
> > >
> >
>

Re: [DISCUSS] Flink and Ignite integration

Posted by Stephan Ewen <se...@apache.org>.
Hi everyone!

First of all, hello to the Ignite community and happy to hear that you are
interested in collaborating!

Building on what Fabian wrote, here is a list of efforts that we ourselves
have started, or that would be useful.

Let us know what you think!

Stephan


-------------------------------------------------------------------------------------------------------
Ignite as a FileSystem
-------------------------------------------------------------------------------------------------------

That should be the simplest addition. Flink integrates the FileSystem
classes from Hadoop. If there is an Ignite version of that FileSystem
class, you should be able to register it in a Hadoop config, point to the
config in the Flink config, and it should work out of the box.

If Ignite does not yet have that FileSystem, it is easy to implement a
Flink FileSystem.
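As a rough illustration of that plugin mechanism (simplified and hypothetical; Flink actually delegates to Hadoop's FileSystem classes, and all names below are made up), the framework essentially needs a scheme-to-implementation registry:

```python
class FileSystem:
    """Minimal filesystem SPI sketch: a storage layer such as IGFS
    plugs in by implementing read/write and registering a URI scheme."""
    def read(self, path):
        raise NotImplementedError

    def write(self, path, data):
        raise NotImplementedError


class InMemoryFS(FileSystem):
    """Stand-in for an igfs:// implementation backed by a data grid."""
    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


REGISTRY = {}

def register(scheme, fs):
    REGISTRY[scheme] = fs

def resolve(uri):
    # The framework picks the implementation purely by URI scheme, which
    # is why a new filesystem "works out of the box" once registered.
    scheme, _, path = uri.partition("://")
    return REGISTRY[scheme], path


register("igfs", InMemoryFS())
fs, path = resolve("igfs://tmp/input.txt")
fs.write(path, b"hello")
assert resolve("igfs://tmp/input.txt")[0].read("tmp/input.txt") == b"hello"
```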


-------------------------------------------------------------------------------------------------------
Ignite as a parameter server
-------------------------------------------------------------------------------------------------------

This is one approach that a contributor has started with. The goal is to
store a large set of model parameters in a distributed fashion,
such that all Flink TaskManagers can access them and update them
(asynchronously).

The core requirements here are:
 - Fast put performance. Often no consistency is needed; put operations
may simply overwrite each other, and losing some of them can even be
tolerated.
 - Fast get performance, with heavy local caching.
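A toy model of those semantics (hypothetical names, not Ignite's API): puts are blind last-write-wins overwrites, and workers read through a local cache that may serve stale values until invalidated.

```python
import threading

class ParameterStore:
    """Shared store with last-write-wins puts: no read-modify-write,
    no versioning -- concurrent updates simply overwrite each other."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:          # guards dict integrity only, not ordering
            self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class CachingClient:
    """Worker-side view with heavy local caching: reads hit the cache
    first and fall through to the shared store only on a miss."""
    def __init__(self, store):
        self._store = store
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]

    def invalidate(self):
        self._cache.clear()


store = ParameterStore()
store.put("w1", 0.5)
client = CachingClient(store)
assert client.get("w1") == 0.5
store.put("w1", 0.7)              # update from another worker
assert client.get("w1") == 0.5    # stale read is tolerated
client.invalidate()
assert client.get("w1") == 0.7
```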


-------------------------------------------------------------------------------------------------------
Ignite as distributed backup for Streaming Operator State
-------------------------------------------------------------------------------------------------------

Flink periodically checkpoints the state of streaming operators. We are
looking to support different backends for storing that state, and Ignite
could be one of them.

This would periodically (say, every 5 seconds) write a chunk of binary data
(possibly 100s of MB on every node) into Ignite.
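A sketch of what such a backend interface could look like (hypothetical names; this is not a proposed Flink or Ignite API): the backend only ever sees opaque binary blobs keyed by operator and checkpoint id.

```python
import pickle

class CheckpointBackend:
    """Stand-in for a pluggable state backend (Ignite could be one
    implementation): stores opaque blobs keyed by (operator, checkpoint)."""
    def __init__(self):
        self._blobs = {}

    def put_blob(self, operator_id, checkpoint_id, data):
        self._blobs[(operator_id, checkpoint_id)] = data

    def get_blob(self, operator_id, checkpoint_id):
        return self._blobs[(operator_id, checkpoint_id)]


def checkpoint(backend, operator_id, checkpoint_id, state):
    # Serialize the operator state into one binary chunk and push it to
    # the backend -- in the real setting this could be 100s of MB per node.
    backend.put_blob(operator_id, checkpoint_id, pickle.dumps(state))

def restore(backend, operator_id, checkpoint_id):
    return pickle.loads(backend.get_blob(operator_id, checkpoint_id))


backend = CheckpointBackend()
state = {"window_counts": {"a": 3, "b": 7}}
checkpoint(backend, "map-1", 42, state)
assert restore(backend, "map-1", 42) == state
```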


-------------------------------------------------------------------------------------------------------
Ignite as a primary store for Streaming Operator State
-------------------------------------------------------------------------------------------------------

If we want to directly store the state of streaming computation in Ignite
(rather than storing it in Flink and backing it
up to Ignite), we have the following requirements:

  - Massive put and get performance, up to millions of operations per
second per machine.
  - No synchronous replication needed; replication can happen
asynchronously in the background.
  - At certain points, Flink will request a signal once everything is
properly replicated.
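These three requirements can be pictured with a toy write-behind store (hypothetical and single-process; a real grid replicates across nodes): puts return immediately, a background thread replicates them, and flush() is the "signal" that everything written so far is safely replicated.

```python
import queue
import threading

class AsyncReplicatedStore:
    """Puts complete against the primary copy immediately; a background
    thread copies them to a backup. flush() blocks until the backlog is
    drained -- the point where Flink could confirm a checkpoint."""
    def __init__(self):
        self.primary = {}
        self.backup = {}
        self._pending = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def put(self, key, value):
        self.primary[key] = value      # no synchronous replication
        self._pending.put((key, value))

    def _replicate(self):
        while True:
            key, value = self._pending.get()
            self.backup[key] = value
            self._pending.task_done()

    def flush(self):
        self._pending.join()           # wait until fully replicated


store = AsyncReplicatedStore()
for i in range(1000):
    store.put(i, i * i)
store.flush()                          # the "properly replicated" signal
assert store.backup == store.primary
```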


-------------------------------------------------------------------------------------------------------
Ignite as distributed backup for intermediate results
-------------------------------------------------------------------------------------------------------

Flink may cache intermediate results for recovering or resuming computation
at a certain point in the program. This would be similar to backing up
streaming state: once in a while, a giant put operation with GBs of binary
data.
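Since a single intermediate result can be GBs, it would likely be split into cache-friendly chunks rather than stored as one entry; a minimal sketch (hypothetical layout, tiny sizes for illustration):

```python
def put_chunked(store, result_id, data, chunk_size):
    """Split one large intermediate result into fixed-size chunks so a
    multi-GB put does not become a single huge cache entry."""
    count = 0
    for off in range(0, len(data), chunk_size):
        store[(result_id, count)] = data[off:off + chunk_size]
        count += 1
    store[(result_id, "chunks")] = count

def get_chunked(store, result_id):
    count = store[(result_id, "chunks")]
    return b"".join(store[(result_id, i)] for i in range(count))


store = {}
blob = bytes(range(256)) * 10          # small stand-in for GBs of data
put_chunked(store, "job1/op3", blob, chunk_size=1000)
assert get_chunked(store, "job1/op3") == blob
assert store[("job1/op3", "chunks")] == 3
```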



-------------------------------------------------------------------------------------------------------
Run Flink Batch Programs on Ignite's compute fabric.
-------------------------------------------------------------------------------------------------------

I think this would be interesting, and we can make this such that programs
are binary compatible.
Flink already has multiple execution backends: Flink local, Flink
distributed, Tez, and Java Collections.
It is designed to be layered and pluggable.

As a programmer, you define the desired execution backend by choosing the
corresponding ExecutionEnvironment,
such as "ExecutionEnvironment.createLocalEnvironment()" or
"ExecutionEnvironment.createCollectionsEnvironment()".
If you look at the "execute()" methods, they take the Flink program and
prepare it for execution in the corresponding backend.
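The factory pattern described above can be sketched like this (the names echo Flink's ExecutionEnvironment API, but the classes here are simplified stand-ins, not Flink code): the same program runs unchanged on whichever backend the environment represents, which is exactly the hook an Ignite backend would use.

```python
class ExecutionEnvironment:
    """The program is written once against this interface; execute()
    hands it to whichever backend the concrete environment represents."""
    def execute(self, program):
        raise NotImplementedError


class CollectionsEnvironment(ExecutionEnvironment):
    def execute(self, program):
        return program(list(range(10)))   # run on plain local collections


class LocalEnvironment(ExecutionEnvironment):
    def execute(self, program):
        # A real backend would translate and submit the program to a
        # (local) cluster; here it just runs it in-process.
        return program(list(range(10)))


def create_environment(name):
    registry = {"collections": CollectionsEnvironment,
                "local": LocalEnvironment}
    return registry[name]()


count_odds = lambda data: sum(x % 2 for x in data)  # the "program"
assert create_environment("collections").execute(count_odds) == 5
assert create_environment("local").execute(count_odds) == 5
```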



-------------------------------------------------------------------------------------------------------
Run Flink Streaming Programs on Ignite's compute fabric.
-------------------------------------------------------------------------------------------------------

The execution mechanism for streaming programs is changing fast right now.
I would postpone this for a few
weeks until we have converged there.







On Wed, Apr 29, 2015 at 1:28 AM, Dmitriy Setrakyan <ds...@apache.org>
wrote:

> On Tue, Apr 28, 2015 at 5:55 PM, Fabian Hueske <fh...@gmail.com> wrote:
>
> > Thanks Cos for starting this discussion, hi to the Ignite community!
> >
> > The probably easiest and most straightforward integration of Flink and
> > Ignite would be to go through Ignite's IGFS. Flink can be easily extended
> > to support additional filesystems.
> >
> > However, the Flink community is currently also looking for a solution to
> > checkpoint operator state of running stream processing programs. Flink
> > processes data streams in real time similar to Storm, i.e., it schedules
> > all operators of a streaming program and data is continuously flowing
> from
> > operator to operator. Instead of acknowledging each individual record,
> > Flink injects stream offset markers into the stream in regular intervals.
> > Whenever, an operator receives such a marker it checkpoints its current
> > state (currently to the master with some limitations). In case of a
> > failure, the stream is replayed (using a replayable source such as Kafka)
> > from the last checkpoint that was not received by all sink operators and
> > all operator states are reset to that checkpoint.
> > We had already looked at Ignite and were wondering whether Ignite could
> be
> > used to reliably persist the state of streaming operator.
> >
>
> Fabian, do you need these checkpoints stored in memory (with optional
> redundant copies, or course) or on disk? I think in-memory makes a lot more
> sense from performance standpoint, and can easily be done in Ignite.
>
>
> >
> > The other points I mentioned on Twitter are just rough ideas at the
> moment.
> >
> > Cheers, Fabian
> >
> > 2015-04-29 0:23 GMT+02:00 Dmitriy Setrakyan <ds...@apache.org>:
> >
> > > Thanks Cos.
> > >
> > > Hello Flink Community.
> > >
> > > From Ignite standpoint we definitely would be interested in providing
> > Flink
> > > processing API on top of Ignite Data Grid or IGFS. It would be
> > interesting
> > > to hear what steps would be required for such integration or if there
> are
> > > other integration points.
> > >
> > > D.
> > >
> > > On Tue, Apr 28, 2015 at 2:57 PM, Konstantin Boudnik <co...@apache.org>
> > > wrote:
> > >
> > > > Following the lively exchange in Twitter (sic!) I would like to bring
> > > > together
> > > > Ignite and Flink communities to discuss the benefits of the
> integration
> > > and
> > > > see where we can start it.
> > > >
> > > > We have this recently opened ticket
> > > >   https://issues.apache.org/jira/browse/IGNITE-813
> > > >
> > > > and Fabian has listed the following points:
> > > >
> > > >  1) data store
> > > >  2) parameter server for ML models
> > > >  3) Checkpointing streaming op state
> > > >  4) continuously updating views from streams
> > > >
> > > > I'd add
> > > >  5) using Ignite IGFS to speed up Flink's access to HDFS data.
> > > >
> > > > I see a lot of interesting correlations between two projects and
> wonder
> > > if
> > > > Flink guys can step up with a few thoughts on where Flink can benefit
> > the
> > > > most
> > > > from Ignite's in-memory fabric architecture? Perhaps, it can be used
> as
> > > > in-memory storage where the other components of the stack can quickly
> > > > access
> > > > and work w/ the data w/o a need to dump it back to slow storage?
> > > >
> > > > Thoughts?
> > > >   Cos
> > > >
> > >
> >
>

Re: [DISCUSS] Flink and Ignite integration

Posted by Dmitriy Setrakyan <ds...@apache.org>.
On Tue, Apr 28, 2015 at 5:55 PM, Fabian Hueske <fh...@gmail.com> wrote:

> Thanks Cos for starting this discussion, hi to the Ignite community!
>
> The probably easiest and most straightforward integration of Flink and
> Ignite would be to go through Ignite's IGFS. Flink can be easily extended
> to support additional filesystems.
>
> However, the Flink community is currently also looking for a solution to
> checkpoint operator state of running stream processing programs. Flink
> processes data streams in real time similar to Storm, i.e., it schedules
> all operators of a streaming program and data is continuously flowing from
> operator to operator. Instead of acknowledging each individual record,
> Flink injects stream offset markers into the stream in regular intervals.
> Whenever, an operator receives such a marker it checkpoints its current
> state (currently to the master with some limitations). In case of a
> failure, the stream is replayed (using a replayable source such as Kafka)
> from the last checkpoint that was not received by all sink operators and
> all operator states are reset to that checkpoint.
> We had already looked at Ignite and were wondering whether Ignite could be
> used to reliably persist the state of streaming operator.
>

Fabian, do you need these checkpoints stored in memory (with optional
redundant copies, of course) or on disk? I think in-memory makes a lot more
sense from a performance standpoint, and can easily be done in Ignite.


>
> The other points I mentioned on Twitter are just rough ideas at the moment.
>
> Cheers, Fabian
>
> 2015-04-29 0:23 GMT+02:00 Dmitriy Setrakyan <ds...@apache.org>:
>
> > Thanks Cos.
> >
> > Hello Flink Community.
> >
> > From Ignite standpoint we definitely would be interested in providing
> Flink
> > processing API on top of Ignite Data Grid or IGFS. It would be
> interesting
> > to hear what steps would be required for such integration or if there are
> > other integration points.
> >
> > D.
> >
> > On Tue, Apr 28, 2015 at 2:57 PM, Konstantin Boudnik <co...@apache.org>
> > wrote:
> >
> > > Following the lively exchange in Twitter (sic!) I would like to bring
> > > together
> > > Ignite and Flink communities to discuss the benefits of the integration
> > and
> > > see where we can start it.
> > >
> > > We have this recently opened ticket
> > >   https://issues.apache.org/jira/browse/IGNITE-813
> > >
> > > and Fabian has listed the following points:
> > >
> > >  1) data store
> > >  2) parameter server for ML models
> > >  3) Checkpointing streaming op state
> > >  4) continuously updating views from streams
> > >
> > > I'd add
> > >  5) using Ignite IGFS to speed up Flink's access to HDFS data.
> > >
> > > I see a lot of interesting correlations between two projects and wonder
> > if
> > > Flink guys can step up with a few thoughts on where Flink can benefit
> the
> > > most
> > > from Ignite's in-memory fabric architecture? Perhaps, it can be used as
> > > in-memory storage where the other components of the stack can quickly
> > > access
> > > and work w/ the data w/o a need to dump it back to slow storage?
> > >
> > > Thoughts?
> > >   Cos
> > >
> >
>


Re: [DISCUSS] Flink and Ignite integration

Posted by Fabian Hueske <fh...@gmail.com>.
Thanks Cos for starting this discussion, hi to the Ignite community!

Probably the easiest and most straightforward integration of Flink and
Ignite would be to go through Ignite's IGFS. Flink can easily be extended
to support additional filesystems.

However, the Flink community is currently also looking for a solution to
checkpoint the operator state of running stream processing programs. Flink
processes data streams in real time, similar to Storm, i.e., it schedules
all operators of a streaming program and data is continuously flowing from
operator to operator. Instead of acknowledging each individual record,
Flink injects stream offset markers into the stream at regular intervals.
Whenever an operator receives such a marker, it checkpoints its current
state (currently to the master, with some limitations). In case of a
failure, the stream is replayed (using a replayable source such as Kafka)
from the last checkpoint that was not received by all sink operators, and
all operator states are reset to that checkpoint.
We had already looked at Ignite and were wondering whether Ignite could be
used to reliably persist the state of streaming operators.
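The marker mechanism Fabian describes can be sketched with a toy single-operator pipeline (illustrative only; Flink's real implementation aligns barriers across parallel channels): a barrier triggers a state snapshot, and after a failure the replayable source is re-read from the last snapshot's offset.

```python
BARRIER = object()   # the stream offset marker injected into the stream

def run(source, sink, checkpoints, start_offset=0, state=None):
    """Count records, snapshotting (count, offset) at every barrier."""
    state = dict(state or {"count": 0, "offset": start_offset})
    for item in source[start_offset:]:
        state["offset"] += 1
        if item is BARRIER:
            checkpoints.append(dict(state))   # checkpoint operator state
        else:
            state["count"] += 1
            sink.append(item)
    return state


stream = [1, 2, BARRIER, 3, 4, BARRIER, 5]
sink, checkpoints = [], []

# Process the first five elements, then simulate a failure.
run(stream[:5], sink, checkpoints)

# Recovery: reset to the last checkpoint, discard unconfirmed output, and
# replay the (replayable, Kafka-like) source from the checkpointed offset.
last = checkpoints[-1]
sink = sink[:last["count"]]
final = run(stream, sink, checkpoints, start_offset=last["offset"], state=last)
assert sink == [1, 2, 3, 4, 5] and final["count"] == 5
```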

The other points I mentioned on Twitter are just rough ideas at the moment.

Cheers, Fabian

2015-04-29 0:23 GMT+02:00 Dmitriy Setrakyan <ds...@apache.org>:

> Thanks Cos.
>
> Hello Flink Community.
>
> From Ignite standpoint we definitely would be interested in providing Flink
> processing API on top of Ignite Data Grid or IGFS. It would be interesting
> to hear what steps would be required for such integration or if there are
> other integration points.
>
> D.
>
> On Tue, Apr 28, 2015 at 2:57 PM, Konstantin Boudnik <co...@apache.org>
> wrote:
>
> > Following the lively exchange in Twitter (sic!) I would like to bring
> > together
> > Ignite and Flink communities to discuss the benefits of the integration
> and
> > see where we can start it.
> >
> > We have this recently opened ticket
> >   https://issues.apache.org/jira/browse/IGNITE-813
> >
> > and Fabian has listed the following points:
> >
> >  1) data store
> >  2) parameter server for ML models
> >  3) Checkpointing streaming op state
> >  4) continuously updating views from streams
> >
> > I'd add
> >  5) using Ignite IGFS to speed up Flink's access to HDFS data.
> >
> > I see a lot of interesting correlations between two projects and wonder
> if
> > Flink guys can step up with a few thoughts on where Flink can benefit the
> > most
> > from Ignite's in-memory fabric architecture? Perhaps, it can be used as
> > in-memory storage where the other components of the stack can quickly
> > access
> > and work w/ the data w/o a need to dump it back to slow storage?
> >
> > Thoughts?
> >   Cos
> >
>


Re: [DISCUSS] Flink and Ignite integration

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Thanks Cos.

Hello Flink Community.

From the Ignite standpoint, we would definitely be interested in providing a
Flink processing API on top of Ignite Data Grid or IGFS. It would be
interesting to hear what steps would be required for such an integration, or
if there are other integration points.

D.

On Tue, Apr 28, 2015 at 2:57 PM, Konstantin Boudnik <co...@apache.org> wrote:

> Following the lively exchange in Twitter (sic!) I would like to bring
> together
> Ignite and Flink communities to discuss the benefits of the integration and
> see where we can start it.
>
> We have this recently opened ticket
>   https://issues.apache.org/jira/browse/IGNITE-813
>
> and Fabian has listed the following points:
>
>  1) data store
>  2) parameter server for ML models
>  3) Checkpointing streaming op state
>  4) continuously updating views from streams
>
> I'd add
>  5) using Ignite IGFS to speed up Flink's access to HDFS data.
>
> I see a lot of interesting correlations between two projects and wonder if
> Flink guys can step up with a few thoughts on where Flink can benefit the
> most
> from Ignite's in-memory fabric architecture? Perhaps, it can be used as
> in-memory storage where the other components of the stack can quickly
> access
> and work w/ the data w/o a need to dump it back to slow storage?
>
> Thoughts?
>   Cos
>


[DISCUSS] Flink and Ignite integration

Posted by Konstantin Boudnik <co...@apache.org>.
Following the lively exchange in Twitter (sic!) I would like to bring together
Ignite and Flink communities to discuss the benefits of the integration and
see where we can start it.

We have this recently opened ticket 
  https://issues.apache.org/jira/browse/IGNITE-813

and Fabian has listed the following points:

 1) data store
 2) parameter server for ML models
 3) Checkpointing streaming op state
 4) continuously updating views from streams

I'd add 
 5) using Ignite IGFS to speed up Flink's access to HDFS data.

I see a lot of interesting correlations between the two projects and wonder if
the Flink guys can step up with a few thoughts on where Flink can benefit the
most
from Ignite's in-memory fabric architecture. Perhaps it can be used as
in-memory storage where the other components of the stack can quickly access
and work w/ the data w/o a need to dump it back to slow storage?

Thoughts?
  Cos


Re: Apache Flink and Spark Integration and Acceleration

Posted by Konstantin Boudnik <co...@apache.org>.
On Mon, Apr 27, 2015 at 12:57AM, Suminda Dharmasena wrote:
> I think it is better done by some one who knows the internals of both the
> project.
> 
> I created following for Flink integration:
> https://issues.apache.org/jira/browse/IGNITE-813

Thanks for the ticket. Apache projects are developed by people who work on
them either for fun or on their own schedules. In other words, ASF projects
do not have a boss to schedule the development. With that in mind, you might
not see a quick implementation of the feature unless someone _wants_ to work on it.

Cos

Re: Apache Flink and Spark Integration and Acceleration

Posted by Suminda Dharmasena <si...@sakrio.com>.
I think it is better done by someone who knows the internals of both
projects.

I created following for Flink integration:
https://issues.apache.org/jira/browse/IGNITE-813

Re: Apache Flink and Spark Integration and Acceleration

Posted by Konstantin Boudnik <co...@apache.org>.
I think it'd be great to have, Suminda. In fact we have this 
  https://issues.apache.org/jira/browse/IGNITE-389
open. Would you consider taking a stab at it? It is an open source project -
you don't need to ask for permission to contribute something!

Thanks!
  Cos

On Sun, Apr 26, 2015 at 06:36PM, Suminda Dharmasena wrote:
> Is it possible to consider deeper integration with Flink and Spark