Posted to user@flink.apache.org by Matt <dr...@gmail.com> on 2017/04/24 09:47:49 UTC

Flink on Ignite - Collocation?

Hi all,

I've been playing around with Apache Ignite and I want to run Flink on top
of it but there's something I'm not getting.

Ignite has its own support for clustering, and data is distributed across
different nodes using a partition key. We can then run a closure and do some
computation on the nodes that own the data (collocation of computation [1]),
saving time and bandwidth. It all looks good, but I'm not sure how it would
play with Flink's own clustering capability.

My initial idea, which I haven't tried yet, is to use collocation to run a
closure where the data resides, and use that closure to execute a Flink
pipeline locally on that node (running it in a local environment). Then,
using a custom-made data source, I should be able to plug the data from the
local Ignite cache into the Flink pipeline and back into a cache using an
Ignite sink.
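To make the idea concrete, here's a tiny self-contained sketch of the
affinity-routing pattern I have in mind. The names are stand-ins, not the
real Ignite API: data is partitioned across simulated nodes by key hash, and
a closure runs against the store that owns the key, so the value itself never
crosses the network.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model of affinity-routed computation (stand-in names, not the real
// Ignite API): each simulated node holds a local store, keys are assigned
// to nodes by hash, and a closure is executed against the owning node's
// store, so the data itself never moves.
public class AffinityDemo {
    static final int NODES = 3;
    static final List<Map<String, Integer>> stores = new ArrayList<>();
    static {
        for (int i = 0; i < NODES; i++) stores.add(new HashMap<>());
    }

    // Partition a key to its owning node by hash.
    static int owner(String key) {
        return Math.floorMod(key.hashCode(), NODES);
    }

    static void put(String key, int value) {
        stores.get(owner(key)).put(key, value);
    }

    // "affinityRun": ship the closure to the node that owns the key,
    // instead of shipping the data to the closure.
    static void affinityRun(String key, Consumer<Map<String, Integer>> closure) {
        closure.accept(stores.get(owner(key)));
    }

    public static void main(String[] args) {
        put("sensor-1", 10);
        // The closure doubles the value in place on the owning node.
        affinityRun("sensor-1", local ->
                local.computeIfPresent("sensor-1", (k, v) -> v * 2));
        System.out.println(stores.get(owner("sensor-1")).get("sensor-1")); // prints 20
    }
}
```

In the real setup the closure body would build a Flink pipeline on a local
environment and read from the local cache, but that part needs the actual
Ignite and Flink APIs.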

I'm not sure it's a good idea to disable Flink's own distribution and run it
in a local environment just so the data is not transferred to another node. I
think Kafka has the same problem: if it partitions the data across different
nodes, how do you guarantee that Flink jobs are executed where the data
resides? If there's no way to guarantee that short of using a local
environment, what do you think of that approach (in terms of performance)?
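For reference, this is roughly how key-based partitioning decides placement
in the first place. The snippet below is a simplified stand-in (Kafka's real
default partitioner uses murmur2 hashing, and Ignite uses its affinity
function), but the point is the same: the key pins which partition holds the
data, while the task that processes it is scheduled independently.

```java
public class PartitionDemo {
    // Simplified stand-in for key-based partitioning:
    // partition = hash(key) mod numPartitions.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 4;
        int p1 = partitionFor("user-42", partitions);
        int p2 = partitionFor("user-42", partitions);
        // Same key, same partition, every time; that fixes where the data
        // lives, but not where the consuming Flink task runs.
        System.out.println(p1 == p2 && p1 >= 0 && p1 < partitions); // prints true
    }
}
```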

Any additional insight regarding stream processing on Ignite or any other
distributed storage is very welcome!

Best regards,
Matt

[1] https://apacheignite.readme.io/docs/collocate-compute-and-data

Re: Flink on Ignite - Collocation?

Posted by Matt <dr...@gmail.com>.
It seems to me the bottleneck will be the network if I don't collocate the
Flink jobs; after all, Ignite caches are in memory, not on disk, so they're
much faster than the network.

Achieving collocation would be more difficult in Kafka, but it should be
relatively easy in Ignite thanks to its out-of-the-box collocated computation
system. Running a collocated Ignite closure and executing a Flink job in a
local environment should be enough.

Why do you recommend against custom collocation? I may be missing something.

Matt

On Mon, Apr 24, 2017 at 9:47 AM, Ufuk Celebi <uc...@apache.org> wrote:

> Hey Matt,
>
> in general, Flink doesn't put much effort into co-locating sources with
> the data (that doesn't happen for Kafka, etc. either). I think the only
> local assignments happen in the DataSet API for files in HDFS.
>
> Often this is of limited help anyway. Your approach sounds like it
> could work, but I would generally not recommend such custom solutions
> unless you really need them. Have you tried running your program with
> remote reads? What's the bottleneck for you there?
>
> – Ufuk
>
> On Mon, Apr 24, 2017 at 11:47 AM, Matt <dr...@gmail.com> wrote:
> > [...]

Re: Flink on Ignite - Collocation?

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Matt,

in general, Flink doesn't put much effort into co-locating sources with
the data (that doesn't happen for Kafka, etc. either). I think the only
local assignments happen in the DataSet API for files in HDFS.

Often this is of limited help anyway. Your approach sounds like it
could work, but I would generally not recommend such custom solutions
unless you really need them. Have you tried running your program with
remote reads? What's the bottleneck for you there?

– Ufuk

On Mon, Apr 24, 2017 at 11:47 AM, Matt <dr...@gmail.com> wrote:
> [...]