You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Maxim <mf...@gmail.com> on 2016/03/07 21:18:33 UTC

Dynamically repartitioned sources

I'm looking at using Flink for a streaming project that has to use some
internal systems as event sources. They are very similar to Kafka in their
semantic. The data is partitioned and each partition can be replayed from a
specified offset.

The first system creates and deletes such partitions dynamically based on
load. It provides an API to get list of partitions as well as their state
(open, closed for append).

The second system has a fixed set of a few thousand partitions, but they
are allocated to a dynamic set of hosts and each host provides poll API
that returns events from all partitions that currently reside on it. The
metadata API that returns current mapping of partitions to hosts is
provided.

I found a thread
<http://mail-archives.apache.org/mod_mbox/flink-user/201602.mbox/%3CCANjo42xgyUZAU=fmgGVFXVYMj7nVt67=3eJY=pWRc_nZdQ-EkA@mail.gmail.com%3E>
that
mentioned that changing parallelism is one of the high priority items for
this year. Has any work started on it? And would it support the type of
dynamic sources we have?

I could try adding such support myself if it would help to speed things up.

Thanks,

Maxim.

Re: Dynamically repartitioned sources

Posted by Robert Metzger <rm...@apache.org>.

Hi Maxim,

you can implement a source for the system you are describing without
changing the parallelism of Flink. What you have to do is implement your
own data sources for Flink.
I would start by implementing the ParallelSourceFunction interface, where
each parallel source instance is reading from a subset of servers.

So basically one "flink partition" is reading from one or more partitions
of your system.


On Mon, Mar 7, 2016 at 9:18 PM, Maxim <mf...@gmail.com> wrote:

> I'm looking at using Flink for a streaming project that has to use some
> internal systems as event sources. They are very similar to Kafka in their
> semantic. The data is partitioned and each partition can be replayed from a
> specified offset.
>
> The first system creates and deletes such partitions dynamically based on
> load. It provides an API to get list of partitions as well as their state
> (open, closed for append).
>
> The second system has a fixed set of a few thousand partitions, but they
> are allocated to a dynamic set of hosts and each host provides poll API
> that returns events from all partitions that currently reside on it. The
> metadata API that returns current mapping of partitions to hosts is
> provided.
>
> I found a thread
> <
> http://mail-archives.apache.org/mod_mbox/flink-user/201602.mbox/%3CCANjo42xgyUZAU=fmgGVFXVYMj7nVt67=3eJY=pWRc_nZdQ-EkA@mail.gmail.com%3E
> >
> that
> mentioned that changing parallelism is one of the high priority items for
> this year. Has any work started on it? And would it support the type of
> dynamic sources we have?
>
> I could try adding such support myself if it would help to speed things up.
>
> Thanks,
>
> Maxim.
>