Posted to user@flink.apache.org by Alex Rovner <al...@magnetic.com> on 2015/12/11 16:00:55 UTC

streaming state

Hello all,

I was wondering if someone would be kind enough to enlighten me on a few
topics. We are trying to join two streams of data on a key. We were
thinking of partitioning topics in Kafka by the key; however, I also saw
that Flink is able to partition on its own, and I was wondering whether
Flink can take advantage of Kafka's partitioning, and/or which partitioning
scheme I should go for.

As far as joins go, our two datasets are very large (millions of records in
each window) and we need to perform the joins very quickly (less than 1
sec). Would the built-in join mechanism be sufficient for this? From what I
understand, it would need to shuffle data and would thus be slow on large
datasets. I was wondering if there is a way to join via state key-value
lookups to avoid the shuffling?

I have read the docs and blogs so far, so I have some limited understanding
of how Flink works, though no practical experience.

Thanks


*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052

* <http://www.magnetic.com/>*

Re: streaming state

Posted by Alex Rovner <al...@magnetic.com>.
Thank you Stephan for the information!

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052

* <http://www.magnetic.com/>*

Re: streaming state

Posted by Stephan Ewen <se...@apache.org>.
Hi Alex!

Right now, Flink would not reuse Kafka's partitioning for joins, but would
shuffle/partition the data by itself. Flink is very fast at shuffling and
adds very little latency on shuffles, so that is usually not an issue. The
reason for that design is that we view a streaming program as something
dynamic: Kafka partitions may be added or removed during a program's
lifetime, and the parallelism (and with it the partitioning scheme) can
change as well. With Flink handling the partitioning internally, these
cases are all covered.
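To make concrete why a fixed upstream (Kafka) partitioning cannot simply be
reused: a keyed shuffle typically assigns a key to an operator instance by
hashing modulo the current parallelism, so changing the parallelism moves
keys to different instances. A minimal sketch in plain Python (illustrative
only; Flink's actual key-to-instance assignment differs in detail):

```python
import zlib

def partition(key: str, parallelism: int) -> int:
    """Assign a key to one of `parallelism` operator instances.

    A stable hash (CRC32 here, for reproducibility) modulo the
    parallelism: the assignment depends on BOTH the key and the
    parallelism, so rescaling the job changes which instance owns
    which key, and a partitioning baked into Kafka topics would
    no longer line up.
    """
    return zlib.crc32(key.encode("utf-8")) % parallelism
```

For example, a key that lands on instance 2 of 4 will generally land on a
different instance once the job runs with parallelism 5, which is exactly
the case an internally managed shuffle handles automatically.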

Concerning the join: The built-in join is definitely able to handle
millions of records in a window, and it scales well. What it does is window
the streams together and join within the windows. If you want responses
within a second, you should make the window small enough that it is
evaluated every 500ms or so.
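The windowed join described above can be sketched in plain Python (this is
not the Flink API, just the idea: bucket both streams into tumbling windows
by timestamp, then join per key within each window; all names are
illustrative):

```python
from collections import defaultdict

def windowed_join(stream_a, stream_b, window_ms):
    """Tumbling-window equi-join over two streams of
    (timestamp_ms, key, value) records: records from both streams
    that share a key AND fall into the same window are paired up."""
    # window index -> (left key->values, right key->values)
    windows = defaultdict(lambda: (defaultdict(list), defaultdict(list)))
    for ts, key, value in stream_a:
        windows[ts // window_ms][0][key].append(value)
    for ts, key, value in stream_b:
        windows[ts // window_ms][1][key].append(value)
    results = []
    for _, (left, right) in sorted(windows.items()):
        for key, left_values in left.items():
            for a in left_values:
                for b in right.get(key, []):
                    results.append((key, a, b))
    return results
```

In Flink itself the equivalent uses the DataStream API, roughly
`streamA.join(streamB).where(...).equalTo(...).window(...).apply(...)`; the
sketch only shows why a small window (e.g. 500ms) bounds how long a record
waits before its window is evaluated.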

If you want super-low-latency joins, you can look into using custom state
to do that. With that, you could, for example, build a custom symmetric
hash join. That has virtually zero latency, and you can control how long
each side keeps the data.
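A symmetric hash join, in outline: each side maintains a hash table of the
records it has seen, and every arriving record probes the other side's
table and emits matches immediately, which is where the near-zero latency
comes from. A minimal sketch (plain Python, not Flink state APIs; retention
and eviction, which Stephan notes you control yourself, are left out):

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Streaming symmetric hash join: both inputs are build side and
    probe side at once. Each insert stores the record in its own
    side's table, probes the opposite table, and returns the joined
    pairs produced by that single record."""

    def __init__(self):
        self.left = defaultdict(list)   # key -> values seen on the left
        self.right = defaultdict(list)  # key -> values seen on the right

    def insert_left(self, key, value):
        self.left[key].append(value)
        # Probe the right-side table: matches are emitted immediately.
        return [(key, value, r) for r in self.right.get(key, [])]

    def insert_right(self, key, value):
        self.right[key].append(value)
        # Probe the left-side table symmetrically.
        return [(key, l, value) for l in self.left.get(key, [])]
```

In a real job, the two tables would live in Flink's keyed state so they
survive failures, and each side would evict entries after its configured
retention.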

Concerning key lookups vs. shuffling: The shuffle variant is usually much
faster, because it uses the network better. The shuffle is fully pipelined,
many records are in the shuffle at the same time, and it is optimized for
throughput while still keeping latency quite low (
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
).
In contrast, key lookups that avoid shuffling usually take a bit of time (a
millisecond or so) and limit throughput a lot, because they involve many
smaller messages and add even more latency (a round trip between nodes,
rather than one way).


Hope that this answers your question, let me know if you have more
questions!

Greetings,
Stephan

