Posted to user@flink.apache.org by Alex Rovner <al...@magnetic.com> on 2016/01/05 01:23:44 UTC

Re: streaming state

Thank you Stephan for the information!

On Mon, Dec 14, 2015 at 5:20 AM Stephan Ewen <se...@apache.org> wrote:

> Hi Alex!
>
> Right now, Flink does not reuse Kafka's partitioning for joins, but
> shuffles/partitions the data by itself. Flink is very fast at shuffling
> and adds very little latency on shuffles, so that is usually not an
> issue. The reason for that design is that we view a streaming program as
> something dynamic: Kafka partitions may be added or removed during a
> program's lifetime, and the parallelism (and with it the partitioning
> scheme) can change as well. With Flink handling the partitioning
> internally, these cases are all covered.
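>
> For illustration, a minimal sketch of that setup (topic name, group id
> and key extraction are made up, and the Kafka consumer class name
> differs a bit by Flink/Kafka version):
>
>     Properties props = new Properties();
>     props.setProperty("bootstrap.servers", "kafka:9092");
>     props.setProperty("group.id", "join-demo");
>
>     StreamExecutionEnvironment env =
>             StreamExecutionEnvironment.getExecutionEnvironment();
>
>     // Read from Kafka; how the topic is partitioned does not matter for the join.
>     DataStream<String> records = env.addSource(
>             new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));
>
>     // keyBy() re-partitions the stream on the join key inside Flink, so adding
>     // Kafka partitions or changing the job parallelism does not break anything.
>     KeyedStream<String, String> keyed = records.keyBy(new KeySelector<String, String>() {
>         public String getKey(String record) {
>             return record.split(",")[0];   // assume the key is the first CSV field
>         }
>     });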
>
> Concerning the join: The built-in join is definitely able to handle
> millions of records in a window, and it scales well. It windows the two
> streams together and joins within each window. If you want responses
> within a second, make the window small enough that it is evaluated
> every 500 ms or so.
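>
> A minimal sketch with the DataStream API (the POJOs and field names here
> are made up, and window assigner class names shift a bit between Flink
> versions):
>
>     // Hypothetical POJOs with a shared key field:
>     public static class Impression { public String userId; public String ad; }
>     public static class Click      { public String userId; public String target; }
>
>     // Small tumbling windows so results are emitted roughly every 500 ms:
>     DataStream<String> joined = impressions
>         .join(clicks)
>         .where(new KeySelector<Impression, String>() {
>             public String getKey(Impression i) { return i.userId; }
>         })
>         .equalTo(new KeySelector<Click, String>() {
>             public String getKey(Click c) { return c.userId; }
>         })
>         .window(TumblingProcessingTimeWindows.of(Time.milliseconds(500)))
>         .apply(new JoinFunction<Impression, Click, String>() {
>             public String join(Impression i, Click c) {
>                 return i.userId + ": " + i.ad + " / " + c.target;
>             }
>         });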
>
> If you want super-low-latency joins, you can look into using custom
> state to do that. With that, you could, for example, build your own
> symmetric hash join. That has virtually zero latency, and you can
> control how long each side keeps the data.
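>
> As a rough sketch, such a symmetric hash join could be a stateful
> CoFlatMapFunction on two keyed, connected streams (reusing the made-up
> Impression/Click POJOs from above; the keyed-state API shown is the
> Flink 1.x ValueState API, and this version keeps only the latest element
> per key and side):
>
>     public class LowLatencyJoin
>             extends RichCoFlatMapFunction<Impression, Click, String> {
>
>         private transient ValueState<Impression> lastImpression;
>         private transient ValueState<Click> lastClick;
>
>         @Override
>         public void open(Configuration parameters) {
>             lastImpression = getRuntimeContext().getState(
>                     new ValueStateDescriptor<>("lastImpression", Impression.class));
>             lastClick = getRuntimeContext().getState(
>                     new ValueStateDescriptor<>("lastClick", Click.class));
>         }
>
>         @Override
>         public void flatMap1(Impression i, Collector<String> out) throws Exception {
>             lastImpression.update(i);          // remember this side
>             Click c = lastClick.value();       // probe the other side
>             if (c != null) {
>                 out.collect(i.userId + ": " + i.ad + " / " + c.target);
>             }
>         }
>
>         @Override
>         public void flatMap2(Click c, Collector<String> out) throws Exception {
>             lastClick.update(c);
>             Impression i = lastImpression.value();
>             if (i != null) {
>                 out.collect(i.userId + ": " + i.ad + " / " + c.target);
>             }
>         }
>     }
>
>     // Wiring: key both inputs by userId so state is partitioned by the join key.
>     impressions
>         .connect(clicks)
>         .keyBy("userId", "userId")
>         .flatMap(new LowLatencyJoin());
>
> Each record is joined the moment its counterpart arrives, and you decide
> yourself when to clear the state, which is what gives you the low
> latency.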
>
> Concerning key lookups vs. shuffling: The shuffle variant is usually
> much faster, because it uses the network better. The shuffle is fully
> pipelined, many records are in flight at the same time, and it is
> optimized for throughput while still keeping latency quite low (
> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
> ).
> In contrast, key lookups that avoid shuffling usually take a bit of time
> (a millisecond or so) and limit throughput a lot, because they involve
> many smaller messages and add even more latency (a round trip between
> nodes rather than one way).
>
>
> Hope this answers your questions; let me know if you have more!
>
> Greetings,
> Stephan
>
>
> On Fri, Dec 11, 2015 at 4:00 PM, Alex Rovner <al...@magnetic.com>
> wrote:
>
>> Hello all,
>>
>> I was wondering if someone would be kind enough to enlighten me on a
>> few topics. We are trying to join two streams of data on a key. We were
>> thinking of partitioning the topics in Kafka by the key; however, I also
>> saw that Flink is able to partition on its own, so I was wondering
>> whether Flink can take advantage of Kafka's partitioning and/or which
>> partitioning scheme I should go for.
>>
>> As far as joins go, our two datasets are very large (millions of
>> records in each window) and we need to perform the joins very quickly
>> (in less than 1 second). Would the built-in join mechanism be sufficient
>> for this? From what I understand, it would need to shuffle data and thus
>> might be slow on large datasets. Is there a way to join via key-value
>> state lookups to avoid the shuffling?
>>
>> So far I have read the docs and the blogs, so I have some limited
>> understanding of how Flink works, though no practical experience yet.
>>
>> Thanks
>>
>>
>> Alex Rovner
>> Director, Data Engineering
>> o: 646.759.0052
>>
>> <http://www.magnetic.com/>
>>
>
--
Alex Rovner
Director, Data Engineering
o: 646.759.0052

<http://www.magnetic.com/>