You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Gagan Agrawal <ag...@gmail.com> on 2018/11/24 13:08:04 UTC

Joining more than 2 streams

Hi,
I want to do window join on multiple Kafka streams (say a, b, c) on common
field in all 3 streams and apply some custom function on joined stream. As
I understand we can join only 2 streams at a time via DataStream api. So
may be I need to join a and b first and then join first joined stream with
c. I want to understand how would stream state be stored in backend? Since
I will be joining a and b stream first, I believe both streams will be
stored in state backend for window time. And then again join of first
joined stream (of a and b) with c will result storage of all 3 streams for
windowed period. Does that mean stream a and b are stored twice in state
backend?

Let's say instead of using inbuilt join api, if I rather union all 3
streams (after transforming them to common schema) and keyBy stream on
common field and apply process function where I implement joining on my own
and store streams in some state backend, will that be more storage
efficient as I will be saving 3 streams just once instead of twice?

Gagan

Re: Joining more than 2 streams

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

Yes, your reasoning is correct. If you use two binary joins, the data of
the first two streams will be buffered twice.
Unioning all three streams and joining them in a custom ProcessFunction
would reduce the amount of required state.

Best, Fabian

Am Sa., 24. Nov. 2018 um 14:08 Uhr schrieb Gagan Agrawal <
agrawalgagan@gmail.com>:

> Hi,
> I want to do window join on multiple Kafka streams (say a, b, c) on common
> field in all 3 streams and apply some custom function on joined stream. As
> I understand we can join only 2 streams at a time via DataStream api. So
> may be I need to join a and b first and then join first joined stream with
> c. I want to understand how would stream state be stored in backend? Since
> I will be joining a and b stream first, I believe both streams will be
> stored in state backend for window time. And then again join of first
> joined stream (of a and b) with c will result storage of all 3 streams for
> windowed period. Does that mean stream a and b are stored twice in state
> backend?
>
> Let's say instead of using inbuilt join api, if I rather union all 3
> streams (after transforming them to common schema) and keyBy stream on
> common field and apply process function where I implement joining on my own
> and store streams in some state backend, will that be more storage
> efficient as I will be saving 3 streams just once instead of twice?
>
> Gagan
>