You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Yifei Li <le...@gmail.com> on 2016/04/19 02:11:49 UTC

Does Flink support joining multiple streams based on event time window now?

Hi,

I am new to Flink and I've read some documentation and think Flink may fit
my scenario.

Here is my scenario:

1. Assume I have 3 streams: S1(id, name, email, action, date), S2(id, name,
email, level, date), S3(id, name, position, date).

*2. S2 always delays(hours to days, not determined..) *

3. Based on the event time, I want to join S1, S2 and S3 every 5 minutes.
The join is like a SQL join:
    select S1.name, S3.position from S1, S2, S3 where S1.id = S2.id and
S1.id = S3.id and S1.action = 'download' and S2.level = 5



Can I use Flink for my scenario? Is yes, can anyone point me to some
working examples(I found some examples but they are outdated), or tell me
some workaround to solve this problem? If no, can anyone tell me the
reasons?

Thanks,

Yifei

Re: Does Flink support joining multiple streams based on event time window now?

Posted by Till Rohrmann <tr...@apache.org>.
If the data exceeds the main memory of your machine, then you should use
the RocksDBStateBackend as a state backend. It allows you to store state
(including windows) on disk. Thus, the size of state you can store is then
limited by your hard disk capacity.

If the expected data size can be kept in memory, then you should use the
FileStateBackend which keeps your state data in memory and only writes it
to disk if you draw a checkpoint.

Cheers,
Till
​

On Tue, Apr 19, 2016 at 6:08 PM, Yifei Li <le...@gmail.com> wrote:

> Hi Till and Aljoscha,
>
> Thank you so much for your suggestions and I'll try them out. I have
> another question.
>
> Since S2 my be days delayed, so there are may be lots of windows and large
> amount of data stored in memory waiting for computation. How does Flink
> deal with that?
>
> Thanks,
>
> Yifei
>
> On Tue, Apr 19, 2016 at 2:55 AM, Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Yifei,
>>
>> if you don't wanna implement your own join operator, then you could also
>> chain two join operations. I created a small example to demonstrate that:
>> https://gist.github.com/tillrohrmann/c074b4eedb9deaf9c8ca2a5e124800f3.
>> However, bare in mind that for this approach you will construct two windows
>> which might be a bit more costly than Aljoscha's approach.
>>
>> Cheers,
>> Till
>>
>> On Tue, Apr 19, 2016 at 11:32 AM, Aljoscha Krettek <al...@apache.org>
>> wrote:
>>
>>> Hi,
>>> right now, there is no built-in support for n-ary joins. I am working on
>>> this, however.
>>>
>>> For now you can simulate n-ary joins by using a tagged union and doing
>>> the join yourself in a WindowFunction. I created a small example that
>>> demonstrates this:
>>> https://gist.github.com/aljoscha/a2a213d90c7c1bc67e71fabaa82fba4a
>>>
>>> I hope this helps, and please let us know if you want to know more.
>>>
>>> Cheers,
>>> Aljoscha
>>>
>>> On Tue, 19 Apr 2016 at 02:11 Yifei Li <le...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to Flink and I've read some documentation and think Flink may
>>>> fit my scenario.
>>>>
>>>> Here is my scenario:
>>>>
>>>> 1. Assume I have 3 streams: S1(id, name, email, action, date), S2(id,
>>>> name, email, level, date), S3(id, name, position, date).
>>>>
>>>> *2. S2 always delays(hours to days, not determined..) *
>>>>
>>>> 3. Based on the event time, I want to join S1, S2 and S3 every 5
>>>> minutes. The join is like a SQL join:
>>>>     select S1.name, S3.position from S1, S2, S3 where S1.id = S2.id and
>>>> S1.id = S3.id and S1.action = 'download' and S2.level = 5
>>>>
>>>>
>>>>
>>>> Can I use Flink for my scenario? Is yes, can anyone point me to some
>>>> working examples(I found some examples but they are outdated), or tell me
>>>> some workaround to solve this problem? If no, can anyone tell me the
>>>> reasons?
>>>>
>>>> Thanks,
>>>>
>>>> Yifei
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>

Re: Does Flink support joining multiple streams based on event time window now?

Posted by Yifei Li <le...@gmail.com>.
Hi Till and Aljoscha,

Thank you so much for your suggestions and I'll try them out. I have
another question.

Since S2 my be days delayed, so there are may be lots of windows and large
amount of data stored in memory waiting for computation. How does Flink
deal with that?

Thanks,

Yifei

On Tue, Apr 19, 2016 at 2:55 AM, Till Rohrmann <tr...@apache.org> wrote:

> Hi Yifei,
>
> if you don't wanna implement your own join operator, then you could also
> chain two join operations. I created a small example to demonstrate that:
> https://gist.github.com/tillrohrmann/c074b4eedb9deaf9c8ca2a5e124800f3.
> However, bare in mind that for this approach you will construct two windows
> which might be a bit more costly than Aljoscha's approach.
>
> Cheers,
> Till
>
> On Tue, Apr 19, 2016 at 11:32 AM, Aljoscha Krettek <al...@apache.org>
> wrote:
>
>> Hi,
>> right now, there is no built-in support for n-ary joins. I am working on
>> this, however.
>>
>> For now you can simulate n-ary joins by using a tagged union and doing
>> the join yourself in a WindowFunction. I created a small example that
>> demonstrates this:
>> https://gist.github.com/aljoscha/a2a213d90c7c1bc67e71fabaa82fba4a
>>
>> I hope this helps, and please let us know if you want to know more.
>>
>> Cheers,
>> Aljoscha
>>
>> On Tue, 19 Apr 2016 at 02:11 Yifei Li <le...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am new to Flink and I've read some documentation and think Flink may
>>> fit my scenario.
>>>
>>> Here is my scenario:
>>>
>>> 1. Assume I have 3 streams: S1(id, name, email, action, date), S2(id,
>>> name, email, level, date), S3(id, name, position, date).
>>>
>>> *2. S2 always delays(hours to days, not determined..) *
>>>
>>> 3. Based on the event time, I want to join S1, S2 and S3 every 5
>>> minutes. The join is like a SQL join:
>>>     select S1.name, S3.position from S1, S2, S3 where S1.id = S2.id and
>>> S1.id = S3.id and S1.action = 'download' and S2.level = 5
>>>
>>>
>>>
>>> Can I use Flink for my scenario? Is yes, can anyone point me to some
>>> working examples(I found some examples but they are outdated), or tell me
>>> some workaround to solve this problem? If no, can anyone tell me the
>>> reasons?
>>>
>>> Thanks,
>>>
>>> Yifei
>>>
>>>
>>>
>>>
>>>
>>>
>

Re: Does Flink support joining multiple streams based on event time window now?

Posted by Till Rohrmann <tr...@apache.org>.
Hi Yifei,

if you don't wanna implement your own join operator, then you could also
chain two join operations. I created a small example to demonstrate that:
https://gist.github.com/tillrohrmann/c074b4eedb9deaf9c8ca2a5e124800f3.
However, bare in mind that for this approach you will construct two windows
which might be a bit more costly than Aljoscha's approach.

Cheers,
Till

On Tue, Apr 19, 2016 at 11:32 AM, Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
> right now, there is no built-in support for n-ary joins. I am working on
> this, however.
>
> For now you can simulate n-ary joins by using a tagged union and doing the
> join yourself in a WindowFunction. I created a small example that
> demonstrates this:
> https://gist.github.com/aljoscha/a2a213d90c7c1bc67e71fabaa82fba4a
>
> I hope this helps, and please let us know if you want to know more.
>
> Cheers,
> Aljoscha
>
> On Tue, 19 Apr 2016 at 02:11 Yifei Li <le...@gmail.com> wrote:
>
>> Hi,
>>
>> I am new to Flink and I've read some documentation and think Flink may
>> fit my scenario.
>>
>> Here is my scenario:
>>
>> 1. Assume I have 3 streams: S1(id, name, email, action, date), S2(id,
>> name, email, level, date), S3(id, name, position, date).
>>
>> *2. S2 always delays(hours to days, not determined..) *
>>
>> 3. Based on the event time, I want to join S1, S2 and S3 every 5 minutes.
>> The join is like a SQL join:
>>     select S1.name, S3.position from S1, S2, S3 where S1.id = S2.id and
>> S1.id = S3.id and S1.action = 'download' and S2.level = 5
>>
>>
>>
>> Can I use Flink for my scenario? Is yes, can anyone point me to some
>> working examples(I found some examples but they are outdated), or tell me
>> some workaround to solve this problem? If no, can anyone tell me the
>> reasons?
>>
>> Thanks,
>>
>> Yifei
>>
>>
>>
>>
>>
>>

Re: Does Flink support joining multiple streams based on event time window now?

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
right now, there is no built-in support for n-ary joins. I am working on
this, however.

For now you can simulate n-ary joins by using a tagged union and doing the
join yourself in a WindowFunction. I created a small example that
demonstrates this:
https://gist.github.com/aljoscha/a2a213d90c7c1bc67e71fabaa82fba4a

I hope this helps, and please let us know if you want to know more.

Cheers,
Aljoscha

On Tue, 19 Apr 2016 at 02:11 Yifei Li <le...@gmail.com> wrote:

> Hi,
>
> I am new to Flink and I've read some documentation and think Flink may fit
> my scenario.
>
> Here is my scenario:
>
> 1. Assume I have 3 streams: S1(id, name, email, action, date), S2(id,
> name, email, level, date), S3(id, name, position, date).
>
> *2. S2 always delays(hours to days, not determined..) *
>
> 3. Based on the event time, I want to join S1, S2 and S3 every 5 minutes.
> The join is like a SQL join:
>     select S1.name, S3.position from S1, S2, S3 where S1.id = S2.id and
> S1.id = S3.id and S1.action = 'download' and S2.level = 5
>
>
>
> Can I use Flink for my scenario? Is yes, can anyone point me to some
> working examples(I found some examples but they are outdated), or tell me
> some workaround to solve this problem? If no, can anyone tell me the
> reasons?
>
> Thanks,
>
> Yifei
>
>
>
>
>
>