You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by hao kong <ha...@lemonbox.me> on 2020/09/15 11:04:42 UTC

I have a job with multiple Kafka sources. They all contain certain historical data.

Hello, I have a job with multiple Kafka sources. They all contain certain
historical data. If you use the events-time window, it will cause sources
with less data to cover more sources through water mark. Is there a
solution?

Re: I have a job with multiple Kafka sources. They all contain certain historical data.

Posted by Piotr Nowojski <pn...@apache.org>.

Great, thanks for the update! And please share your feedback if it worked
or not.

Piotrek

niedz., 27 wrz 2020 o 11:20 hao kong <ha...@lemonbox.me> napisał(a):

> Thanks for the tip!
>      I am currently trying to implement a zookeeper-based coordinator.use
> it to record the current watermark and control streaming according to your
> first suggest.
>
> Piotr Nowojski <pn...@apache.org> 于2020年9月16日周三 下午11:56写道：
>
>> Hey,
>>
>> If you are worried about increased amount of buffered data by the
>> WindowOperator if watermarks/event time is not progressing uniformly across
>> multiple sources, then there is little you can do currently. FLIP-27 [1]
>> will allow us to address this problem in more generic way. What you can
>> currently do is one of two things:
>>
>> 1. Implement a custom throttling function/operator sitting after the
>> sources, that would throttle the sources. If you chain it with the source
>> function, it's relatively ok solution. Note, while you are blocking
>> execution, you will be blocking for example checkpoints from happening. So
>> it's better to sleep 10 ms per every record, compared to sleep 10 seconds
>> once every 1000 records.
>> 2. Throttle the sources themselves (you would need to modify or write
>> your custom sources).
>>
>> But in both cases you need to manually track the event time, and manually
>> make decision which source should be throttled and by how much.
>>
>> Best regards, Piotrek
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-EventTimeAlignment
>>
>> śr., 16 wrz 2020 o 04:17 hao kong <ha...@lemonbox.me> napisał(a):
>>
>>> Hello guys,
>>>
>>> I have a job with multiple Kafka sources. They all contain certain
>>> historical data. If you use the events-time window, it will cause sources
>>> with less data to cover more sources through water mark.
>>>
>>>
>>> I can think of a solution, Implement a scheduler in the source phase,
>>> But it is quite complicated to implement. Are ther otherbetter solutions?
>>>
>>>
>>> Any suggestions?
>>> Thanks!
>>>
>>>
>>>

Re: I have a job with multiple Kafka sources. They all contain certain historical data.

Posted by hao kong <ha...@lemonbox.me>.

Thanks for the tip!
     I am currently trying to implement a zookeeper-based coordinator.use
it to record the current watermark and control streaming according to your
first suggest.

Piotr Nowojski <pn...@apache.org> 于2020年9月16日周三 下午11:56写道：

> Hey,
>
> If you are worried about increased amount of buffered data by the
> WindowOperator if watermarks/event time is not progressing uniformly across
> multiple sources, then there is little you can do currently. FLIP-27 [1]
> will allow us to address this problem in more generic way. What you can
> currently do is one of two things:
>
> 1. Implement a custom throttling function/operator sitting after the
> sources, that would throttle the sources. If you chain it with the source
> function, it's relatively ok solution. Note, while you are blocking
> execution, you will be blocking for example checkpoints from happening. So
> it's better to sleep 10 ms per every record, compared to sleep 10 seconds
> once every 1000 records.
> 2. Throttle the sources themselves (you would need to modify or write your
> custom sources).
>
> But in both cases you need to manually track the event time, and manually
> make decision which source should be throttled and by how much.
>
> Best regards, Piotrek
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-EventTimeAlignment
>
> śr., 16 wrz 2020 o 04:17 hao kong <ha...@lemonbox.me> napisał(a):
>
>> Hello guys,
>>
>> I have a job with multiple Kafka sources. They all contain certain
>> historical data. If you use the events-time window, it will cause sources
>> with less data to cover more sources through water mark.
>>
>>
>> I can think of a solution, Implement a scheduler in the source phase, But
>> it is quite complicated to implement. Are ther otherbetter solutions?
>>
>>
>> Any suggestions?
>> Thanks!
>>
>>
>>

Re: I have a job with multiple Kafka sources. They all contain certain historical data.

Posted by Piotr Nowojski <pn...@apache.org>.

Hey,

If you are worried about increased amount of buffered data by the
WindowOperator if watermarks/event time is not progressing uniformly across
multiple sources, then there is little you can do currently. FLIP-27 [1]
will allow us to address this problem in more generic way. What you can
currently do is one of two things:

1. Implement a custom throttling function/operator sitting after the
sources, that would throttle the sources. If you chain it with the source
function, it's relatively ok solution. Note, while you are blocking
execution, you will be blocking for example checkpoints from happening. So
it's better to sleep 10 ms per every record, compared to sleep 10 seconds
once every 1000 records.
2. Throttle the sources themselves (you would need to modify or write your
custom sources).

But in both cases you need to manually track the event time, and manually
make decision which source should be throttled and by how much.

Best regards, Piotrek

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-EventTimeAlignment

śr., 16 wrz 2020 o 04:17 hao kong <ha...@lemonbox.me> napisał(a):

> Hello guys,
>
> I have a job with multiple Kafka sources. They all contain certain
> historical data. If you use the events-time window, it will cause sources
> with less data to cover more sources through water mark.
>
>
> I can think of a solution, Implement a scheduler in the source phase, But
> it is quite complicated to implement. Are ther otherbetter solutions?
>
>
> Any suggestions?
> Thanks!
>
>
>

Fwd: I have a job with multiple Kafka sources. They all contain certain historical data.

Posted by hao kong <ha...@lemonbox.me>.

Hello guys,

I have a job with multiple Kafka sources. They all contain certain
historical data. If you use the events-time window, it will cause sources
with less data to cover more sources through water mark.


I can think of a solution, Implement a scheduler in the source phase, But
it is quite complicated to implement. Are ther otherbetter solutions?


Any suggestions?
Thanks!