You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Steffen Hausmann <st...@hausmann-family.de> on 2016/08/29 13:30:14 UTC

Submitting watermarks through a Kinesis stream

Hi there,

I'm feeding a Flink stream with events from a Kinesis stream and I'm 
looking for some guidance on how to enable event time in the Flink stream.

I've read through the documentation and it seems like I want to add 
events that carry watermark information to the Kinesis stream and 
subsequently use AssignerWithPunctuatedWatermarks to read and extract 
the watermark information to the Flink stream. However, as a Kinesis 
stream is composed from potentially multiple shards, which are similar 
to Kafka partitions, using a single event to determine the watermark off 
the Flink stream may affect the semantics of the system:

Kinesis guarantees the order within a single shard but not across the 
entire stream. So if a single watermark event is added to the stream, it 
ends up in a particular shard and this shard may be processed faster 
that others. Accordingly, when the event is read and used to determine 
the watermark in the Flink stream, there may be still unprocessed events 
in other shards with an event time that is lower than that of the 
already processed watermark event.

Therefore, it seems like I should submit a watermark event to every 
shard, keep track of the last watermark event for each shard, and use 
the minimum time of those watermark events to determine the watermark 
for the Flink stream.

Am I thinking too complicated here? Any guidance on how to implement 
this correctly is highly appreciated.

Thanks,
Steffen

Re: Submitting watermarks through a Kinesis stream

Posted by Steffen Hausmann <st...@hausmann-family.de>.
That's just awesome!

Thanks,
Steffen

On August 29, 2016 3:39:52 PM GMT+02:00, Stephan Ewen <se...@apache.org> wrote:
>You are thinking too complicated here ;-) because Flink internally
>already
>does all the logic of monitoring the minimum watermark across stream
>partitions.
>As long as you match the Flink source parallelism to the number of
>Kinesis
>shared, that part is taken care of for you.
>
>You only need to publish watermarks to the shared that describe that
>shard's particular event time.
>
>On Mon, Aug 29, 2016 at 3:30 PM, Steffen Hausmann <
>steffen@hausmann-family.de> wrote:
>
>> Hi there,
>>
>> I'm feeding a Flink stream with events from a Kinesis stream and I'm
>> looking for some guidance on how to enable event time in the Flink
>stream.
>>
>> I've read through the documentation and it seems like I want to add
>events
>> that carry watermark information to the Kinesis stream and
>subsequently use
>> AssignerWithPunctuatedWatermarks to read and extract the watermark
>> information to the Flink stream. However, as a Kinesis stream is
>composed
>> from potentially multiple shards, which are similar to Kafka
>partitions,
>> using a single event to determine the watermark off the Flink stream
>may
>> affect the semantics of the system:
>>
>> Kinesis guarantees the order within a single shard but not across the
>> entire stream. So if a single watermark event is added to the stream,
>it
>> ends up in a particular shard and this shard may be processed faster
>that
>> others. Accordingly, when the event is read and used to determine the
>> watermark in the Flink stream, there may be still unprocessed events
>in
>> other shards with an event time that is lower than that of the
>already
>> processed watermark event.
>>
>> Therefore, it seems like I should submit a watermark event to every
>shard,
>> keep track of the last watermark event for each shard, and use the
>minimum
>> time of those watermark events to determine the watermark for the
>Flink
>> stream.
>>
>> Am I thinking too complicated here? Any guidance on how to implement
>this
>> correctly is highly appreciated.
>>
>> Thanks,
>> Steffen
>>


Re: Submitting watermarks through a Kinesis stream

Posted by Stephan Ewen <se...@apache.org>.
You are thinking too complicated here ;-) because Flink internally already
does all the logic of monitoring the minimum watermark across stream
partitions.
As long as you match the Flink source parallelism to the number of Kinesis
shared, that part is taken care of for you.

You only need to publish watermarks to the shared that describe that
shard's particular event time.

On Mon, Aug 29, 2016 at 3:30 PM, Steffen Hausmann <
steffen@hausmann-family.de> wrote:

> Hi there,
>
> I'm feeding a Flink stream with events from a Kinesis stream and I'm
> looking for some guidance on how to enable event time in the Flink stream.
>
> I've read through the documentation and it seems like I want to add events
> that carry watermark information to the Kinesis stream and subsequently use
> AssignerWithPunctuatedWatermarks to read and extract the watermark
> information to the Flink stream. However, as a Kinesis stream is composed
> from potentially multiple shards, which are similar to Kafka partitions,
> using a single event to determine the watermark off the Flink stream may
> affect the semantics of the system:
>
> Kinesis guarantees the order within a single shard but not across the
> entire stream. So if a single watermark event is added to the stream, it
> ends up in a particular shard and this shard may be processed faster that
> others. Accordingly, when the event is read and used to determine the
> watermark in the Flink stream, there may be still unprocessed events in
> other shards with an event time that is lower than that of the already
> processed watermark event.
>
> Therefore, it seems like I should submit a watermark event to every shard,
> keep track of the last watermark event for each shard, and use the minimum
> time of those watermark events to determine the watermark for the Flink
> stream.
>
> Am I thinking too complicated here? Any guidance on how to implement this
> correctly is highly appreciated.
>
> Thanks,
> Steffen
>