You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Nick Tinnemeier <nt...@bol.com> on 2016/12/16 15:24:35 UTC

Question about expired checkpoints

Hi all,

I am currently playing around with checkpoints to better understand how they work. I have some questions I hope you can answer.
I am running a simple topology with a source, a map and a sink that writes the events it receives to a HBase table. The parallelism of the environment is set to 10. Moreover, we set the parallelism of the checkpoints to 20. The source is a custom one, implementing CheckpointListener interface. In the notifyCheckpointComplete method of the source we simply log the checkpoint id of each checkpoint that is being notified.

When I run the application, I notice that sometimes not all checkpoints are notified. Indeed, when I look into the logs of the application I see messages like:
"Checkpoint 17 expired before completing"

I am sure the checkpoint did not timeout. What exactly does this mean and when does it happen? It makes me wonder if this means that checkpoints can actually overtake each other. In other words, can it actually happen that a checkpoint x arrives sooner at the sink than a checkpoint x-1 that was sent earlier than x?

Kind regards,
Nick.

Re: Question about expired checkpoints

Posted by Stephan Ewen <se...@apache.org>.

Hi Nick!

In general, checkpoints cannot overtake each other. It can happen (in the
presence of failure/recovery) that a checkpoint is "half complete" and
subsumed by a newer complete checkpoint.
The message "Checkpoint 17 expired before completing" might be correct -
you could check the start time of the checkpoint and timeout setting to
validate that.

In general, the "notifyCheckpointComplete" method is not guaranteed to be
called for every checkpoint (it can be skipped in presence of
failure/recovery).

Stephan

On Fri, Dec 16, 2016 at 4:24 PM, Nick Tinnemeier <nt...@bol.com>
wrote:

> Hi all,
>
>
>
> I am currently playing around with checkpoints to better understand how
> they work. I have some questions I hope you can answer.
>
> I am running a simple topology with a source, a map and a sink that writes
> the events it receives to a HBase table. The parallelism of the environment
> is set to 10. Moreover, we set the parallelism of the checkpoints to 20.
> The source is a custom one, implementing CheckpointListener interface. In
> the notifyCheckpointComplete method of the source we simply log the
> checkpoint id of each checkpoint that is being notified.
>
>
>
> When I run the application, I notice that sometimes not all checkpoints
> are notified. Indeed, when I look into the logs of the application I see
> messages like:
>
> "Checkpoint 17 expired before completing"
>
>
>
> I am sure the checkpoint did not timeout. What exactly does this mean and
> when does it happen? It makes me wonder if this means that checkpoints can
> actually overtake each other. In other words, can it actually happen that a
> checkpoint x arrives sooner at the sink than a checkpoint x-1 that was sent
> earlier than x?
>
>
>
> Kind regards,
>
> Nick.
>
>
>