Posted to user@flink.apache.org by Stephen Connolly <st...@gmail.com> on 2020/03/23 15:14:33 UTC

How to debug checkpoints failing to complete

We have a topology whose checkpoints fail to complete a *lot* of the time.

Typically it is just one subtask that fails.

We currently run this topology with a parallelism of 2. The other
subtask completes in 3ms, yet on the rare occasions when a checkpoint
does complete, the end-to-end duration is around 4m30s.

How can I start debugging this? When I run locally on my development
cluster I have no issues; the problem only seems to show up in production.