Posted to user@flink.apache.org by Dineth Kariyawasam <di...@zilingo.com> on 2021/12/11 03:32:17 UTC

Re: Random checkpoint failures with timeouts

Hi Yun,

This is the checkpoint history:

[image: checkpoint history.png]
As you can see, the first checkpoint sometimes fails, and there are
occasional failures in between the completed checkpoints.

First completed checkpoint status (2nd checkpoint):

[image: First completed checkpoint.png]
The sink functions account for most of the overall checkpoint duration.


On Tue, Nov 23, 2021 at 5:06 PM Yun Gao <yu...@aliyun.com> wrote:

> Hi Dineth,
>
> In the Flink UI there are pages with details for the checkpoints [1].
> Could you have a look at this UI
> to see which part of the checkpoint took a long time?
>
> Best,
> Yun
>
>
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/monitoring/checkpoint_monitoring/
>
> ------------------------------------------------------------------
> From:Dineth Kariyawasam <di...@zilingo.com>
> Send Time:2021 Nov. 23 (Tue.) 17:32
> To:user <us...@flink.apache.org>
> Subject:Random checkpoint failures with timeouts
>
> Checkpoints fail randomly with a timeout. This often happens when
> no events are coming into Flink. Most of our incoming data arrives
> during the daytime, and at night there are usually no events. Many of
> these failures have been at night. We had set a checkpoint
> timeout of 2 minutes initially. We increased it to 5 minutes, and the
> frequency of failures has dropped since then. However, checkpointing never
> takes more than 100 seconds when it succeeds. There was one occurrence of
> it taking 118 seconds about a month ago. When it fails, it fails after
> waiting for the full 5 minutes.
>
> Exception log:
>
>
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 18:22:57 +0000 line:1867 "Checkpoint 34 of job ec563be081b87033f7e5f9a94c86fd78 expired before completing."
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 18:22:57 +0000 line:710 "Triggering checkpoint 35 (type=CHECKPOINT) @ 1634926977313 for job ec563be081b87033f7e5f9a94c86fd78."
> org.apache.flink.runtime.jobmaster.JobMaster INFO 2021-10-22 18:22:57 +0000 line:239 "Trying to recover from a global failure."
>
> Flink version: 1.12.5
> Setup: 1 JobManager and 1 TaskManager.
> Checkpoint setup: RocksDB, once every 30 seconds, 2 minute timeout
> (later raised to 5 minutes, as noted above), 30 seconds between checkpoints
>
>
>
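
To see whether the expirations really cluster at night, it may help to pull the checkpoint IDs and trigger times out of the JobManager log. The following is a rough sketch in Python, not part of the original setup: it assumes the log lines look exactly like the three quoted above, and the function names (`expired_checkpoints`, `trigger_times`) are made up for illustration. The number after `@` in the "Triggering checkpoint" message is treated as epoch milliseconds.

```python
import re
from datetime import datetime, timezone

# Sample lines copied from the exception log quoted above.
LOG_LINES = [
    'org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 18:22:57 +0000 line:1867 '
    '"Checkpoint 34 of job ec563be081b87033f7e5f9a94c86fd78 expired before completing."',
    'org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 18:22:57 +0000 line:710 '
    '"Triggering checkpoint 35 (type=CHECKPOINT) @ 1634926977313 for job ec563be081b87033f7e5f9a94c86fd78."',
    'org.apache.flink.runtime.jobmaster.JobMaster INFO 2021-10-22 18:22:57 +0000 line:239 '
    '"Trying to recover from a global failure."',
]

# Patterns are assumptions based only on the message wording seen above.
EXPIRED = re.compile(r"Checkpoint (\d+) of job (\w+) expired before completing")
TRIGGERED = re.compile(r"Triggering checkpoint (\d+) \(type=\w+\) @ (\d+) for job (\w+)")


def expired_checkpoints(lines):
    """Return a list of (checkpoint_id, job_id) for every expiry message."""
    found = []
    for line in lines:
        match = EXPIRED.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


def trigger_times(lines):
    """Map checkpoint id -> trigger time in UTC, truncated to whole seconds."""
    times = {}
    for line in lines:
        match = TRIGGERED.search(line)
        if match:
            epoch_ms = int(match.group(2))
            times[int(match.group(1))] = datetime.fromtimestamp(
                epoch_ms // 1000, tz=timezone.utc
            )
    return times


if __name__ == "__main__":
    for checkpoint_id, job_id in expired_checkpoints(LOG_LINES):
        print(f"checkpoint {checkpoint_id} of job {job_id} expired")
    for checkpoint_id, when in trigger_times(LOG_LINES).items():
        print(f"checkpoint {checkpoint_id} triggered at {when.isoformat()}")
```

Running this over a full day of logs and grouping the expired IDs by hour would show whether the failures line up with the idle night-time window described above.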