You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Paul Lam <pa...@gmail.com> on 2022/02/09 04:49:25 UTC

Re: Inaccurate checkpoint trigger time

Hi Robert & Yun,

Thanks a lot for your helpful analysis! I’ve confirmed that it’s the checkpoint cleanup problem that caused the inaccurate checkpoint trigger time.

Best,
Paul Lam

> 2022年1月30日 19:45,Yun Tang <my...@live.com> 写道:
> 
> Hi Paul,
> 
> I think Robert's idea might be right.
> 
> From the log you pasted, the checkpoint interval is 2m30s. Chk-5 triggered at 16:42:23 and completed at 16:42:42.
> In the normal case, chk-6 would be triggered near 16:44:53. However, the actual chk-6 triggered at 16:46:02, which is obviously not normal case.
> 
> I think your analysis is not correct due to the log below:
> 2022-01-27 16:46:02,693 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 6 (type=CHECKPOINT) @ 1643273162422 for job 3a57fdaa16502c411a46471bba595d7c.
> 
> If you translate the unix time 1643273162422 to Beijing time, you can see this is actually 2022-01-27 16:46:02.
> If the Zookeeper is really slow to respond, the unix time should be much earlier than the logged time [1].
> 
> Flink has been improved that checkpointing would be backpressured on slow cleanup [2].
> 
> 
> 
> [1] https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L540 <https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L540>
> [2] https://issues.apache.org/jira/browse/FLINK-17073 <https://issues.apache.org/jira/browse/FLINK-17073>
> 
> Best
> Yun Tang
> 
> 
> 
> From: Robert Metzger <me...@gmail.com>
> Sent: Friday, January 28, 2022 21:53
> To: Paul Lam <pa...@gmail.com>
> Cc: Yun Tang <my...@live.com>; user <us...@flink.apache.org>
> Subject: Re: Inaccurate checkpoint trigger time
>  
> Hi Paul,
> 
> where are you storing your checkpoints, and what's their size?
> 
> IIRC, Flink won't trigger a new checkpoint before the old ones haven't been cleaned up, and if your checkpoints are large and stored on S3, it can take a while to clean them up (especially with the Hadoop S3 plugin, using presto s3 is faster).
> 
> 
> 
> 
> On Thu, Jan 27, 2022 at 10:56 AM Paul Lam <paullin3280@gmail.com <ma...@gmail.com>> wrote:
> Hi Yun,
> 
> Sorry for the late reply. I finally found some time to investigate this problem further. I upgraded the job to 1.14.0, but it’s still the same.
> 
> I’ve checked the debug logs, and I found that Zookeeper notifies watched event of checkpoint id changes very late [1]. Each time a checkpoint finished, it would take minutes before the Zookeeper client notices the checkpoint ID is changed.
> 
> I suspect the checkpoint coordinator is blocking on incrementing checkpoint ID on Zookeeper [2]. But with no luck, there’s no many relevant logs can help me prove that. 
> 
> What do you think of this? Thanks a lot!
> 
> [1] https://gist.github.com/link3280/5072a054a43b40ba28891837a8fdf995 <https://gist.github.com/link3280/5072a054a43b40ba28891837a8fdf995>
> [2] https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L743 <https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L743>
> 
> Best,
> Paul Lam
> 
>> 2021年11月23日 16:49,Paul Lam <paullin3280@gmail.com <ma...@gmail.com>> 写道:
>> 
>> Hi Yun,
>> 
>> Thanks a lot for your pointers! I’ll try it out as you suggested and then get back to you.
>> 
>> Best,
>> Paul Lam
>> 
>>> 2021年11月23日 16:32,Yun Tang <myasuka@live.com <ma...@live.com>> 写道:
>>> 
>>> Hi Paul,
>>> 
>>> This is really weird, from what I know, flink-1.11.0 has a problem of handling min-pause time [1] and this should be resolved in flink-1.12.1.
>>> 
>>> Could you open the debug log level for org.apache.flink.runtime.checkpoint and use jmap or byteman to get the field value of CheckpointCoordinator#lastCheckpointCompletionRelativeTime, CheckpointRequestDecider#minPauseBetweenCheckpoints and SystemClock#relativeTimeMillis in method CheckpointRequestDecider#nextTriggerDelayMillis [2] to see any unexpected behavior.
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/FLINK-18856 <https://issues.apache.org/jira/browse/FLINK-18856>
>>> [2] https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182 <https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182>
>>> 
>>> 
>>> Best
>>> Yun Tang
>>> 
>>> From: Paul Lam <paullin3280@gmail.com <ma...@gmail.com>>
>>> Sent: Tuesday, November 23, 2021 14:35
>>> To: user <user@flink.apache.org <ma...@flink.apache.org>>
>>> Subject: Inaccurate checkpoint trigger time
>>>  
>>> Hi,
>>> 
>>> Recently I’ve noticed a job has nondeterministic checkpoint trigger time. 
>>> 
>>> The jobs is using Flink 1.12.1 with FsStateBackend and is of 650 parallelism. It was configured to trigger checkpoint every 150 seconds with 0 pause time and no concurrent checkpoints. However there’re obvious errors in the checkpoint trigger times, as the actual interval may vary from 30 seconds to 6 minutes.
>>> 
>>> The jobmanager logs are good, and no error logs is found. Some of the output are as follow: 
>>> 
>>> 
>>> 2021-11-23 13:51:46,438 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1446 for job f432b8d90859db54f7a79ff29a563ee4 (47142264825 bytes in 22166 ms).
>>> 
>>> 2021-11-23 13:57:21,021 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job f432b8d90859db54f7a79ff29a563ee4.
>>> 
>>> 2021-11-23 13:57:43,761 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1447 for job f432b8d90859db54f7a79ff29a563ee4 (46563195101 bytes in 21813 ms).
>>> 
>>> 2021-11-23 13:59:09,387 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job f432b8d90859db54f7a79ff29a563ee4.
>>> 
>>> 2021-11-23 13:59:31,370 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1448 for job f432b8d90859db54f7a79ff29a563ee4 (45543757702 bytes in 20354 ms).
>>> 
>>> 2021-11-23 14:06:37,916 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job f432b8d90859db54f7a79ff29a563ee4.
>>> 
>>> 2021-11-23 14:07:03,157 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1449 for job f432b8d90859db54f7a79ff29a563ee4 (45662471025 bytes in 23779 ms).
>>> 
>>> 2021-11-23 14:07:05,838 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job f432b8d90859db54f7a79ff29a563ee4.
>>> 
>>> 2021-11-23 14:07:30,748 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1450 for job f432b8d90859db54f7a79ff29a563ee4 (46916136024 bytes in 22998 ms).
>>> 
>>> 2021-11-23 14:13:09,089 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1451 (type=CHECKPOINT) @ 1637647988831 for job f432b8d90859db54f7a79ff29a563ee4.
>>> 
>>> 2021-11-23 14:13:38,411 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1451 for job f432b8d90859db54f7a79ff29a563ee4 (47439074367 bytes in 27616 ms).
>>> 
>>> 2021-11-23 14:13:38,676 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1452 (type=CHECKPOINT) @ 1637648018481 for job f432b8d90859db54f7a79ff29a563ee4.
>>> 
>>> 2021-11-23 14:14:01,937 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1452 for job f432b8d90859db54f7a79ff29a563ee4 (47046200711 bytes in 21869 ms).
>>> 
>>> 2021-11-23 14:20:04,923 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1453 (type=CHECKPOINT) @ 1637648404722 for job f432b8d90859db54f7a79ff29a563ee4.
>>> 
>>> 2021-11-23 14:20:26,592 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1453 for job f432b8d90859db54f7a79ff29a563ee4 (47481503566 bytes in 20172 ms).
>>> 
>>> 2021-11-23 14:21:54,879 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1454 (type=CHECKPOINT) @ 1637648514668 for job f432b8d90859db54f7a79ff29a563ee4.
>>> 
>>> 2021-11-23 14:22:19,392 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1454 for job f432b8d90859db54f7a79ff29a563ee4 (47106414948 bytes in 22930 ms).
>>> 
>>> 
>>> It looks pretty weird to me. Please help me narrow down the problem if you have any idea.
>>> 
>>> Best,
>>> Paul Lam