Posted to user@flink.apache.org by "Zhijiang(wangzhijiang999)" <wa...@aliyun.com> on 2018/10/10 05:26:14 UTC

Re: Small checkpoint data takes too much time

The checkpoint duration covers both barrier alignment and the state snapshot. Every task has to receive the barriers from all of its input channels before it triggers the state snapshot.
I guess barrier alignment may be taking a long time in your case; this is especially critical under backpressure. You can check the "checkpointAlignmentTime" metric to confirm.
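
To illustrate the point about alignment, here is a minimal sketch in plain Java. It is a toy model, not Flink code: the class name, the method names and the arrival times are all made up for illustration, and alignment time is assumed here to be the span between the first and the last barrier arriving at one task.

```java
import java.util.Arrays;

// Toy model (not Flink code): one task with several input channels.
// All names and numbers here are hypothetical.
final class BarrierAlignmentSketch {

    // Assumed definition: time from the first barrier arriving on any
    // channel until the last barrier has arrived.
    static long alignmentMillis(long[] barrierArrivalMillis) {
        long first = Arrays.stream(barrierArrivalMillis).min().getAsLong();
        long last = Arrays.stream(barrierArrivalMillis).max().getAsLong();
        return last - first;
    }

    // The snapshot can only start once every channel has delivered its barrier.
    static long snapshotStartMillis(long[] barrierArrivalMillis) {
        return Arrays.stream(barrierArrivalMillis).max().getAsLong();
    }

    // Time from the checkpoint trigger (t = 0) until this task finishes its snapshot.
    static long taskDurationMillis(long[] barrierArrivalMillis, long snapshotMillis) {
        return snapshotStartMillis(barrierArrivalMillis) + snapshotMillis;
    }

    public static void main(String[] args) {
        // Barrier arrival times per channel, in ms after the trigger.
        // The third channel is slow, for example because of backpressure.
        long[] arrivals = {100, 250, 4100};
        System.out.println("alignment ms: " + alignmentMillis(arrivals));
        System.out.println("task duration ms: " + taskDurationMillis(arrivals, 50));
    }
}
```

In this model a single slow channel delays the snapshot for the whole task, however small the state is.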

Best,
Zhijiang
------------------------------------------------------------------
From: 徐涛 <ha...@gmail.com>
Sent: Wednesday, October 10, 2018, 13:13
To: user <us...@flink.apache.org>
Subject: Small checkpoint data takes too much time

Hi,
 I recently encountered a problem in production: checkpoints take too much time, although this doesn't affect job execution.
 I am using FsStateBackend with asynchronousSnapshots, writing the data to an HDFS checkpointDataUri. I print the metrics “lastCheckpointDuration” and “lastCheckpointSize”. They show that “lastCheckpointSize” is about 80KB, but “lastCheckpointDuration” is about 160s! Because the checkpoint data is small, I don't think it should take that long. I do not know why this happens or which conditions may influence the checkpoint time. Has anyone encountered such a problem?
 Thanks a lot.

Best
Henry


Re: Small checkpoint data takes too much time

Posted by 徐涛 <ha...@gmail.com>.
Hi Zhijiang,
	Thanks for your response.
	I added the checkpointAlignmentTime metric. The data shows that the checkpointDuration is about 150s, while the checkpointAlignmentTime is about 4s. There is a big gap between them.

Best
Henry
