Posted to user@flink.apache.org by 徐涛 <ha...@gmail.com> on 2018/10/24 09:02:57 UTC
Checkpoint acknowledge takes too long
Hi
I am running a Flink application with parallelism 64. I left the checkpoint timeout at its default value of 10 minutes. The state size is less than 1 MB, and I am using the FsStateBackend.
The application triggers checkpoints, but all of them fail with "Checkpoint expired before completing". In the checkpoint history I found that 63 subtasks acknowledged but one was left as n/a, and the alignment duration is also quite long, about 5m27s.
I want to know why one subtask does not acknowledge. Also, since the alignment duration is long, what influences the alignment duration?
Thanks a lot.
Best
Henry
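
As background for the alignment question above: with aligned checkpoints, a subtask can acknowledge only after a barrier has arrived on every one of its input channels, and the alignment duration is the gap between the first and the last of those arrivals. The sketch below is a simplified model of that definition, not Flink's implementation; the class name, method name, and arrival times are hypothetical.

```java
import java.util.Arrays;

// Simplified model of checkpoint barrier alignment; NOT Flink's actual code.
final class AlignmentModel {

    /**
     * Alignment time for one subtask: the time between the first and the last
     * barrier arriving on its input channels (arrival times in milliseconds).
     */
    static long alignmentMillis(long[] barrierArrivalMillis) {
        if (barrierArrivalMillis.length == 0) {
            throw new IllegalArgumentException("need at least one input channel");
        }
        long first = Arrays.stream(barrierArrivalMillis).min().getAsLong();
        long last = Arrays.stream(barrierArrivalMillis).max().getAsLong();
        return last - first;
    }

    public static void main(String[] args) {
        // Hypothetical arrivals: 63 channels deliver the barrier quickly,
        // one channel delivers it 5m27s (327,000 ms) after the first.
        long[] arrivals = new long[64];
        for (int i = 0; i < 63; i++) {
            arrivals[i] = i * 3L;
        }
        arrivals[63] = 327_000L;
        System.out.println(alignmentMillis(arrivals) + " ms");
    }
}
```

In this model a single late channel determines the whole alignment time, which is why one slow path can hold back an otherwise healthy checkpoint.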
Re: Checkpoint acknowledge takes too long
Posted by Hequn Cheng <ch...@gmail.com>.
Hi Henry,
Thanks for letting us know.
Re: Checkpoint acknowledge takes too long
Posted by 徐涛 <ha...@gmail.com>.
Hi Hequn & Kien,
The problem is finally solved.
It was caused by slow sink writes. Because the job has only 2 tasks, I checked the backpressure and found that the source had high backpressure, so I improved the sink write path. After that, the end-to-end checkpoint duration is below 1s and the checkpoint timeout is fixed.
Best
Henry
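
A rough way to see why a slow sink stalls checkpoints: with aligned checkpoints the barrier travels in-band behind the records already buffered ahead of it, so it reaches the sink only after those records have been written. The sketch below is toy arithmetic under that assumption; the class name, the parameters, and all numbers are hypothetical and are not measurements from the job in this thread.

```java
// Toy arithmetic only; the numbers are hypothetical, not taken from this job.
final class BarrierDelayModel {

    /**
     * Time until a barrier queued behind `queuedRecords` records reaches the
     * sink, if the sink writes `batchSize` records per call and each call
     * takes `millisPerWrite` milliseconds.
     */
    static long barrierDelayMillis(long queuedRecords, int batchSize, long millisPerWrite) {
        if (queuedRecords < 0 || batchSize <= 0 || millisPerWrite < 0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        long writes = (queuedRecords + batchSize - 1) / batchSize; // ceiling division
        return writes * millisPerWrite;
    }

    public static void main(String[] args) {
        // 30,000 buffered records, one record per 10 ms write: 300,000 ms.
        System.out.println(barrierDelayMillis(30_000, 1, 10));
        // Same backlog, 500 records per 20 ms write: 60 writes, 1,200 ms.
        System.out.println(barrierDelayMillis(30_000, 500, 20));
    }
}
```

Batching is only one possible way to speed up a sink; the thread does not say which change was actually made.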
> On Oct 24, 2018, at 10:43 PM, 徐涛 <ha...@gmail.com> wrote:
>
> Hequn & Kien,
> Thanks a lot for your help. I will try it later.
>
> Best
> Henry
Re: Checkpoint acknowledge takes too long
Posted by Hequn Cheng <ch...@gmail.com>.
Hi Henry,
@Kien is right. Take a thread dump to see what the TaskManager was doing.
Also check whether GC happens frequently.
Best, Hequn
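
For readers who have not taken a thread dump before, the sketch below shows the kind of information involved, using only the standard java.lang.management API. It inspects the JVM it runs in, so it is an illustration rather than a TaskManager tool; against a running TaskManager one would typically use the JDK's `jstack <pid>` instead. The class and method names are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustration only: reports on the current JVM, not on a remote TaskManager.
final class JvmSnapshot {

    /** Returns the name, state, and stack trace of every live thread. */
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked-monitor and locked-synchronizer details.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    /** Returns cumulative collection count and time for each garbage collector. */
    static String gcSummary() {
        StringBuilder out = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.append(gc.getName())
               .append(": count=").append(gc.getCollectionCount())
               .append(", timeMillis=").append(gc.getCollectionTime())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(threadDump());
        System.out.print(gcSummary());
    }
}
```

Comparing two GC summaries taken a minute apart gives a quick sense of how much time is going to collection; a task thread that shows the same stack in several consecutive dumps is a candidate for the stuck subtask.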
Re: Checkpoint acknowledge takes too long
Posted by Kien Truong <du...@gmail.com>.
Hi,
In my experience, this is most likely because one sub-task is blocked
doing some long-running operation.
Try running the task manager with a profiler (like VisualVM) and check
for hot spots.
Regards,
Kien
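
The idea behind looking for hot spots can be sketched as a crude sampler: take repeated stack samples and count which method is on top of each thread's stack most often. This is a minimal illustration using only standard Java APIs, not how VisualVM works internally; the class name, method names, and the "busy-subtask" thread are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Crude sampling sketch; a real profiler such as VisualVM does far more.
final class HotSpotSampler {

    private static volatile boolean running = true;

    /** Counts how often each "Class.method" is the top frame across samples. */
    static Map<String, Integer> sampleTopFrames(int samples, long intervalMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                // Skip the sampling thread itself and threads without Java frames.
                if (e.getKey() == Thread.currentThread() || stack.length == 0) {
                    continue;
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return counts;
    }

    // Stand-in for a sub-task stuck in a long-running operation.
    private static void spin() {
        long n = 0;
        while (running) {
            n++;
        }
    }

    public static void main(String[] args) {
        Thread busy = new Thread(HotSpotSampler::spin, "busy-subtask");
        busy.setDaemon(true);
        busy.start();
        Map<String, Integer> counts = sampleTopFrames(20, 50L);
        running = false;
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(5)
              .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Frames that dominate the counts are where threads spend their time, whether computing or waiting, which is the same signal a profiler's hot-spot view provides.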