Posted to user@flink.apache.org by Xiangyu Su <xi...@smaato.com> on 2021/09/01 07:52:26 UTC

Checkpointing failure, subtasks get stuck

Hello everyone,
We have been facing checkpointing failures since version 1.9; we are
currently running version 1.13.2.

We use the filesystem state backend (S3) with a 10-minute checkpoint
timeout; a checkpoint usually takes 10-30 seconds.
Sometimes, however, the job fails and restarts because of a checkpoint
timeout, without any large increase in incoming data. I have also seen the
checkpointing progress of some subtasks get stuck at, e.g., 7% for 10 minutes.
My guess is that the thread doing the checkpointing somehow gets blocked.

Any suggestions or ideas would be helpful, thanks.


Best Regards,
-- 
Xiangyu Su
Java Developer
xiangyu@smaato.com

Smaato Inc.
San Francisco - New York - Hamburg - Singapore
www.smaato.com

Germany:

Barcastraße 5

22087 Hamburg

Germany
M 0049(176)43330282

The information contained in this communication may be CONFIDENTIAL and is
intended only for the use of the recipient(s) named above. If you are not
the intended recipient, you are hereby notified that any dissemination,
distribution, or copying of this communication, or any of its contents, is
strictly prohibited. If you have received this communication in error,
please notify the sender and delete/destroy the original message and any
copy of it from your computer or paper files.

Re: Checkpointing failure, subtasks get stuck

Posted by JING ZHANG <be...@gmail.com>.
Hi Xiangyu Su,
Because detailed information is lacking, I can only offer some
troubleshooting ideas. I hope they are helpful.
1. Find out which checkpoint expired. You can find that in the web UI [1] or
in `jobmanager.log`.
2. Find out which operators had not yet finished the checkpoint when it
expired. You can find that in the checkpoint details in the web UI [1].
3. Find out which stage of the expired operator is slow: alignment duration,
sync duration, or async duration [1].
    If the operator spent a long time in alignment, check whether the job
is under back pressure. You can see that in the BackPressure part of the
web UI. You can enable unaligned checkpoints [2] to greatly reduce
checkpointing times under back pressure.
    If the operator spent a long time in the async phase, check whether
there is a network problem.
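
The three steps above could be sketched roughly as follows. This is only an illustration in Python, not part of Flink: `SubtaskStats`, `triage`, the field names, and the 60-second `slow_ms` threshold are all hypothetical, and the numbers are assumed to be copied by hand from the checkpoint details in the web UI [1]. The thread gives no advice for a slow sync phase, so the hint for it is left open.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SubtaskStats:
    """Per-subtask numbers for one checkpoint (hypothetical structure)."""
    index: int
    acknowledged: bool
    align_ms: int = 0
    sync_ms: int = 0
    async_ms: int = 0


# Hints taken from the steps above; nothing is suggested there for "sync".
HINTS = {
    "alignment": "check for back pressure; unaligned checkpoints may help",
    "sync": "no specific advice given in the steps above",
    "async": "check for network problems",
}


def dominant_phase(stats: SubtaskStats) -> str:
    """Return the phase that took the longest for this subtask."""
    phases = {
        "alignment": stats.align_ms,
        "sync": stats.sync_ms,
        "async": stats.async_ms,
    }
    return max(phases, key=phases.get)


def triage(subtasks: List[SubtaskStats],
           slow_ms: int = 60_000) -> List[Tuple[int, str, str]]:
    """List (subtask index, finding, hint) for suspicious subtasks.

    A subtask is reported if it never acknowledged the checkpoint, or if
    its phases add up to at least slow_ms (an assumed threshold).
    """
    findings = []
    for s in subtasks:
        if not s.acknowledged:
            findings.append(
                (s.index, "pending", "had not acknowledged at expiry"))
            continue
        total = s.align_ms + s.sync_ms + s.async_ms
        if total >= slow_ms:
            phase = dominant_phase(s)
            findings.append((s.index, phase, HINTS[phase]))
    return findings


if __name__ == "__main__":
    sample = [
        SubtaskStats(0, True, align_ms=200, sync_ms=50, async_ms=3_000),
        SubtaskStats(1, True, align_ms=480_000, sync_ms=100, async_ms=5_000),
        SubtaskStats(2, False),
        SubtaskStats(3, True, align_ms=300, sync_ms=100, async_ms=120_000),
    ]
    for index, finding, hint in triage(sample):
        print(f"subtask {index}: {finding} ({hint})")
```

In this made-up sample, subtask 0 is healthy and is not reported, subtask 1 points at alignment, subtask 2 never acknowledged, and subtask 3 points at the async phase.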

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/monitoring/checkpoint_monitoring/
[2]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/unaligned_checkpoints/

Best,
JING ZHANG
