You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user-zh@flink.apache.org by kong <62...@163.com> on 2022/03/03 13:15:37 UTC

flink 反压导致checkpoint超时，从而导致任务失败问题

hello,我最近遇到一个问题：
我通过flink消费kafka数据，job 图大概是这样的：Source -> map -> filter -> flatMap -> Map -> Sink
在一瞬间kafka的producer端会产生大量的数据，导致flink无法消费完，我的checkpoint设置的是10分钟；
最后会产生Checkpoint expired before completing.的错误，导致job重启，从而导致从上一个checkpoint恢复，然后重复消费数据，又导致checkpoint超时，死循环了。


不知道有什么好办法解决该问题。
多谢~

Re: flink 反压导致checkpoint超时，从而导致任务失败问题

Posted by Zhilong Hong <zh...@gmail.com>.

Hi, kong:

对于Checkpoint超时，建议调大反压处下游算子的并发度，避免出现长时间反压。

在配置方面，如果希望调大checkpoint超时时间，可以将execution.checkpointing.timeout配置项 [1]
的值调大，目前默认值为"10
min"，即10分钟会超时。此外还可以调大execution.checkpointing.tolerable-failed-checkpoints配置项
[2] 的值，容忍更多的checkpoint失败。

这些参数项只能减少checkpoint失败导致failover的次数，最佳方案是从作业入手，增加并发和资源或优化作业逻辑，避免长时间反压。在checkpoint做不出来的情况下，如果出现failover的话需要回刷大量数据，会对服务可用性造成影响。

祝好，
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#execution-checkpointing-timeout
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#execution-checkpointing-tolerable-failed-checkpoints

On Fri, Mar 4, 2022 at 10:48 AM yu'an huang <h....@gmail.com> wrote:

> 你好，我检查了下关于checkpoint的文档：
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
> <
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
> >
> tolerable checkpoint failure number: This defines how many consecutive
> checkpoint failures will be tolerated, before the whole job is failed over.
> The default value is 0, which means no checkpoint failures will be
> tolerated, and the job will fail on first reported checkpoint failure.
>
> 可以为作业设置容忍checkpoint失败的, 你可以像文档中说加下相关设置：
> // only two consecutive checkpoint failures are tolerated
> env.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);
> 希望可以帮到你
>
>
>
>
> > On 4 Mar 2022, at 10:07 AM, yu'an huang <h.yuan667@gmail.com <mailto:
> h.yuan667@gmail.com>> wrote:
> >
> > 你好，checkpoint超时默认不会导致作业重启，可以提供下JM log看看作业为什么会重启吗？
> >
> >> On 3 Mar 2022, at 9:15 PM, kong <624dj@163.com <ma...@163.com>>
> wrote:
> >>
> >> hello,我最近遇到一个问题：
> >> 我通过flink消费kafka数据，job 图大概是这样的：Source -> map -> filter -> flatMap -> Map
> -> Sink
> >> 在一瞬间kafka的producer端会产生大量的数据，导致flink无法消费完，我的checkpoint设置的是10分钟；
> >> 最后会产生Checkpoint expired before
> completing.的错误，导致job重启，从而导致从上一个checkpoint恢复，然后重复消费数据，又导致checkpoint超时，死循环了。
> >>
> >>
> >> 不知道有什么好办法解决该问题。
> >> 多谢~
> >>
> >
>
>

Re: flink 反压导致checkpoint超时，从而导致任务失败问题

Posted by yu'an huang <h....@gmail.com>.

你好，我检查了下关于checkpoint的文档：https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/ <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/>
tolerable checkpoint failure number: This defines how many consecutive checkpoint failures will be tolerated, before the whole job is failed over. The default value is 0, which means no checkpoint failures will be tolerated, and the job will fail on first reported checkpoint failure.

可以为作业设置容忍checkpoint失败的, 你可以像文档中说加下相关设置：
// only two consecutive checkpoint failures are tolerated
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);
希望可以帮到你

> On 4 Mar 2022, at 10:07 AM, yu'an huang <h.yuan667@gmail.com <ma...@gmail.com>> wrote:
> 
> 你好，checkpoint超时默认不会导致作业重启，可以提供下JM log看看作业为什么会重启吗？
> 
>> On 3 Mar 2022, at 9:15 PM, kong <624dj@163.com <ma...@163.com>> wrote:
>> 
>> hello,我最近遇到一个问题：
>> 我通过flink消费kafka数据，job 图大概是这样的：Source -> map -> filter -> flatMap -> Map -> Sink
>> 在一瞬间kafka的producer端会产生大量的数据，导致flink无法消费完，我的checkpoint设置的是10分钟；
>> 最后会产生Checkpoint expired before completing.的错误，导致job重启，从而导致从上一个checkpoint恢复，然后重复消费数据，又导致checkpoint超时，死循环了。
>> 
>> 
>> 不知道有什么好办法解决该问题。
>> 多谢~
>> 
>

Re: flink 反压导致checkpoint超时，从而导致任务失败问题

Posted by yu'an huang <h....@gmail.com>.

你好，checkpoint超时默认不会导致作业重启，可以提供下JM log看看作业为什么会重启吗？

> On 3 Mar 2022, at 9:15 PM, kong <62...@163.com> wrote:
> 
> hello,我最近遇到一个问题：
> 我通过flink消费kafka数据，job 图大概是这样的：Source -> map -> filter -> flatMap -> Map -> Sink
> 在一瞬间kafka的producer端会产生大量的数据，导致flink无法消费完，我的checkpoint设置的是10分钟；
> 最后会产生Checkpoint expired before completing.的错误，导致job重启，从而导致从上一个checkpoint恢复，然后重复消费数据，又导致checkpoint超时，死循环了。
> 
> 
> 不知道有什么好办法解决该问题。
> 多谢~
>