Posted to user-zh@flink.apache.org by Zhefu PENG <pe...@gmail.com> on 2020/06/10 05:03:45 UTC

Flink job checkpoint cannot complete the snapshot, and a Kafka exception is reported

Hi all,

We have a fairly simple Flink job; after operator chaining, the execution graph is roughly:
Source: Custom Source -> Map -> Source_Map -> Empty_Filer -> Field_Filter
-> Type_Filter -> Value_Filter -> Map -> Map -> Map -> Sink: Unnamed

After running in production for a while, the job started failing. The logs say the checkpoint could not be completed, and they also mention Kafka network and connection exceptions. However, other Flink jobs read from and write to the same brokers without any errors. Our tentative theory is that each checkpoint takes fairly long to complete (a few hundred milliseconds) while the checkpoint interval we configured is very short (only one second), and that this may be hurting the job. But this is only a rough guess and we have no clear starting point for troubleshooting, so we would appreciate any thoughts or suggestions. Many thanks.
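For context, the settings in question live on the StreamExecutionEnvironment. A minimal sketch of what a longer interval with some pause between checkpoints would look like (the 60s/30s values and the failure count are illustrative only, not our production settings):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettingsSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds instead of every second (illustrative value).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Leave some breathing room between the end of one checkpoint and the start of the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);

        // Abort a checkpoint that takes longer than 10 minutes instead of letting it hang.
        env.getCheckpointConfig().setCheckpointTimeout(600_000L);

        // Tolerate a few checkpoint failures before the job is failed over
        // (the "tolerable failure threshold" mentioned in the log below).
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}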

Part of the error log is as follows:
2020-06-10 12:02:49,083 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering
checkpoint 1 @ 1591761769060 for job c41f4811262db1c4c270b136571c8201.
2020-06-10 12:04:47,898 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Decline
checkpoint 1 by task 0cb03590fdf18027206ef628b3ef5863 of job
c41f4811262db1c4c270b136571c8201 at
container_e27_1591466310139_21670_01_000006 @
hdp1-hadoop-datanode-4.novalocal (dataPort=44778).
2020-06-10 12:04:47,899 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Discarding
checkpoint 1 of job c41f4811262db1c4c270b136571c8201.
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete
snapshot 1 for operator Source: Custom Source -> Map -> Source_Map ->
Empty_Filer -> Field_Filter -> Type_Filter -> Value_Filter -> Map -> Map ->
Map -> Sink: Unnamed (7/12). Failure reason: Checkpoint was declined.
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:434)
at
org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1420)
at
org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1354)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:991)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:887)
at
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:860)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:793)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:777)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.streaming.connectors.kafka.FlinkKafkaException:
Failed to send data to Kafka: The server disconnected before a response was
received.
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1218)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:973)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:892)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:98)
at
org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.snapshotState(TwoPhaseCommitSinkFunction.java:317)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.snapshotState(FlinkKafkaProducer.java:978)
at
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
at
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
at
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:90)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotStat(AbstractStreamOperator.java:402)
... 18 more
Caused by: org.apache.kafka.common.errors.NetworkException: The server
disconnected before a response was received.
2020-06-10 12:04:47,913 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                 - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable
failure threshold.
at
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleTaskLevelCheckpointException(CheckpointFailureManager.java:87)
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpointDueToTaskFailure(CheckpointCoordinator.java:1467)
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.discardCheckpoint(CheckpointCoordinator.java:1377)
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:719)
at
org.apache.flink.runtime.scheduler.SchedulerBase.lambda$declineCheckpoint$5(SchedulerBase.java:807)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Looking forward to your replies and help.
Best,
Zhefu

Re: Flink job checkpoint cannot complete the snapshot, and a Kafka exception is reported

Posted by 李奇 <35...@qq.com>.
Hello. From my own experience with checkpoint failures, they are usually caused by problems in the data that make an operator fail: bad data format, mismatched field types, a wrong number of fields, and so on. From the extra information you posted it looks like your Kafka data might have a problem, so you could start there and check whether the data is normal and whether it parses correctly.
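Purely as an illustration (the '|' delimiter and the field count below are made up, not taken from your job), a defensive parse step can drop malformed records instead of letting the operator fail:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

/** Hypothetical sketch: silently drop records that do not have the expected number of fields. */
public class SafeParseMap implements FlatMapFunction<String, String[]> {
    private static final int EXPECTED_FIELDS = 8;  // made-up value, adjust to the real schema

    @Override
    public void flatMap(String raw, Collector<String[]> out) {
        try {
            String[] fields = raw.split("\\|", -1);  // assumed delimiter, purely illustrative
            if (fields.length == EXPECTED_FIELDS) {
                out.collect(fields);
            }
            // else: malformed record dropped; a side output or metric counter could record it
        } catch (Exception e) {
            // never let a single bad record fail the whole task
        }
    }
}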

> On Jun 10, 2020, at 1:24 PM, Zhefu PENG <pe...@gmail.com> wrote:
> 
> A follow-up: the following error log was found on the TaskManager:
> 
> 2020-06-10 12:44:40,688 ERROR
> org.apache.flink.streaming.runtime.tasks.StreamTask           - Error
> during disposal of stream operator.
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to
> send data to Kafka: Pending record count must be zero at this point: 5
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1218)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:861)
> at
> org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
> at
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:668)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:579)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:481)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: Pending record count must be
> zero at this point: 5
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:969)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:834)
> ... 8 more
> 
> Hoping for some help. Thanks!


Re: Flink job checkpoint cannot complete the snapshot, and a Kafka exception is reported

Posted by Leonard Xu <xb...@gmail.com>.
> Reporting back so that anyone who runs into a similar problem has a reference, and to close the loop on this issue. Thanks to everyone for the attention and help.
> 
> Best,
> Zhefu

Thanks Zhefu, a big thumbs-up to you. This is very much the community way of doing things; the more such write-ups accumulate, the more everyone here can learn.

Best regards,
Leonard Xu
 




Re: Flink job checkpoint cannot complete the snapshot, and a Kafka exception is reported

Posted by Congxian Qiu <qc...@gmail.com>.
Hi Zhefu

Thanks for sharing your solution on the mailing list; it gives others who run into a similar problem a reference to follow.

Best,
Congxian


Zhefu PENG <pe...@gmail.com> wrote on Mon, Jul 13, 2020 at 7:51 PM:

> Hi all,
>
> It has been a month since this email was first sent. Over that month we tried many of the suggestions from friends and experts on this list, and after watching the job over a weekend plus two working days, the problem appears to be resolved.
>
> Root cause: insufficient capacity of the Kafka cluster (we suspect CPU overload). When the problem occurred, the production Kafka cluster had only seven machines; after ruling out every other cause and exhausting the options available to us, we decided to scale it out to 15 machines. Since then the job has been running smoothly and no similar errors have been reported.
>
> Reporting back so that anyone who runs into a similar problem has a reference, and to close the loop on this issue. Thanks to everyone for the attention and help.
>
> Best,
> Zhefu

Re: Flink job checkpoint cannot complete the snapshot, and a Kafka exception is reported

Posted by Zhefu PENG <pe...@gmail.com>.
Hi all,

It has been a month since this email was first sent. Over that month we tried many of the suggestions from friends and experts on this list, and after watching the job over a weekend plus two working days, the problem appears to be resolved.

Root cause: insufficient capacity of the Kafka cluster (we suspect CPU overload). When the problem occurred, the production Kafka cluster had only seven machines; after ruling out every other cause and exhausting the options available to us, we decided to scale it out to 15 machines. Since then the job has been running smoothly and no similar errors have been reported.
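For anyone who cannot scale out their Kafka cluster right away, one mitigation that may be worth experimenting with (we did not verify this ourselves, and the values below are illustrative only) is to make the producer more tolerant of a slow broker via the Kafka client properties handed to FlinkKafkaProducer:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class ProducerTuningSketch {
    public static FlinkKafkaProducer<String> buildProducer() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder addresses
        props.setProperty("retries", "5");                  // retry transient send failures
        props.setProperty("request.timeout.ms", "60000");   // give a loaded broker more time to respond
        props.setProperty("delivery.timeout.ms", "180000"); // overall per-record time budget

        // "sink-topic" and SimpleStringSchema are placeholders; an exactly-once setup would use
        // the constructor variant that also takes FlinkKafkaProducer.Semantic.EXACTLY_ONCE.
        return new FlinkKafkaProducer<>("sink-topic", new SimpleStringSchema(), props);
    }
}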

Reporting back so that anyone who runs into a similar problem has a reference, and to close the loop on this issue. Thanks to everyone for the attention and help.

Best,
Zhefu


Re: Flink job checkpoint cannot complete the snapshot, and a Kafka exception is reported

Posted by LakeShen <sh...@gmail.com>.
Hi ZheFu,

Could you share your Flink version? My rough understanding is this: each time the sink performs snapshotState, it checks whether all the data written during that checkpoint has actually been sent to Kafka.

In other words, because your checkpoint interval is quite short, Kafka had not finished acknowledging all of the outstanding records by the time this checkpoint ran, hence the problem. I would suggest using a longer checkpoint interval.

For the concrete code, see the FlinkKafkaProducerBase.snapshotState method.
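Roughly, the check behaves like the simplified sketch below (a paraphrase, not the actual Flink source): the sink counts every record handed to the Kafka client, decrements the counter in the producer's send callback, and refuses to complete the snapshot while the counter is non-zero.

import java.util.concurrent.atomic.AtomicLong;

/** Simplified paraphrase of the sink's flush check, NOT the real Flink implementation. */
class PendingRecordsSketch {
    private final AtomicLong pendingRecords = new AtomicLong();

    void onSend() { pendingRecords.incrementAndGet(); }  // record handed to the Kafka client
    void onAck()  { pendingRecords.decrementAndGet(); }  // producer callback fired on broker ack

    /** Invoked from snapshotState / preCommit before the checkpoint can be acknowledged. */
    void flushCheck() {
        // In the real connector the KafkaProducer is flushed first; if acks are still
        // outstanding afterwards, the snapshot is declined with exactly this message:
        long pending = pendingRecords.get();
        if (pending != 0) {
            throw new IllegalStateException(
                    "Pending record count must be zero at this point: " + pending);
        }
    }
}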

Best,
LakeShen


Re: Flink job checkpoint cannot complete snapshot, and a Kafka exception is reported

Posted by Congxian Qiu <qc...@gmail.com>.
Hi

From the error stack, the snapshot failed on the task side; the root cause is "Caused by:
java.lang.IllegalStateException: Pending record count must be zero at this point: 5".
You need to look into why the producer ends up in that state.
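
While you investigate, it can also help to give checkpointing more breathing room, so that a single
declined snapshot does not immediately escalate into "Exceeded checkpoint tolerable failure threshold"
as in your log. A rough sketch of the relevant settings (the interval, timeout and threshold values
below are only illustrative, not tuned recommendations for your job):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s instead of every second (illustrative value).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig conf = env.getCheckpointConfig();
        // Keep a gap between the end of one checkpoint and the start of the next.
        conf.setMinPauseBetweenCheckpoints(30_000L);
        // Declare a checkpoint failed if it does not finish within 10 minutes.
        conf.setCheckpointTimeout(600_000L);
        // Never run more than one checkpoint at a time.
        conf.setMaxConcurrentCheckpoints(1);
        // Tolerate a few declined checkpoints before failing the whole job.
        conf.setTolerableCheckpointFailureNumber(3);

        // ... build the source -> map/filter chain -> Kafka sink here and call env.execute().
    }
}

With the 1-second interval you mentioned and checkpoints that already take a few hundred milliseconds,
there is almost no pause between checkpoints, so a longer interval together with
setMinPauseBetweenCheckpoints is usually the first thing to try.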

Best,
Congxian


李奇 <35...@qq.com> wrote on Wed, Jun 10, 2020, at 5:57 PM:

>
> Hello. In my own experience, a failed checkpoint is usually caused by bad data that makes an operator fail,
> for example a wrong data format, mismatched field types, or an unexpected number of fields. From the extra
> information you posted, it looks like your Kafka data may have a problem, so it is worth checking whether
> the data is normal and whether it is parsed correctly.
>
> > On Jun 10, 2020, at 1:24 PM, Zhefu PENG <pe...@gmail.com> wrote:
> >
> > A follow-up: the following error log was found on the TaskManager:
> >
> > 2020-06-10 12:44:40,688 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during disposal of stream operator.
> > org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: Pending record count must be zero at this point: 5
> > at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1218)
> > at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:861)
> > at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
> > at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
> > at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:668)
> > at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:579)
> > at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:481)
> > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
> > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
> > at java.lang.Thread.run(Thread.java:748)
> > Caused by: java.lang.IllegalStateException: Pending record count must be zero at this point: 5
> > at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:969)
> > at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:834)
> > ... 8 more
> >
> > Hoping for some help, thank you!

Re: Flink job checkpoint cannot complete snapshot, and a Kafka exception is reported

Posted by Zhefu PENG <pe...@gmail.com>.
A follow-up: the following error log was found on the TaskManager:

2020-06-10 12:44:40,688 ERROR
org.apache.flink.streaming.runtime.tasks.StreamTask           - Error
during disposal of stream operator.
org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to
send data to Kafka: Pending record count must be zero at this point: 5
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1218)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:861)
at
org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
at
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:668)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:579)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:481)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Pending record count must be
zero at this point: 5
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:969)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:834)
... 8 more

Hoping for some help, thank you!
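
For reference, a simplified sketch of how a FlinkKafkaProducer sink like ours is typically created
(the topic name, broker list and timeout values are placeholders, not our real configuration):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaSinkSketch {

    public static FlinkKafkaProducer<String> buildSink() {
        Properties props = new Properties();
        // Placeholder broker list.
        props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092");
        // Give the producer more room to ride out short broker disconnects.
        props.setProperty("request.timeout.ms", "60000");
        props.setProperty("retries", "5");
        props.setProperty("delivery.timeout.ms", "120000");

        // This constructor defaults to Semantic.AT_LEAST_ONCE: pending records are
        // flushed on every checkpoint (preCommit) and again when the sink is closed,
        // which is where the errors above are thrown.
        return new FlinkKafkaProducer<>(
                "output-topic",           // placeholder topic
                new SimpleStringSchema(),
                props);
    }
}

If the sink is switched to Semantic.EXACTLY_ONCE, the producer's transaction.timeout.ms also has to be
larger than the checkpoint interval plus the flush time (and is capped by the broker's
transaction.max.timeout.ms), otherwise checkpoints can fail in a similar way.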


Zhefu PENG <pe...@gmail.com> wrote on Wed, Jun 10, 2020, at 1:03 PM:

> Hi all,
>
> There is a fairly simple Flink job whose execution graph, after chaining, is roughly:
> Source: Custom Source -> Map -> Source_Map -> Empty_Filer -> Field_Filter
> -> Type_Filter -> Value_Filter -> Map -> Map -> Map -> Sink: Unnamed
>
>
> After running online for a while, the job started to fail. The log says checkpoints cannot be completed,
> and it also reports Kafka network and connection exceptions. However, other Flink jobs read and write
> against the same brokers without any errors. Our tentative guess is that each checkpoint takes fairly long
> to complete (several hundred milliseconds) while the configured interval is very short (only one second),
> and that this may be hurting the job. But this is only a rough guess and we have no real starting point
> for troubleshooting, so we would like to hear your opinions or suggestions. Thanks a lot.
>
> Part of the error log is as follows:
> 2020-06-10 12:02:49,083 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering
> checkpoint 1 @ 1591761769060 for job c41f4811262db1c4c270b136571c8201.
> 2020-06-10 12:04:47,898 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Decline
> checkpoint 1 by task 0cb03590fdf18027206ef628b3ef5863 of job
> c41f4811262db1c4c270b136571c8201 at
> container_e27_1591466310139_21670_01_000006 @
> hdp1-hadoop-datanode-4.novalocal (dataPort=44778).
> 2020-06-10 12:04:47,899 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Discarding
> checkpoint 1 of job c41f4811262db1c4c270b136571c8201.
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not
> complete snapshot 1 for operator Source: Custom Source -> Map -> Source_Map
> -> Empty_Filer -> Field_Filter -> Type_Filter -> Value_Filter -> Map -> Map
> -> Map -> Sink: Unnamed (7/12). Failure reason: Checkpoint was declined.
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:434)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1420)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1354)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:991)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:887)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:860)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:793)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:777)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
> at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
> at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
> at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
> at java.lang.Thread.run(Thread.java:748)
> Caused by:
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to
> send data to Kafka: The server disconnected before a response was received.
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1218)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:973)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:892)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:98)
> at
> org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.snapshotState(TwoPhaseCommitSinkFunction.java:317)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.snapshotState(FlinkKafkaProducer.java:978)
> at
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
> at
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
> at
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:90)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:402)
> ... 18 more
> Caused by: org.apache.kafka.common.errors.NetworkException: The server
> disconnected before a response was received.
> 2020-06-10 12:04:47,913 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                  - Trying to recover from a global failure.
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable
> failure threshold.
> at
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleTaskLevelCheckpointException(CheckpointFailureManager.java:87)
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpointDueToTaskFailure(CheckpointCoordinator.java:1467)
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.discardCheckpoint(CheckpointCoordinator.java:1377)
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:719)
> at
> org.apache.flink.runtime.scheduler.SchedulerBase.lambda$declineCheckpoint$5(SchedulerBase.java:807)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> Looking forward to your replies and help.
> Best,
> Zhefu
>