Posted to user-zh@flink.apache.org by "zhangzq@eastcom-sw.com" <zh...@eastcom-sw.com> on 2023/05/04 06:54:40 UTC

checkpoint Kafka Offset commit failed

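Hi, with Flink (1.14 and 1.16), committing Kafka offsets on checkpoint (10 s interval) fails with "The coordinator is not available".

The Kafka cluster logs all look normal, and offsets can be committed manually without problems. After the Flink job is restarted, commits succeed again, but after running for a while they start failing again. Are there any parameters that could be tuned?

The Flink log is as follows: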
2023-05-04 11:31:02,636 WARN  org.apache.flink.connector.kafka.source.reader.KafkaSourceReader [] - Failed to commit consumer offsets for checkpoint 69153
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing the latest consumed offsets.
Caused by: org.apache.kafka.common.errors.CoordinatorNotAvailableException: The coordinator is not available.
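
The exception text above asks the caller to retry the commit. As a rough illustration of what a bounded retry with a fixed backoff looks like in plain Java, here is a minimal sketch. It is not Flink's or Kafka's code: `RetriableException` and `commitWithRetry` are hypothetical names, and the `Runnable` only stands in for a real offset commit.

```java
final class RetriableCommitSketch {

    /** Hypothetical stand-in for a retriable failure such as "The coordinator is not available." */
    static final class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    /**
     * Runs the commit action up to maxAttempts times, pausing backoffMillis between attempts.
     * Returns the number of attempts used; rethrows the last retriable failure if every attempt fails.
     */
    static int commitWithRetry(Runnable commit, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RetriableException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                commit.run();
                return attempt; // commit succeeded
            } catch (RetriableException e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated commit: fails twice with a retriable error, then succeeds.
        int attempts = commitWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new RetriableException("The coordinator is not available.");
            }
        }, 5, 10L);
        System.out.println("committed after " + attempts + " attempts");
    }
}
```

Only the hypothetical retriable type is retried; any other exception propagates immediately, which keeps non-transient errors visible.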

Re: Re: checkpoint Kafka Offset commit failed

Posted by "zhangzq@eastcom-sw.com" <zh...@eastcom-sw.com>.
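Hi, I checked and the Kafka brokers have not been restarted; they have been running the whole time. I will try upgrading Kafka to see whether that helps.

Current versions: kafka-clients 2.6.2, Kafka server 2.11-2.1.1
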
In reply to: Matt Wang, 2023-05-06 21:13

Re: checkpoint Kafka Offset commit failed

Posted by Matt Wang <wa...@163.com>.
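Hi, this error looks like a retriable exception, but Flink has no retry logic for it [1]/[2]; it only logs the exception and records the corresponding metrics. Since your job already has checkpointing enabled, this WARN log has no practical impact. The community has discussed this issue before [3]/[4]. If the error is caused by a Kafka broker restart, you could try upgrading Kafka as described in [4].
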
1. https://github.com/apache/flink-connector-kafka/blob/main/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/reader/KafkaPartitionSplitReader.java#L249
2. https://github.com/apache/flink-connector-kafka/blob/main/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/reader/KafkaSourceReader.java#L149
3. https://issues.apache.org/jira/browse/FLINK-25293
4. https://issues.apache.org/jira/browse/FLINK-28060






--

Best,
Matt Wang
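
The behaviour described in the reply above (log the failure, count it, do not retry) can be pictured with a toy simulation. This is a sketch under stated assumptions, not Flink's implementation: `CheckpointCommitSketch`, `Result`, and `run` are hypothetical names, and the offsets are made-up values. It relies on one fact from the Flink Kafka source documentation: recovery uses the offsets stored in the checkpoint, and offsets committed to Kafka only expose consumer progress for monitoring.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckpointCommitSketch {

    /** Outcome of the simulated run. */
    static final class Result {
        final long checkpointedOffset; // offset stored in the latest checkpoint
        final long committedOffset;    // last offset the simulated broker accepted (-1 if none)
        final int failedCommits;       // stand-in for a failed-commit counter
        final List<String> warnings;   // stand-in for WARN log lines

        Result(long checkpointedOffset, long committedOffset, int failedCommits, List<String> warnings) {
            this.checkpointedOffset = checkpointedOffset;
            this.committedOffset = committedOffset;
            this.failedCommits = failedCommits;
            this.warnings = warnings;
        }
    }

    /**
     * Simulates one commit attempt per completed checkpoint.
     * A failed commit is logged and counted but never retried; the next checkpoint commits a newer offset.
     */
    static Result run(long[] offsetsAtCheckpoint, boolean[] coordinatorAvailable) {
        if (offsetsAtCheckpoint.length != coordinatorAvailable.length) {
            throw new IllegalArgumentException("both arrays must have the same length");
        }
        long checkpointed = -1L;
        long committed = -1L;
        int failed = 0;
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < offsetsAtCheckpoint.length; i++) {
            checkpointed = offsetsAtCheckpoint[i]; // the checkpoint itself always records the offset
            if (coordinatorAvailable[i]) {
                committed = offsetsAtCheckpoint[i];
            } else {
                failed++;
                warnings.add("Failed to commit consumer offsets for checkpoint " + (i + 1));
            }
        }
        return new Result(checkpointed, committed, failed, warnings);
    }

    public static void main(String[] args) {
        Result r = run(new long[] {100, 200, 300, 400}, new boolean[] {true, false, false, true});
        System.out.println("checkpointed=" + r.checkpointedOffset
                + " committed=" + r.committedOffset
                + " failedCommits=" + r.failedCommits);
    }
}
```

In this model the committed offset lags while the coordinator is unavailable and catches up on the next successful commit, while the checkpointed offset is never affected. External lag monitoring based on the consumer group would still see stale values during the failure window.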


In reply to: zhangzq@eastcom-sw.com, 2023-05-06 09:19

Re: Re: checkpoint Kafka Offset commit failed

Posted by "zhangzq@eastcom-sw.com" <zh...@eastcom-sw.com>.
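Hi, thanks for the reply.

The Flink cluster and the Kafka cluster are on the same network segment, and I have checked that the network is healthy.
On Flink 1.14, a "Time should be non negative" exception occurs once every few days. After the job restarts automatically, offsets are again committed automatically without problems.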

 java.lang.IllegalArgumentException: Time should be non negative
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
    at org.apache.flink.runtime.throughput.ThroughputEMA.calculateThroughput(ThroughputEMA.java:44)
    at org.apache.flink.runtime.throughput.ThroughputCalculator.calculateThroughput(ThroughputCalculator.java:80)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.debloat(StreamTask.java:792)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$null$4(StreamTask.java:784)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:809)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
    at java.lang.Thread.run(Thread.java:748)
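
The trace shows a precondition check in `ThroughputEMA.calculateThroughput` rejecting a negative time value on the `StreamTask.debloat` path. The thread does not establish why the value was negative. One general way an elapsed-time measurement can go negative is when it is derived from wall-clock readings and the clock is adjusted backward between them; that is an assumption offered for illustration, not a diagnosis of this job. The sketch below uses hypothetical names (`ElapsedTimeSketch`, `bytesPerSecond`) and made-up readings.

```java
final class ElapsedTimeSketch {

    /** Same shape as a typical precondition helper: throws if the condition is false. */
    static void checkArgument(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    /** Bytes per second; rejects a negative elapsed time, like the check seen in the stack trace. */
    static long bytesPerSecond(long bytes, long elapsedMillis) {
        checkArgument(elapsedMillis >= 0, "Time should be non negative");
        return elapsedMillis == 0 ? 0 : bytes * 1000 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Hypothetical wall-clock readings: the clock was set back 500 ms between them.
        long startWallMillis = 10_000L;
        long endWallMillis = 9_500L;
        try {
            bytesPerSecond(4096, endWallMillis - startWallMillis);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // System.nanoTime() is intended for measuring elapsed time rather than time of day.
        long startNanos = System.nanoTime();
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
        System.out.println("throughput: " + bytesPerSecond(4096, elapsedMillis) + " B/s");
    }
}
```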

In reply to: Shammon FY, 2023-05-05 09:48

Re: checkpoint Kafka Offset commit failed

Posted by Shammon FY <zj...@gmail.com>.
Hi

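This looks like a network problem that prevents the Flink job's source tasks from connecting to Kafka. You could check whether there is a network issue on the Kafka cluster side or on the nodes running the Flink job's source tasks.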

Best,
Shammon FY

In reply to: Leonard Xu, 2023-05-05 09:41

Re: checkpoint Kafka Offset commit failed

Posted by Leonard Xu <xb...@gmail.com>.
To unsubscribe from the user-zh@flink.apache.org mailing list, send an email with any content to user-zh-unsubscribe@flink.apache.org. For managing mailing list subscriptions, see [1].

Best regards,
Leonard
[1] https://flink.apache.org/zh/community/#%e9%82%ae%e4%bb%b6%e5%88%97%e8%a1%a8

In reply to: wuzhongxiu, 2023-05-04 21:00


Re: checkpoint Kafka Offset commit failed

Posted by wuzhongxiu <go...@163.com>.
Unsubscribe

In reply to: the original post by zhangzq@eastcom-sw.com