Posted to users@kafka.apache.org by "shenguanghui@unionpay.com" <sh...@unionpay.com> on 2019/11/19 10:59:00 UTC
partition gets under-replicated and stuck, describe command shows the leader is a dead broker id
Kafka partitions become under-replicated, with a single replica in the ISR, and do not recover.
I have 8 brokers (broker ids 0 to 7) and several topics, each with 3 replicas. One day broker 0 had a young GC pause of 3.29 seconds, and after that some partitions shrank their ISR from 3 replicas to 1. The log on broker 0 shows:
[2019-11-08 13:35:00,821] INFO Partition [dcs_async_redis_to_db,7] on broker 0: Shrinking ISR from 0,1,2 to 0 (kafka.cluster.Partition)
[2019-11-08 13:35:00,824] INFO Partition [__consumer_offsets,15] on broker 0: Shrinking ISR from 0,1,2 to 0,1 (kafka.cluster.Partition)
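To see at a glance which replicas fell out of the ISR, lines like the two above can be parsed mechanically. The following is only a minimal sketch: the regular expression is written against the message wording quoted in this post (Kafka 0.11), and the function name `parse_isr_shrink` is hypothetical, not part of Kafka.

```python
import re

# Matches the "Shrinking ISR" message exactly as quoted above (Kafka 0.11 wording).
# Other Kafka versions may phrase this message differently.
SHRINK_RE = re.compile(
    r"Partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] on broker (?P<broker>\d+): "
    r"Shrinking ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)


def parse_isr_shrink(line):
    """Return details of an ISR shrink log line, or None if the line does not match."""
    m = SHRINK_RE.search(line)
    if m is None:
        return None
    old_isr = [int(x) for x in m.group("old").split(",")]
    new_isr = [int(x) for x in m.group("new").split(",")]
    return {
        "topic": m.group("topic"),
        "partition": int(m.group("partition")),
        "broker": int(m.group("broker")),  # the broker that logged the shrink
        "old_isr": old_isr,
        "new_isr": new_isr,
        # Replicas that were in the ISR before the shrink but not after it.
        "dropped": [b for b in old_isr if b not in new_isr],
    }
```

Feeding a whole server.log through this function and grouping the results by topic and partition would list every partition that broker 0 shrank during the pause.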
There were many timeout exceptions on the producers during the GC pause. After a while, the other 7 brokers consistently logged the following:
[2019-11-08 13:35:24,241] WARN [ReplicaFetcherThread-0-0]: Error in fetch to broker 0, request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-7=(offset=44372693, logStartOffset=0, maxBytes=1048576), __consumer_offsets-15=(offset=78350976, logStartOffset=0, maxBytes=1048576), dcs_async_redis_to_db-7=(offset=758846267, logStartOffset=757998253, maxBytes=1048576)}) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
The log above is an excerpt from broker 1; the other brokers logged the same thing. What is stranger is that I tried to kill broker 0 with a normal kill, which failed, so I finally killed it with kill -9. Even after broker 0 was killed, partition [dcs_async_redis_to_db,7] still showed broker 0 as its leader when I described the topic from another broker with the --describe command. I am sure that broker 0 was dead at that time. Finally, after I restarted broker 0, the cluster returned to a correct state. There were some other incidents during the process, but I do not think they are related to the problem I am asking about.
I searched the Kafka issue tracker; some related issues are:
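The check described above (the --describe output still naming a dead broker as leader) can be automated by scanning that output. This is a minimal sketch under stated assumptions: the per-partition line layout in the comment is assumed from the usual output of kafka-topics.sh, the set of live broker ids must be supplied by the caller, and `suspicious_partitions` is a hypothetical helper name.

```python
import re

# Assumed per-partition line layout of `kafka-topics.sh --describe`:
#   Topic: <name>  Partition: <n>  Leader: <id>  Replicas: <ids>  Isr: <ids>
PARTITION_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def _ids(csv):
    """Turn '0,1,2' into [0, 1, 2]; an empty string yields an empty list."""
    return [int(x) for x in csv.split(",") if x]


def suspicious_partitions(describe_output, live_brokers):
    """Return (topic, partition, reasons) for partitions that look unhealthy.

    live_brokers is a set of broker ids the caller knows to be running;
    how that set is obtained is outside the scope of this sketch.
    """
    findings = []
    for line in describe_output.splitlines():
        m = PARTITION_RE.search(line)
        if m is None:
            continue  # topic summary lines and blank lines are skipped
        leader = int(m.group("leader"))
        replicas = _ids(m.group("replicas"))
        isr = _ids(m.group("isr"))
        reasons = []
        if len(isr) < len(replicas):
            reasons.append("under-replicated: ISR %s of replicas %s" % (isr, replicas))
        if leader not in live_brokers:
            reasons.append("leader %d is not a live broker" % leader)
        if reasons:
            findings.append((m.group("topic"), int(m.group("partition")), reasons))
    return findings
```

For the situation in this post, a line reporting partition 7 of dcs_async_redis_to_db with Leader 0 and Isr 0, checked against live brokers 1 to 7, would be flagged for both reasons.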
https://issues.apache.org/jira/browse/KAFKA-6582
https://issues.apache.org/jira/browse/KAFKA-4477
Issue 4477 is marked as fixed, but I cannot find the related commit, code, or patch. I would be grateful for your help. I have the Kafka logs for the whole period if you need them.
Shen Guanghui
China UnionPay, Technology Division, Cloud QuickPass (云闪付) Team
Tel: 20633284 | 13696519872
China UnionPay Campus, 1699 Gutang Road, Pudong New Area, Shanghai
Re: partition gets under-replicated and stuck, describe command shows the leader is a dead broker id
Posted by "shenguanghui@unionpay.com" <sh...@unionpay.com>.
Sorry, I forgot to mention the Kafka version: it is 0.11.0.