Posted to users@kafka.apache.org by "shenguanghui@unionpay.com" <sh...@unionpay.com> on 2019/11/19 10:59:00 UTC

partition gets under-replicated and stuck, describe command shows the leader is a dead broker id

Kafka partitions become under-replicated, with a single ISR, and do not recover.
I have 8 brokers (ids 0 to 7) and several topics, each with 3 replicas. One day broker 0 hit a young-GC pause of 3.29 seconds, and after that some partitions shrank their ISR from 3 replicas down to 1. The log is:

[2019-11-08 13:35:00,821] INFO Partition [dcs_async_redis_to_db,7] on broker 0: Shrinking ISR from 0,1,2 to 0 (kafka.cluster.Partition)
[2019-11-08 13:35:00,824] INFO Partition [__consumer_offsets,15] on broker 0: Shrinking ISR from 0,1,2 to 0,1 (kafka.cluster.Partition)
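
(For reference, the stuck partitions also show up when listing under-replicated partitions with the topic tool; the command below is only an illustration, with the ZooKeeper address as a placeholder:)

bin/kafka-topics.sh --zookeeper <zk-host>:2181 --describe --under-replicated-partitions
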
There were many timeout exceptions on the producers during the GC pause. A while later, the other 7 brokers kept logging the following:

[2019-11-08 13:35:24,241] WARN [ReplicaFetcherThread-0-0]: Error in fetch to broker 0, request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-7=(offset=44372693, logStartOffset=0, maxBytes=1048576), __consumer_offsets-15=(offset=78350976, logStartOffset=0, maxBytes=1048576), dcs_async_redis_to_db-7=(offset=758846267, logStartOffset=757998253, maxBytes=1048576)}) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

This log excerpt is from broker 1; the other brokers logged the same. What is stranger is that I tried to stop broker 0 but failed, and finally killed it with kill -9. Even after broker 0 had been killed, partition [dcs_async_redis_to_db,7] still showed broker 0 as its leader when I ran the --describe command on another broker. I am sure broker 0 was already dead at that time. Finally, after I restarted broker 0, the cluster returned to a correct state. There were some other incidents during the process, but I do not think they are related to the problem that confuses me.
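
(If it helps with the diagnosis, the same state can also be checked directly in ZooKeeper with zkCli.sh; these are the standard znodes Kafka keeps, and the ensemble address is a placeholder:)

bin/zkCli.sh -server <zk-host>:2181
  ls /brokers/ids
      -> live broker registrations; id 0 should disappear once broker 0 is killed
  get /brokers/topics/dcs_async_redis_to_db/partitions/7/state
      -> leader and ISR recorded for the stuck partition
  get /controller
      -> which broker currently holds the controller role
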

I searched the Kafka issues; the related ones I found are:
https://issues.apache.org/jira/browse/KAFKA-6582
https://issues.apache.org/jira/browse/KAFKA-4477

KAFKA-4477 is marked as fixed, but I cannot find the related commit, code, or patch. I would really appreciate your help. I have the Kafka logs for the whole period if you want them.



Shen Guanghui (沈光辉)
China UnionPay, Technology Division, 云闪付 (QuickPass) Team
Tel: 20633284 | 13696519872
China UnionPay Park, No. 1699 Gutang Road, Pudong New District, Shanghai


Re: partition gets under-replicated and stuck, describe command shows the leader is a dead broker id

Posted by "shenguanghui@unionpay.com" <sh...@unionpay.com>.
I am sorry, I forgot to mention the Kafka version: it is 0.11.0.



Shen Guanghui (沈光辉)
China UnionPay, Technology Division, 云闪付 (QuickPass) Team
Tel: 20633284 | 13696519872
China UnionPay Park, No. 1699 Gutang Road, Pudong New District, Shanghai

 
From: shenguanghui@unionpay.com
Sent: 2019-11-19 18:59
To: users
Subject: partition gets under-replicated and stuck, describe command shows the leader is a dead broker id
