Posted to dev@kafka.apache.org by "Rishank Chandra Puram (Jira)" <ji...@apache.org> on 2020/07/08 21:57:05 UTC

[jira] [Created] (KAFKA-10250) Kafka broker shrinks the ISRs and disconnects from other brokers for few seconds

Rishank Chandra Puram created KAFKA-10250:
---------------------------------------------

             Summary: Kafka broker shrinks the ISRs and disconnects from other brokers for few seconds
                 Key: KAFKA-10250
                 URL: https://issues.apache.org/jira/browse/KAFKA-10250
             Project: Kafka
          Issue Type: Bug
          Components: build, config, consumer, controller, log, producer 
    Affects Versions: 2.0.0
            Reporter: Rishank Chandra Puram


The following is a summary/overview of the whole issue. Could you please help us look into the below and let us know your thoughts on what caused the issue, and how to mitigate it in the future?

Issue:
1.	All of a sudden, all the other brokers in the cluster report issues with one of the brokers, as shown below (a sketch for checking the relevant replication settings follows this list)
            
Error for good broker [broker: broker2, brokerID: 1030]
[2020-06-27 16:14:53,367] WARN [ReplicaFetcher replicaId=1030, leaderId=1029, fetcherId=13] Error in response for fetch request (type=FetchRequest, replicaId=1030, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1901394956, epoch=1018699)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1029 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
 
Error from affected broker [broker: broker6, brokerID: 1029]
[2020-06-27 16:14:25,744] INFO [Partition topic_test2-33 broker=1029] Shrinking ISR from 1025,1029,1030 to 1025,1029 (kafka.cluster.Partition)
[2020-06-27 16:14:25,752] INFO [Partition topic_a_restate-34 broker=1029] Shrinking ISR from 1027,1029,1028 to 1029 (kafka.cluster.Partition)
[2020-06-27 16:14:25,760] INFO [Partition topic_f_restate-39 broker=1029] Shrinking ISR from 1026,1029,1025 to 1029,1025 (kafka.cluster.Partition)
[2020-06-27 16:14:25,772] INFO [Partition topic_test2-16 broker=1029] Shrinking ISR from 1028,1029,1030 to 1029,1030 (kafka.cluster.Partition)
[2020-06-27 16:14:26,683] INFO [ProducerStateManager partition=topic_abc_f-21] Writing producer snapshot at offset 1509742173 (kafka.log.ProducerStateManager)
[2020-06-27 16:14:26,683] INFO [Log partition=topic_abc_f-21, dir=/hadoop-e/kafka/data1] Rolled new log segment at offset 1509742173 in 1 ms. (kafka.log.Log)
[2020-06-27 16:14:55,375] INFO [Partition topic_test2-33 broker=1029] Expanding ISR from 1025,1029 to 1025,1029,1030 (kafka.cluster.Partition)
[2020-06-27 16:14:55,387] INFO [Partition test-topic-analysis-9 broker=1029] Expanding ISR from 1028,1029 to 1028,1029,1030 (kafka.cluster.Partition)
 
2.	The entire Kafka cluster becomes stable within a few minutes
 
Trace for good broker [broker: broker2, brokerID: 1030]
[2020-06-27 16:20:14,861] INFO Deleted time index /hadoop-g/kafka/data1/topic-analysis-0/00000000009100172512.timeindex.deleted. (kafka.log.LogSegment)
[2020-06-27 16:20:14,882] INFO [Log partition=topic-analysis-4, dir=/hadoop-b/kafka/data1] Deleting segment 9100010843 (kafka.log.Log)
[2020-06-27 16:20:14,883] INFO Deleted log /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.log.deleted. (kafka.log.LogSegment)
[2020-06-27 16:20:14,897] INFO Deleted offset index /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.index.deleted. (kafka.log.LogSegment)
[2020-06-27 16:20:14,897] INFO Deleted time index /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.timeindex.deleted. (kafka.log.LogSegment)
 
Trace from affected broker [broker: broker6, brokerID: 1029]
[2020-06-27 16:21:01,552] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9099830344, size 1073733482] for deletion. (kafka.log.Log)
[2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9100082067, size 1073738790] for deletion. (kafka.log.Log)
[2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Incrementing log start offset to 9100334713 (kafka.log.Log)
[2020-06-27 16:21:01,581] INFO Cleared earliest 0 entries from epoch cache based on passed offset 9100334713 leaving 1 in EpochFile for partition topic-analysis-22 (kafka.server.epoch.LeaderEpochFileCache)
[2020-06-27 16:22:00,175] INFO [Log partition=topic_def_c-1, dir=/hadoop-j/kafka/data1] Deleting segment 1628853412 (kafka.log.Log)
[2020-06-27 16:22:00,175] INFO Deleted log /hadoop-j/kafka/data1/topic_def_c-1/00000000001628853412.log.deleted. (kafka.log.LogSegment)
 
3.	Once the disconnect happens, all the producer jobs fail with "Caused by: org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for topic-analysis-10"
a.	If we describe the topics, all the topic partitions the jobs complain about have their leader on the affected broker [in this case, global_auth-10 has its leader on broker 1029]; a sketch for listing such partitions follows this list
4.	This continues until the affected broker is restarted. Once the broker is restarted, all the jobs run fine
5.	During the time between the broker becoming affected and our restarting it
a.	We see a lot of ISR shrinks and expansions in Grafana, and they continue until the affected broker is restarted, but the shrinks/expands cannot be seen in the logs (a sketch for reading the shrink/expand rates over JMX follows this list)
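
For context on item 1: a follower is dropped from the ISR when it has not caught up within replica.lag.time.max.ms, and a replica fetch connection that exceeds replica.socket.timeout.ms is torn down, which matches the disconnect/shrink/expand pattern in the logs. The snippet below is only a minimal sketch of reading those two settings from the affected broker with the Java AdminClient; the bootstrap address is a placeholder, and broker id 1029 is taken from the logs above.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckReplicationConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; replace with a real broker in the cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker6:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Broker id 1029 is the affected broker from the logs above.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1029");
            Config config = admin.describeConfigs(Collections.singleton(broker)).all().get().get(broker);
            // A follower is removed from the ISR if it has not caught up within this interval.
            System.out.println("replica.lag.time.max.ms   = " + config.get("replica.lag.time.max.ms").value());
            // Replica fetch requests that exceed this socket timeout cause the connection to be dropped.
            System.out.println("replica.socket.timeout.ms = " + config.get("replica.socket.timeout.ms").value());
        }
    }
}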
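
For item 3a, the same check can be scripted instead of describing each topic by hand. A minimal sketch with the Java AdminClient; the topic names and bootstrap address are placeholders, and broker id 1029 is the affected broker from the logs:

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class FindPartitionsLedByBroker {
    public static void main(String[] args) throws Exception {
        final int affectedBrokerId = 1029;  // affected broker from the logs above
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker6:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Topic names here are placeholders; use the topics the producer jobs write to.
            Map<String, TopicDescription> topics =
                admin.describeTopics(Arrays.asList("topic-analysis", "global_auth")).all().get();
            topics.forEach((name, description) ->
                description.partitions().stream()
                    .filter(p -> p.leader() != null && p.leader().id() == affectedBrokerId)
                    .forEach(p -> System.out.println(name + "-" + p.partition()
                        + " led by broker " + affectedBrokerId + ", isr="
                        + p.isr().stream().map(n -> n.id()).collect(Collectors.toList()))));
        }
    }
}

If the affected broker leads most of the listed partitions, that is consistent with the producers timing out on metadata for exactly those partitions.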
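
For item 5a, the broker exposes ISR shrink/expand rates as JMX meters (kafka.server:type=ReplicaManager,name=IsrShrinksPerSec / IsrExpandsPerSec), which is presumably what Grafana is charting even when nothing shows up in the logs. A minimal sketch for reading them directly, assuming JMX is enabled on the affected broker; the host and port 9999 are placeholders:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class IsrShrinkExpandRates {
    public static void main(String[] args) throws Exception {
        // Assumed JMX endpoint of the affected broker (broker6); port 9999 is a placeholder.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker6:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName shrinks = new ObjectName("kafka.server:type=ReplicaManager,name=IsrShrinksPerSec");
            ObjectName expands = new ObjectName("kafka.server:type=ReplicaManager,name=IsrExpandsPerSec");
            // These meters expose Count and per-minute rates via JMX.
            System.out.println("IsrShrinksPerSec OneMinuteRate = " + mbs.getAttribute(shrinks, "OneMinuteRate"));
            System.out.println("IsrExpandsPerSec OneMinuteRate = " + mbs.getAttribute(expands, "OneMinuteRate"));
        }
    }
}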




--
This message was sent by Atlassian Jira
(v8.3.4#803005)