Posted to jira@kafka.apache.org by "Rishank Chandra Puram (Jira)" <ji...@apache.org> on 2020/07/08 22:10:05 UTC

[jira] [Comment Edited] (KAFKA-10250) Kafka broker shrinks the ISRs and disconnects from other brokers for few seconds

    [ https://issues.apache.org/jira/browse/KAFKA-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154043#comment-17154043 ] 

Rishank Chandra Puram edited comment on KAFKA-10250 at 7/8/20, 10:09 PM:
-------------------------------------------------------------------------

1. When the issue occurs, and once the affected broker reconnects to the other brokers, we are able to produce/consume messages to/from Kafka using a simple command line [even though the affected broker has not been restarted]; see the sketch after this list.
2. The issue occurred suddenly on broker 1029 but could just as well occur on any other broker; the behavior is completely random, as we have had issues on two different brokers at different times.
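
As a minimal sketch of the command-line check in point 1 (the topic name and broker address are placeholders, not our real ones):

    # write a few test records via the affected broker
    kafka-console-producer.sh --broker-list broker6:9092 --topic test_topic
    # read them back to confirm the produce/consume path works
    kafka-console-consumer.sh --bootstrap-server broker6:9092 --topic test_topic --from-beginning --max-messages 10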

[2020-06-27 16:14:23,800] INFO [GroupCoordinator 1029]: Member consumer-1-3dfeffb9-c8ca-4840-878e-3126e9750015 in group authapiprod02062020 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2020-06-27 16:14:23,802] INFO [GroupCoordinator 1029]: Stabilized group authapiprod02062020 generation 178 (__consumer_offsets-40) (kafka.coordinator.group.GroupCoordinator)
[2020-06-27 16:14:23,805] INFO [GroupCoordinator 1029]: Assignment received from leader for group authapiprod02062020 for generation 178 (kafka.coordinator.group.GroupCoordinator)
[2020-06-27 16:14:25,494] INFO [Partition topic_rejected_test-30 broker=1029] Shrinking ISR from 1025,1029,1030 to 1025,1029 (kafka.cluster.Partition)
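
While the group above is rebalancing, its state can be inspected with the standard tool (a sketch; the bootstrap address is a placeholder):

    # show members, assignments, and lag for the rebalancing group
    kafka-consumer-groups.sh --bootstrap-server broker6:9092 --describe --group authapiprod02062020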

About 30 seconds after the shrinks we see the corresponding ISR expansion messages:

[2020-06-27 16:14:25,772] INFO [Partition topic_test2-16 broker=1029] Shrinking ISR from 1028,1029,1030 to 1029,1030 (kafka.cluster.Partition)
[2020-06-27 16:14:55,375] INFO [Partition topic_analysis-33 broker=1029] Expanding ISR from 1025,1029 to 1025,1029,1030 (kafka.cluster.Partition)
[2020-06-27 16:14:55,387] INFO [Partition topic_rejected-9 broker=1029] Expanding ISR from 1028,1029 to 1028,1029,1030 (kafka.cluster.Partition)
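
The roughly 30-second shrink/expand cycle is consistent with the leader-side lag check: a follower is dropped from the ISR once it has not caught up within replica.lag.time.max.ms, and re-added as soon as it catches up again. As a sketch, the relevant broker settings (the values shown are the 2.0.0 defaults, listed as assumptions to verify against server.properties, not confirmed production values):

    # server.properties excerpt (2.0.0 defaults)
    replica.lag.time.max.ms=10000      # follower removed from ISR after 10s behind the leader
    num.replica.fetchers=1             # fetcher threads per source broker
    replica.socket.timeout.ms=30000    # socket timeout for replica fetch requests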


What we reviewed during the issue:

- lsof --> ~75000 open file descriptors out of a limit of 200000
- No high CPU seen in Grafana during the issue
- Thread dumps do not show any stuck threads
- iostat looks fine
- During the issue we checked for --under-replicated-partitions, but after broker 1029 expanded the ISRs again, no under-replicated partitions were seen
- We reviewed the GC log files and memory looks fine; we currently have a 12 GB heap configured and the live set is about 4 GB
- netstat output does not show any abnormal behavior; this is from broker 1029 (the commands behind this checklist are sketched after the output below):

    50659539242 total packets received
    52633502320 requests sent out
    88095112 segments retransmited
    0 receive buffer errors
    0 send buffer errors
    379 resets received for embryonic SYN_RECV sockets
    629655 TCP sockets finished time wait in fast timer
    550847 delayed acks further delayed because of locked socket
    11151 times the listen queue of a socket overflowed
    11151 SYNs to LISTEN sockets dropped
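
For reproducibility, a sketch of the kinds of commands behind the checklist above (paths, the ZooKeeper address, and the process match are placeholders; flags correspond to Kafka 2.0-era tooling):

    # open file descriptors of the broker JVM vs. its per-process limit
    lsof -p "$(pgrep -f kafka.Kafka)" | wc -l
    grep "open files" "/proc/$(pgrep -f kafka.Kafka)/limits"
    # per-device I/O utilization and latency, 5 one-second samples
    iostat -x 1 5
    # under-replicated partitions (kafka-topics still uses --zookeeper in 2.0)
    kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions
    # JVM thread dump and GC utilization snapshot
    jstack "$(pgrep -f kafka.Kafka)" > /tmp/broker-1029.jstack
    jstat -gcutil "$(pgrep -f kafka.Kafka)" 1000 5
    # protocol-level counters (the numbers shown above)
    netstat -s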


> Kafka broker shrinks the ISRs and disconnects from other brokers for few seconds
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10250
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10250
>             Project: Kafka
>          Issue Type: Bug
>          Components: build, config, consumer, controller, log, producer 
>    Affects Versions: 2.0.0
>            Reporter: Rishank Chandra Puram
>            Priority: Major
>
> The following is a summary/overview of the whole issue. Can you please help us look into the below and let us know your thoughts on what caused the issue, and how to mitigate it in the future?
> Issue:
> 1.	All of a sudden, all the other brokers in the cluster report issues with one of the brokers, as below
>             
> Error for good broker [broker: broker2, brokerID: 1030]
> [2020-06-27 16:14:53,367] WARN [ReplicaFetcher replicaId=1030, leaderId=1029, fetcherId=13] Error in response for fetch request (type=FetchRequest, replicaId=1030, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1901394956, epoch=1018699)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 1029 was disconnected before the response was read
>         at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
>         at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
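> As a sketch of how the two sides can be correlated (the log path is a placeholder for the actual broker log directory):
> 
>     # on the good broker: fetcher disconnects toward 1029
>     grep "Connection to 1029 was disconnected" /var/log/kafka/server.log
>     # on the affected broker: ISR churn in the same window
>     grep -E "Shrinking ISR|Expanding ISR" /var/log/kafka/server.log
> 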
> Error from affected broker [broker: broker6, brokerID: 1029]
> [2020-06-27 16:14:25,744] INFO [Partition topic_test2-33 broker=1029] Shrinking ISR from 1025,1029,1030 to 1025,1029 (kafka.cluster.Partition)
> [2020-06-27 16:14:25,752] INFO [Partition topic_a_restate-34 broker=1029] Shrinking ISR from 1027,1029,1028 to 1029 (kafka.cluster.Partition)
> [2020-06-27 16:14:25,760] INFO [Partition topic_f_restate-39 broker=1029] Shrinking ISR from 1026,1029,1025 to 1029,1025 (kafka.cluster.Partition)
> [2020-06-27 16:14:25,772] INFO [Partition topic_test2-16 broker=1029] Shrinking ISR from 1028,1029,1030 to 1029,1030 (kafka.cluster.Partition)
> [2020-06-27 16:14:26,683] INFO [ProducerStateManager partition=topic_abc_f-21] Writing producer snapshot at offset 1509742173 (kafka.log.ProducerStateManager)
> [2020-06-27 16:14:26,683] INFO [Log partition=topic_abc_f-21, dir=/hadoop-e/kafka/data1] Rolled new log segment at offset 1509742173 in 1 ms. (kafka.log.Log)
> [2020-06-27 16:14:55,375] INFO [Partition topic_test2-33 broker=1029] Expanding ISR from 1025,1029 to 1025,1029,1030 (kafka.cluster.Partition)
> [2020-06-27 16:14:55,387] INFO [Partition test-topic-analysis-9 broker=1029] Expanding ISR from 1028,1029 to 1028,1029,1030 (kafka.cluster.Partition)
>  
> 2.	The entire Kafka cluster becomes stable within a few minutes
>  
> Trace for good broker [broker: broker2, brokerID: 1030]
> [2020-06-27 16:20:14,861] INFO Deleted time index /hadoop-g/kafka/data1/topic-analysis-0/00000000009100172512.timeindex.deleted. (kafka.log.LogSegment)
> [2020-06-27 16:20:14,882] INFO [Log partition=topic-analysis-4, dir=/hadoop-b/kafka/data1] Deleting segment 9100010843 (kafka.log.Log)
> [2020-06-27 16:20:14,883] INFO Deleted log /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.log.deleted. (kafka.log.LogSegment)
> [2020-06-27 16:20:14,897] INFO Deleted offset index /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.index.deleted. (kafka.log.LogSegment)
> [2020-06-27 16:20:14,897] INFO Deleted time index /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.timeindex.deleted. (kafka.log.LogSegment)
>  
> Trace from affected broker [broker: broker6, brokerID: 1029]
> [2020-06-27 16:21:01,552] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9099830344, size 1073733482] for deletion. (kafka.log.Log)
> [2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9100082067, size 1073738790] for deletion. (kafka.log.Log)
> [2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Incrementing log start offset to 9100334713 (kafka.log.Log)
> [2020-06-27 16:21:01,581] INFO Cleared earliest 0 entries from epoch cache based on passed offset 9100334713 leaving 1 in EpochFile for partition topic-analysis-22 (kafka.server.epoch.LeaderEpochFileCache)
> [2020-06-27 16:22:00,175] INFO [Log partition=topic_def_c-1, dir=/hadoop-j/kafka/data1] Deleting segment 1628853412 (kafka.log.Log)
> [2020-06-27 16:22:00,175] INFO Deleted log /hadoop-j/kafka/data1/topic_def_c-1/00000000001628853412.log.deleted. (kafka.log.LogSegment)
>  
> 3.	Once the disconnect happens, all the producer jobs fail with "Caused by: org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for topic-analysis-10" (see the producer-side sketch after this list)
> a.	If we describe the topics, all the topic partitions the jobs complain about have their leader on the affected broker [here, topic-analysis-10 has its leader on broker 1029]
> 4.	This continues until the affected broker is restarted. Once the broker is restarted, all the jobs run fine
> 5.	Between the time the broker is affected and the time we restart it:
> a.	We see a lot of ISR shrinks and expansions happening as per Grafana. These shrinks/expansions are observed until the affected broker is restarted, but they cannot be seen in the logs.
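> 
> For point 3, the batch expiry is producer-side. A minimal sketch of the settings that bound it (illustrative values, not our production ones; on 2.0.x clients accumulator batch expiry is tied to request.timeout.ms, while 2.1+ clients use delivery.timeout.ms instead):
> 
>     # producer.properties excerpt (illustrative)
>     request.timeout.ms=30000      # upper bound on how long a batch waits on 2.0.x clients
>     retries=5                     # retry sends that fail during leader movement
>     metadata.max.age.ms=60000     # refresh metadata at least this often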


