Posted to users@kafka.apache.org by Swathi Mocharla <sw...@gmail.com> on 2022/12/08 17:23:27 UTC

Brokers not re-joining the ISR list and stuck at "started" until all the brokers restart

Hi,
We are upgrading a 3-broker cluster (1001, 1002, 1003) from 3.1.0 to 3.2.0.
During the upgrade, we noticed that when 1003 is restarted it does not re-join
the ISR list and the broker is stuck. The same happens with 1002. Only when
1001 is restarted do 1003 and 1002 re-join the ISR list and start
replicating data.
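For reference, here is a minimal AdminClient sketch of how the leader/ISR state
can be watched from the client side during the rolling restart (the bootstrap
address is a placeholder, not taken from our setup; "test" is the topic shown in
the snapshot further down):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Collections;
import java.util.Properties;

public class IsrWatch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder bootstrap address -- substitute the real broker service name
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("test"))
                    .allTopicNames().get().get("test");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader() can be null while the partition has no leader
                System.out.printf("partition=%d leader=%s isr=%s replicas=%s%n",
                        p.partition(), p.leader(), p.isr(), p.replicas());
            }
        }
    }
}

This prints the same leader/ISR information that the describe output further
down shows.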

{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:07:30.386", "timezone":"UTC",
"log":{"message":"main - kafka.server.KafkaServer - [KafkaServer
id=1003] started"}}
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:07:30.442", "timezone":"UTC",
"log":{"message":"data-plane-kafka-request-handler-1 -
state.change.logger - [Broker id=1003] Add 397 partitions and deleted
0 partitions from metadata cache in response to UpdateMetadata request
sent by controller 1002 epoch 18 with correlation id 0"}}
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:07:30.448", "timezone":"UTC",
"log":{"message":"BrokerToControllerChannelManager broker=1003
name=alterIsr - kafka.server.BrokerToControllerRequestThread -
[BrokerToControllerChannelManager broker=1003 name=alterIsr]: Recorded
new controller, from now on will use broker
kf-pl47-me8-1.kf-pl47-me8-headless.nc0968-admin-ns.svc.cluster.local:9092
(id: 1002 rack: null)"}}
{"type":"log", "host":"kf-pl47-me8-2", "level":"ERROR",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:07:30.451", "timezone":"UTC",
"log":{"message":"data-plane-kafka-network-thread-1003-ListenerName(PLAINTEXT)-PLAINTEXT-1
- kafka.network.Processor - Closing socket for
192.168.216.11:9092-192.168.199.100:53778-0 because of error"}}
org.apache.kafka.common.errors.InvalidRequestException: Error getting
request for apiKey: LEADER_AND_ISR, apiVersion: 6, connectionId:
192.168.216.11:9092-192.168.199.100:53778-0, listenerName:
ListenerName(PLAINTEXT), principal: User:ANONYMOUS
org.apache.kafka.common.errors.InvalidRequestException: Error getting
request for apiKey: LEADER_AND_ISR, apiVersion: 6, connectionId:
192.168.216.11:9092-192.168.235.153:46282-461, listenerName:
ListenerName(PLAINTEXT), principal: User:ANONYMOUS
Caused by: org.apache.kafka.common.errors.UnsupportedVersionException:
Can't read version 6 of LeaderAndIsrTopicState
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:12:50.916", "timezone":"UTC",
"log":{"message":"controller-event-thread -
kafka.controller.KafkaController - [Controller id=1003] 1003
successfully elected as the controller. Epoch incremented to 20 and
epoch zk version is now 20"}}
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:12:50.917", "timezone":"UTC",
"log":{"message":"controller-event-thread -
kafka.controller.KafkaController - [Controller id=1003] Registering
handlers"}}

Note: Unclean leader election is not enabled.

This was possibly introduced by KAFKA-13587
<https://issues.apache.org/jira/browse/KAFKA-13587>.
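
As far as I can tell, the LeaderAndIsr request version the controller sends
depends on its inter.broker.protocol.version, so one thing that might be worth
checking is whether all three brokers report the same
inter.broker.protocol.version while the rolling restart is in progress. A
minimal AdminClient sketch for that check (the bootstrap address is a
placeholder, not taken from our setup):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;
import java.util.Collections;
import java.util.Properties;

public class BrokerIbpCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder bootstrap address -- substitute the real broker service name
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // broker ids taken from our cluster (1001, 1002, 1003)
            for (String brokerId : new String[]{"1001", "1002", "1003"}) {
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, brokerId);
                Config cfg = admin.describeConfigs(Collections.singleton(broker))
                        .all().get().get(broker);
                System.out.println("broker " + brokerId + " inter.broker.protocol.version = "
                        + cfg.get("inter.broker.protocol.version").value());
            }
        }
    }
}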

In the snapshot below, taken during the upgrade at 16:05:15 UTC on Dec 7, 2022,
1001 was restarting while 1002 and 1003 were already up and running (already
upgraded from 3.1.0 to 3.2.0), but they had not managed to re-join the ISRs.

Wed Dec  7 16:05:15 UTC 2022
Topic: test     TopicId: L6Yj_Nf9RrirNhFQzvXODw PartitionCount: 2       ReplicationFactor: 3    Configs: compression.type=producer,min.insync.replicas=1,cleanup.policy=delete,flush.ms=1000,segment.bytes=100000000,flush.messages=10000,max.message.bytes=1000012,index.interval.bytes=4096,unclean.leader.election.enable=false,retention.bytes=1000000000,segment.index.bytes=10485760
        Topic: test     Partition: 0    Leader: none    Replicas: 1002,1003,1001        Isr: 1001
        Topic: test     Partition: 1    Leader: none    Replicas: 1001,1002,1003        Isr: 1001
Wed Dec  7 16:05:33 UTC 2022
Topic: test     TopicId: L6Yj_Nf9RrirNhFQzvXODw PartitionCount: 2       ReplicationFactor: 3    Configs: compression.type=producer,min.insync.replicas=1,cleanup.policy=delete,flush.ms=1000,segment.bytes=100000000,flush.messages=10000,max.message.bytes=1000012,index.interval.bytes=4096,unclean.leader.election.enable=false,retention.bytes=1000000000,segment.index.bytes=10485760
        Topic: test     Partition: 0    Leader: 1001    Replicas: 1002,1003,1001        Isr: 1001,1002,1003
        Topic: test     Partition: 1    Leader: 1001    Replicas: 1001,1002,1003        Isr: 1001,1002,1003

Is there anything the user needs to do explicitly to work around this issue?

Thank you,

Swathi