You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Daniel (Jira)" <ji...@apache.org> on 2021/10/27 09:42:00 UTC

[jira] [Created] (KAFKA-13407) Kafka controller out of service after ZK leader restart

Daniel created KAFKA-13407:
------------------------------

             Summary: Kafka controller out of service after ZK leader restart
                 Key: KAFKA-13407
                 URL: https://issues.apache.org/jira/browse/KAFKA-13407
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.8.1, 2.8.0
         Environment: Ubuntu 20.04
            Reporter: Daniel


When the Zookeeper leader disappears, a new instance becomes the leader, the instances need to reconnect to Zookeeper, but the Kafka "Controller" gets lost in limbo state after re-establishing connection.

See below for how I manage to reproduce this over and over.

*Prerequisites*

Have a Kafka cluster with 3 instances running version 2.8.1. Figure out which one is the Controller. I'm using Kafkacat 1.5.0 and get this info using the `-L` flag.

Zookeeper runs with 3 instances on version 3.5.9. Figure out which one is leader by checking

 
{code:java}
echo stat | nc -v localhost 2181
{code}
 

 

*Reproduce*

1. Stop the leader Zookeeper service.

2. Watch the logs of the Kafka Controller and ensure that it reconnects and registers again.

 
{code:java}
Oct 27 09:13:08 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:08,882] INFO Unable to read additional data from server sessionid 0x1f2a12870003, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] WARN SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/opt/kafka/config/kafka_server_jaas.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. (org.apache.zookeeper.ClientCnxn)
Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] INFO Opening socket connection to server zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 (org.apache.zookeeper.ClientCnxn)
Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] ERROR [ZooKeeperClient Kafka server] Auth failed. (kafka.zookeeper.ZooKeeperClient)
Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,549] INFO Socket connection established, initiating session, client: /10.10.85.215:39338, server: zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 (org.apache.zookeeper.ClientCnxn)
Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,569] INFO Session establishment complete on server zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181, sessionid = 0x1f2a12870003, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,548] INFO [ZooKeeperClient Kafka server] Reinitializing due to auth failure. (kafka.zookeeper.ZooKeeperClient)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO [PartitionStateMachine controllerId=1003] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO [ReplicaStateMachine controllerId=1003] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,556] INFO Processing notification(s) to /config/changes (kafka.common.ZkNodeChangeNotificationListener)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO [Controller id=1003] Resigned (kafka.controller.KafkaController)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO Creating /brokers/ids/1003 (is it secure? false) (kafka.zk.KafkaZkClient)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,560] INFO Updated cache from existing FinalizedFeaturesAndEpoch(features=Features{}, epoch=1) to latest FinalizedFeaturesAndEpoch(features=Features{}, epoch=1). (kafka.server.FinalizedFeatureCache)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO Stat of the created znode at /brokers/ids/1003 is: 128849019044,128849019044,1635323000996,1635323000996,1,0,0,34265559924739,389,0,128849019044
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]:  (kafka.zk.KafkaZkClient)
Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO Registered broker 1003 at path /brokers/ids/1003 with addresses: PLAINTEXT://kafka-tr-2.node.consul.lab.aws.blue.example.net:9093,SASL_SSL://kafka-tr-2.node.consul.lab.aws.blue.example.net:9092, czxid (broker epoch): 128849019044 (kafka.zk.KafkaZkClient)
{code}
 

 

Now it is in this limbo state but to notice the problem create a topic for example.

 
{code:java}
./kafka-topics.sh --zookeeper zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr --create --topic danols-ts-4 --partitions 10 --replication-factor 3
Created topic danols-ts-4.
{code}
 

 

Then when describing the topic it looks like this:

 
{code:java}
./kafka-topics.sh --zookeeper zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr --describe --topic danols-ts-4
Topic: danols-ts-4	PartitionCount: 10	ReplicationFactor: 3	Configs: 
	Topic: danols-ts-4	Partition: 0	Leader: none	Replicas: 1003,1001,1002	Isr: 
	Topic: danols-ts-4	Partition: 1	Leader: none	Replicas: 1001,1002,1003	Isr: 
	Topic: danols-ts-4	Partition: 2	Leader: none	Replicas: 1002,1003,1001	Isr: 
	Topic: danols-ts-4	Partition: 3	Leader: none	Replicas: 1003,1002,1001	Isr: 
	Topic: danols-ts-4	Partition: 4	Leader: none	Replicas: 1001,1003,1002	Isr: 
	Topic: danols-ts-4	Partition: 5	Leader: none	Replicas: 1002,1001,1003	Isr: 
	Topic: danols-ts-4	Partition: 6	Leader: none	Replicas: 1003,1001,1002	Isr: 
	Topic: danols-ts-4	Partition: 7	Leader: none	Replicas: 1001,1002,1003	Isr: 
	Topic: danols-ts-4	Partition: 8	Leader: none	Replicas: 1002,1003,1001	Isr: 
	Topic: danols-ts-4	Partition: 9	Leader: none	Replicas: 1003,1002,1001	Isr: 

{code}
 

 

If I describe using the --bootstrap-server flag instead it doesn't exist
{code:java}
./kafka-topics.sh --bootstrap-server localhost:9093 --describe --topic danols-ts-4
Error while executing topic command : Topic 'danols-ts-4' does not exist as expected
[2021-10-27 09:24:15,691] ERROR java.lang.IllegalArgumentException: Topic 'danols-ts-4' does not exist as expected
	at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:542)
	at kafka.admin.TopicCommand$AdminClientTopicService.describeTopic(TopicCommand.scala:317)
	at kafka.admin.TopicCommand$.main(TopicCommand.scala:69)
	at kafka.admin.TopicCommand.main(TopicCommand.scala)
 (kafka.admin.TopicCommand$)
{code}
If I restart the Kafka service on a broker that is *not* "Controller" it doesn't come up again. It just spams the following in the log for all different partitions and it's not possible to connect to it.
{code:java}
Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0.
Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:48,984] INFO [Admin Manager on Broker 1001]: Error processing create topic request CreatableTopic(name='example-message-consumer', numPartitions=10, replicationFactor=3, assignments=[], configs=[]) (kafka.server.ZkAdminManager)
Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0.
Oct 27 09:27:49 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:49,021] INFO [Admin Manager on Broker 1001]: Error processing create topic request CreatableTopic(name='__consumer_offsets', numPartitions=50, replicationFactor=3, assignments=[], configs=[CreateableTopicConfig(name='compression.type', value='producer'), CreateableTopicConfig(name='cleanup.policy', value='compact'), CreateableTopicConfig(name='segment.bytes', value='104857600')]) (kafka.server.ZkAdminManager)
{code}
To get out of this state I simply restart the Kafka service on the "Controller". It will make another become "Controller" and all brokers becomes OK again and the newly created topic gets all expected ISRs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)