You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Jun Rao (Jira)" <ji...@apache.org> on 2022/02/24 22:11:00 UTC

[jira] [Commented] (KAFKA-13407) Kafka controller out of service after ZK leader restart

    [ https://issues.apache.org/jira/browse/KAFKA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497760#comment-17497760 ] 

Jun Rao commented on KAFKA-13407:
---------------------------------

[~Olsson] : Thanks for the reply. I think we fixed this issue in https://issues.apache.org/jira/browse/KAFKA-13461. Basically, when there is no JAAS configured for ZK client and the ZK client tries to establish a new connection, the client will first receive an AUTH_FAIL event. However, this doesn't mean that the ZK client's session is gone since the client will retry the connection without auth, which typically succeeds. Previously, we mistakenly try to reinitialize the controller with the AUTH_FAIL event, which causes the controller to resign but not regain the controllership since the underlying session is still valid. https://issues.apache.org/jira/browse/KAFKA-1346 fixes that issue.

 

I am closing the issue for now. Feel free to reopen it if that doesn't fix the issue.

> Kafka controller out of service after ZK leader restart
> -------------------------------------------------------
>
>                 Key: KAFKA-13407
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13407
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.8.1
>         Environment: Ubuntu 20.04
>            Reporter: Daniel
>            Priority: Critical
>
> When the Zookeeper leader disappears, a new instance becomes the leader, the instances need to reconnect to Zookeeper, but the Kafka "Controller" gets lost in limbo state after re-establishing connection.
> See below for how I manage to reproduce this over and over.
> *Prerequisites*
> Have a Kafka cluster with 3 instances running version 2.8.1. Figure out which one is the Controller. I'm using Kafkacat 1.5.0 and get this info using the `-L` flag.
> Zookeeper runs with 3 instances on version 3.5.9. Figure out which one is leader by checking
>  
> {code:java}
> echo stat | nc -v localhost 2181
> {code}
>  
>  
> *Reproduce*
> 1. Stop the leader Zookeeper service.
> 2. Watch the logs of the Kafka Controller and ensure that it reconnects and registers again.
>  
> {code:java}
> Oct 27 09:13:08 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:08,882] INFO Unable to read additional data from server sessionid 0x1f2a12870003, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] WARN SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/opt/kafka/config/kafka_server_jaas.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] INFO Opening socket connection to server zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] ERROR [ZooKeeperClient Kafka server] Auth failed. (kafka.zookeeper.ZooKeeperClient)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,549] INFO Socket connection established, initiating session, client: /10.10.85.215:39338, server: zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,569] INFO Session establishment complete on server zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181, sessionid = 0x1f2a12870003, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,548] INFO [ZooKeeperClient Kafka server] Reinitializing due to auth failure. (kafka.zookeeper.ZooKeeperClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO [PartitionStateMachine controllerId=1003] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO [ReplicaStateMachine controllerId=1003] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,556] INFO Processing notification(s) to /config/changes (kafka.common.ZkNodeChangeNotificationListener)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO [Controller id=1003] Resigned (kafka.controller.KafkaController)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO Creating /brokers/ids/1003 (is it secure? false) (kafka.zk.KafkaZkClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,560] INFO Updated cache from existing FinalizedFeaturesAndEpoch(features=Features{}, epoch=1) to latest FinalizedFeaturesAndEpoch(features=Features{}, epoch=1). (kafka.server.FinalizedFeatureCache)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO Stat of the created znode at /brokers/ids/1003 is: 128849019044,128849019044,1635323000996,1635323000996,1,0,0,34265559924739,389,0,128849019044
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]:  (kafka.zk.KafkaZkClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO Registered broker 1003 at path /brokers/ids/1003 with addresses: PLAINTEXT://kafka-tr-2.node.consul.lab.aws.blue.example.net:9093,SASL_SSL://kafka-tr-2.node.consul.lab.aws.blue.example.net:9092, czxid (broker epoch): 128849019044 (kafka.zk.KafkaZkClient)
> {code}
>  
>  
> Now it is in this limbo state but to notice the problem create a topic for example.
>  
> {code:java}
> ./kafka-topics.sh --zookeeper zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr --create --topic danols-ts-4 --partitions 10 --replication-factor 3
> Created topic danols-ts-4.
> {code}
>  
>  
> Then when describing the topic it looks like this:
>  
> {code:java}
> ./kafka-topics.sh --zookeeper zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr --describe --topic danols-ts-4
> Topic: danols-ts-4	PartitionCount: 10	ReplicationFactor: 3	Configs: 
> 	Topic: danols-ts-4	Partition: 0	Leader: none	Replicas: 1003,1001,1002	Isr: 
> 	Topic: danols-ts-4	Partition: 1	Leader: none	Replicas: 1001,1002,1003	Isr: 
> 	Topic: danols-ts-4	Partition: 2	Leader: none	Replicas: 1002,1003,1001	Isr: 
> 	Topic: danols-ts-4	Partition: 3	Leader: none	Replicas: 1003,1002,1001	Isr: 
> 	Topic: danols-ts-4	Partition: 4	Leader: none	Replicas: 1001,1003,1002	Isr: 
> 	Topic: danols-ts-4	Partition: 5	Leader: none	Replicas: 1002,1001,1003	Isr: 
> 	Topic: danols-ts-4	Partition: 6	Leader: none	Replicas: 1003,1001,1002	Isr: 
> 	Topic: danols-ts-4	Partition: 7	Leader: none	Replicas: 1001,1002,1003	Isr: 
> 	Topic: danols-ts-4	Partition: 8	Leader: none	Replicas: 1002,1003,1001	Isr: 
> 	Topic: danols-ts-4	Partition: 9	Leader: none	Replicas: 1003,1002,1001	Isr: 
> {code}
>  
>  
> If I describe using the --bootstrap-server flag instead it doesn't exist
> {code:java}
> ./kafka-topics.sh --bootstrap-server localhost:9093 --describe --topic danols-ts-4
> Error while executing topic command : Topic 'danols-ts-4' does not exist as expected
> [2021-10-27 09:24:15,691] ERROR java.lang.IllegalArgumentException: Topic 'danols-ts-4' does not exist as expected
> 	at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:542)
> 	at kafka.admin.TopicCommand$AdminClientTopicService.describeTopic(TopicCommand.scala:317)
> 	at kafka.admin.TopicCommand$.main(TopicCommand.scala:69)
> 	at kafka.admin.TopicCommand.main(TopicCommand.scala)
>  (kafka.admin.TopicCommand$)
> {code}
> If I restart the Kafka service on a broker that is *not* "Controller" it doesn't come up again. It just spams the following in the log for all different partitions and it's not possible to connect to it.
> {code:java}
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0.
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:48,984] INFO [Admin Manager on Broker 1001]: Error processing create topic request CreatableTopic(name='example-message-consumer', numPartitions=10, replicationFactor=3, assignments=[], configs=[]) (kafka.server.ZkAdminManager)
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0.
> Oct 27 09:27:49 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:49,021] INFO [Admin Manager on Broker 1001]: Error processing create topic request CreatableTopic(name='__consumer_offsets', numPartitions=50, replicationFactor=3, assignments=[], configs=[CreateableTopicConfig(name='compression.type', value='producer'), CreateableTopicConfig(name='cleanup.policy', value='compact'), CreateableTopicConfig(name='segment.bytes', value='104857600')]) (kafka.server.ZkAdminManager)
> {code}
> To get out of this state I simply restart the Kafka service on the "Controller". It will make another become "Controller" and all brokers becomes OK again and the newly created topic gets all expected ISRs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)