Posted to users@kafka.apache.org by "Olsson, Daniel" <Ol...@DNB.com.INVALID> on 2021/10/27 07:35:15 UTC

Replica leader election stops when zk connection is re-established

Hi all!

We have had a problem with 6 of our Kafka clusters since we upgraded them from 2.3.1 to 2.8.0 a few months back. A seventh cluster is still on 2.3.1 and has never had this problem.
The clusters run fine for a random period, days or weeks. Then, suddenly, newly created topics never get partitions assigned: they have no ISR and the leader is "none". Browsing such a topic with zkCli shows that it has no partitions.
When this happens, we have been forced to restart the Kafka service on the "controller" host, which causes a new controller to be elected and solves the problem.
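
A minimal sketch of how the manual zkCli check described above could be scripted, offered as an illustration and not as a diagnosis. It assumes the usual ZooKeeper layout used by Kafka, where /controller holds a JSON payload with a "brokerid" field and /brokers/topics/<topic>/partitions/<n>/state holds a JSON payload with "leader" and "isr" fields. The function names and the sample payloads are hypothetical; fetching the raw bytes from ZooKeeper (with zkCli or a client library) is left out so that the sketch stays self-contained.

```python
import json
from typing import Optional


def controller_broker_id(controller_raw: bytes) -> int:
    """Return the broker id recorded in a /controller znode payload."""
    return int(json.loads(controller_raw)["brokerid"])


def partition_summary(state_raw: Optional[bytes]) -> dict:
    """Summarise a partition state znode payload.

    A missing znode (None) is reported as having no leader and an empty ISR,
    which is an assumption about how the symptom would show up in ZooKeeper.
    """
    if state_raw is None:
        return {"leader": None, "isr": []}
    state = json.loads(state_raw)
    return {"leader": state["leader"], "isr": state["isr"]}


# Made-up sample payloads, for illustration only.
sample_controller = b'{"version":1,"brokerid":1002,"timestamp":"1635279426099"}'
sample_state = (
    b'{"controller_epoch":12,"leader":1002,"leader_epoch":3,'
    b'"isr":[1002,1001,1003],"version":1}'
)

print(controller_broker_id(sample_controller))
print(partition_summary(sample_state))
print(partition_summary(None))
```

Comparing the broker id from /controller with the host that is actually logging controller messages, and checking whether a new topic ever gets state znodes, would at least show which side of the Kafka/ZooKeeper boundary is out of date.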

I've found that after the Zookeeper leader host rebooted, the Kafka "controller" host stopped logging the "Processing automatic preferred replica leader election" messages, even though it reconnected to Zookeeper fine. This seems related.
Running kafka-leader-election.sh (with --bootstrap-server) for all topics fails, reporting that none of the partitions/topics exist:

Oct 26 20:17:06 ip-10-227-143-9 kafka[598]: [2021-10-26 20:17:06,099] INFO [Controller id=1002] Skipping replica leader election (PREFERRED) for partition example-topic-51 by AdminClientTriggered since it doesn't exist. (kafka.controller.KafkaController)
Oct 26 20:17:06 ip-10-227-143-9 kafka[598]: [2021-10-26 20:17:06,099] INFO [Controller id=1002] Skipping replica leader election (PREFERRED) for partition another-example-topic-7 by AdminClientTriggered since it doesn't exist. (kafka.controller.KafkaController)
etc..
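
To see how many partitions the controller considers non-existent, the "Skipping replica leader election" lines above can be collected from the log. The following is a rough sketch, assuming the log format is exactly as in the two lines quoted above; the regular expression and the function name are hypothetical helpers, not part of Kafka.

```python
import re

# Matches the controller log message quoted above and captures the pieces
# of interest. The topic name may itself contain dashes, so the partition
# number is taken as the last dash-separated numeric suffix.
SKIP_RE = re.compile(
    r"\[Controller id=(?P<controller>\d+)\] Skipping replica leader election "
    r"\((?P<type>\w+)\) for partition (?P<topic>\S+)-(?P<partition>\d+) "
    r"by (?P<trigger>\w+) since it doesn't exist"
)


def skipped_partitions(lines):
    """Return (topic, partition) pairs for every skipped election in lines."""
    result = []
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            result.append((match.group("topic"), int(match.group("partition"))))
    return result


# The two log lines from the message above.
sample_lines = [
    "Oct 26 20:17:06 ip-10-227-143-9 kafka[598]: [2021-10-26 20:17:06,099] INFO "
    "[Controller id=1002] Skipping replica leader election (PREFERRED) for "
    "partition example-topic-51 by AdminClientTriggered since it doesn't exist. "
    "(kafka.controller.KafkaController)",
    "Oct 26 20:17:06 ip-10-227-143-9 kafka[598]: [2021-10-26 20:17:06,099] INFO "
    "[Controller id=1002] Skipping replica leader election (PREFERRED) for "
    "partition another-example-topic-7 by AdminClientTriggered since it doesn't exist. "
    "(kafka.controller.KafkaController)",
]

skipped = skipped_partitions(sample_lines)
print(skipped)
```

If the resulting list covers every partition in the cluster, including topics that are still being consumed from, that would support the suspicion that the controller's in-memory view is empty or stale.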

However, it is still possible to consume messages from an old topic through the same bootstrap server. So it looks like the Kafka controller ends up in a limbo state where it is connected to and registered with Zookeeper, but no longer gets any data from Zookeeper.

I have still not been able to find a good way to reproduce this.
There are no errors or warnings in the logs on either Zookeeper or Kafka.

Zookeeper is running 3.5.9 with 3 nodes.
The Kafka clusters have 3, 5 or 7 nodes.

Does anybody have an idea what is happening here?
What triggers the automatic preferred replica leader elections? Is that Zookeeper?

Thanks!