Posted to users@kafka.apache.org by Radu Radutiu <rr...@gmail.com> on 2019/03/20 14:27:08 UTC

Kafka cluster not recovering after ZooKeeper and node failures

Hello Kafka users,

We have tested failure scenarios and found the following situation in which
the Kafka cluster will not recover automatically.

Cluster setup: 3 VMs (n1, n2, n3) running CentOS 7; each VM runs one
ZooKeeper v3.4.13 and one Kafka v2.1.0 instance, configured as systemd
services, on OpenJDK 1.8.0_191.
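
For reference, a sketch of the per-broker settings involved (hostnames and
broker ids match the setup described above; the last value is the Kafka 2.1
default, quoted for illustration rather than copied from our files):

# server.properties on n1 (broker.id is 2 and 3 on the other nodes)
broker.id=1
zookeeper.connect=n1:2181,n2:2181,n3:2181
# default; this is the replication factor named in the errors below
offsets.topic.replication.factor=3
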
Current situation: n1 is the Kafka controller; n2 and n3 are leaders for some
partitions:

~/kafka/bin/kafka-topics.sh --zookeeper n1:2181 --describe --topic
__consumer_offsets
Topic:__consumer_offsets PartitionCount:1 ReplicationFactor:3
Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2


If I reboot both n2 and n3 simultaneously, causing ZooKeeper to lose quorum
and topics to lose their leaders, the Kafka cluster never recovers. ZooKeeper
regains quorum as soon as n2 and n3 are back up and n1 remains the Kafka
controller, but I get the following error in all 3 Kafka logs, repeated
forever:

[2019-03-20 09:43:02,524] ERROR [KafkaApi-2] Number of alive brokers '0'
does not meet the required replication factor '3' for the offsets topic
(configured via 'offsets.topic.replication.factor'). This error can be
ignored if the cluster is starting up and not all brokers are up yet.
(kafka.server.KafkaApis)
[2019-03-20 09:43:02,830] ERROR [KafkaApi-2] Number of alive brokers '0'
does not meet the required replication factor '3' for the offsets topic
(configured via 'offsets.topic.replication.factor'). This error can be
ignored if the cluster is starting up and not all brokers are up yet.
(kafka.server.KafkaApis)
[2019-03-20 09:43:03,486] ERROR [KafkaApi-2] Number of alive brokers '0'
does not meet the required replication factor '3' for the offsets topic
(configured via 'offsets.topic.replication.factor'). This error can be
ignored if the cluster is starting up and not all brokers are up yet.
(kafka.server.KafkaApis)
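
(For anyone trying to reproduce this: the controller znode and the live
broker registrations can be inspected with the zookeeper-shell.sh tool that
ships with Kafka, using the standard znode paths; output elided here.)

~/kafka/bin/zookeeper-shell.sh n1:2181 get /controller
~/kafka/bin/zookeeper-shell.sh n1:2181 ls /brokers/ids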

If I restart the Kafka process on n1 (the old controller), the cluster fully
recovers. However, the old controller does not shut down gracefully: I see
"Retrying controlled shutdown after the previous attempt failed..." in the
logs, and the process is eventually killed by systemd.
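
The retry message above comes from the controlled shutdown loop; the broker
settings that govern it are listed below (these are the documented Kafka
defaults, quoted for reference rather than taken from our configuration):

# broker defaults for controlled shutdown (Kafka 2.1)
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000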

I could not reproduce the problem if one of the rebooted nodes is the
controller. It looks to me like a race condition, as I can only reproduce it
about half the time.
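
For completeness, the reproduction step itself is just a simultaneous reboot
of the two non-controller nodes, e.g. (a hypothetical one-liner, assuming SSH
access to the VMs):

ssh n2 'sudo reboot' & ssh n3 'sudo reboot' & wait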

Best regards,
Radu