Posted to dev@kafka.apache.org by "CS User (Jira)" <ji...@apache.org> on 2021/08/11 10:35:00 UTC

[jira] [Created] (KAFKA-13191) Kafka 2.8 - simultaneous restarts of Kafka and zookeeper result in broken cluster

CS User created KAFKA-13191:
-------------------------------

             Summary: Kafka 2.8 - simultaneous restarts of Kafka and zookeeper result in broken cluster
                 Key: KAFKA-13191
                 URL: https://issues.apache.org/jira/browse/KAFKA-13191
             Project: Kafka
          Issue Type: Bug
          Components: protocol
    Affects Versions: 2.8.0
            Reporter: CS User


We're using Confluent Platform 6.2, running in a Kubernetes environment. The cluster has been running for a couple of years with zero issues, starting from version 1.1, then 2.5 and now 2.8.

We've very recently upgraded to kafka 2.8 from kafka 2.5. 

Since upgrading, we have seen issues when kafka and zookeeper pods restart concurrently.

We can replicate the issue by restarting either the zookeeper statefulset first or the kafka statefulset first; either way appears to result in the same failure scenario.

We've attempted to mitigate this by preventing the kafka pods from stopping if any zookeeper pods are being restarted, or if a rolling restart of the zookeeper cluster is underway.

We've also added a check to stop the kafka pods from starting until all zookeeper pods are ready (a sketch of this kind of gate follows the scenario below); however, under the following scenario we still see the issue:

In a 3-node kafka cluster with 5 zookeeper servers:
 # kafka-2 starts to terminate - all zookeeper pods are running, so it proceeds
 # zookeeper-4 terminates
 # kafka-2 starts-up, and waits until the zookeeper rollout completes
 # kafka-2 eventually fully starts, kafka comes up and we see the errors below on other pods in the cluster. 
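
To give an idea of the kind of start-up gate mentioned above, here is a minimal sketch that waits until every zookeeper server answers the ruok four-letter command with imok. The host names, port, and the assumption that ruok is whitelisted (4lw.commands.whitelist) are illustrative only, not taken from our actual manifests:
{noformat}
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical pre-start gate: block until every ZooKeeper server answers
// "ruok" with "imok". Host names/port are placeholders; assumes the 4lw
// command "ruok" is whitelisted on the servers.
public class ZkReadyGate {
    private static final String[] ZK_HOSTS = {
        "zookeeper-0", "zookeeper-1", "zookeeper-2", "zookeeper-3", "zookeeper-4"
    };

    public static void main(String[] args) throws InterruptedException {
        for (String host : ZK_HOSTS) {
            while (!isOk(host, 2181)) {
                System.out.println(host + " not ready yet, waiting...");
                Thread.sleep(5000);
            }
        }
        System.out.println("All zookeeper servers answered imok; safe to start the broker.");
    }

    private static boolean isOk(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000);
            OutputStream out = s.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.UTF_8));
            out.flush();
            s.shutdownOutput();
            byte[] buf = new byte[4];
            int n = s.getInputStream().read(buf);
            return n == 4 && "imok".equals(new String(buf, StandardCharsets.UTF_8));
        } catch (Exception e) {
            return false;
        }
    }
}
{noformat}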

Without the mitigation, and in the above scenario, we see these errors on pods kafka-0/kafka-1:
{noformat}
Broker had a stale broker epoch (635655171205), retrying. (kafka.server.DefaultAlterIsrManager) {noformat}
This never appears to recover. 

If you then restart kafka-2, you'll see these errors:
{noformat}
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0. {noformat}
This seems to completely break the cluster; partitions do not fail over as expected.
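
One quick way to see the inconsistency from the kafka side is to ask the cluster which brokers it currently advertises and compare that with the registrations in zookeeper. A minimal AdminClient sketch (the bootstrap address is a placeholder for our environment):
{noformat}
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

// Diagnostic sketch: list the brokers the cluster metadata currently reports
// as live, plus the current controller. Bootstrap address is a placeholder.
public class LiveBrokerCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-0:9092");
        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            for (Node node : cluster.nodes().get()) {
                System.out.println("Live broker " + node.idString() + " at " + node.host() + ":" + node.port());
            }
            System.out.println("Controller: " + cluster.controller().get());
        }
    }
}
{noformat}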

 

Checking zookeeper and getting the values of the brokers, e.g.
{noformat}
 get /brokers/ids/0{noformat}
etc., all looks fine there; each broker is present.
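
The same check can also be done programmatically, and, if I'm reading KIP-380 correctly, the broker epoch in the "stale broker epoch" message is derived from the czxid of the broker's registration znode, so it can be compared against the epoch in the log line above. A sketch with the plain zookeeper client (the connect string is a placeholder, and the czxid detail is my understanding rather than something I've confirmed in the source):
{noformat}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch: list the broker registrations in ZooKeeper and print each znode's
// czxid (which, per my reading of KIP-380, is what the broker epoch is based
// on) plus the registration data. Connect string is a placeholder.
public class BrokerRegistrationCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zookeeper-0:2181", 15000, event -> { });
        try {
            for (String id : zk.getChildren("/brokers/ids", false)) {
                Stat stat = new Stat();
                byte[] data = zk.getData("/brokers/ids/" + id, false, stat);
                System.out.println("broker " + id
                        + " epoch(czxid)=" + stat.getCzxid()
                        + " data=" + new String(data, StandardCharsets.UTF_8));
            }
        } finally {
            zk.close();
        }
    }
}
{noformat}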

 

This error message appears to have been added to kafka in the last 11 months 
{noformat}
Broker had a stale broker epoch {noformat}
Via this PR:

[https://github.com/apache/kafka/pull/9100]

I also see this comment about the leader getting stuck:

https://github.com/apache/kafka/pull/9100/files#r494480847

 

Recovery is possible by continuing to restart the remaining brokers in the cluster. Once all have been restarted, everything looks fine.
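
For anyone trying to reproduce this, one way to confirm whether partitions have actually recovered is to flag any partition that is leaderless or whose ISR is smaller than its replica set. A sketch using the AdminClient (again, the bootstrap address is a placeholder):
{noformat}
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

// Recovery check sketch: report partitions with no leader or a shrunken ISR.
// Bootstrap address is a placeholder for our environment.
public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-0:9092");
        try (Admin admin = Admin.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            for (TopicDescription desc : admin.describeTopics(topics).all().get().values()) {
                for (TopicPartitionInfo p : desc.partitions()) {
                    if (p.leader() == null || p.isr().size() < p.replicas().size()) {
                        System.out.println(desc.name() + "-" + p.partition()
                                + " leader=" + p.leader()
                                + " isr=" + p.isr().size() + "/" + p.replicas().size());
                    }
                }
            }
        }
    }
}
{noformat}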

Has anyone else come across this? It seems very simple to replicate in our environment: simply start a simultaneous rolling restart of both kafka and zookeeper.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)