
Posted to dev@kafka.apache.org by "Ivan Yurchenko (Jira)" <ji...@apache.org> on 2020/03/06 12:13:00 UTC

[jira] [Created] (KAFKA-9672) Dead broker in ISR cause isr-expiration to fail with exception

Ivan Yurchenko created KAFKA-9672:
-------------------------------------

             Summary: Dead broker in ISR cause isr-expiration to fail with exception
                 Key: KAFKA-9672
                 URL: https://issues.apache.org/jira/browse/KAFKA-9672
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.4.0, 2.4.1
            Reporter: Ivan Yurchenko


We're running Kafka 2.4 and facing a pretty strange situation.
 Let's say there were three brokers in the cluster: 0, 1, and 2. Then:
 1. Broker 3 was added.
 2. Partitions were reassigned from broker 0 to broker 3.
 3. Broker 0 was shut down (not gracefully) and removed from the cluster.
 4. We see the following state in ZooKeeper:
{code:java}
ls /brokers/ids
[1, 2, 3]

get /brokers/topics/foo
{"version":2,"partitions":{"0":[2,1,3]},"adding_replicas":{},"removing_replicas":{}}

get /brokers/topics/foo/partitions/0/state
{"controller_epoch":123,"leader":1,"version":1,"leader_epoch":42,"isr":[0,2,3,1]}
{code}
This means that the dead broker 0 remains in the partition's ISR. A large share of the partitions in the cluster have this issue.
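
To make the mismatch concrete, here is a minimal sketch in Python that compares the two ZooKeeper outputs above and lists the ISR members that are no longer registered under {{/brokers/ids}}. The function name {{dead_isr_members}} is hypothetical, and the inputs are simply the values pasted from the ZooKeeper shell; this is only an illustration of the check, not a tool that ships with Kafka.

```python
import json


def dead_isr_members(live_broker_ids, partition_state_json):
    """Return the ISR entries that are not among the live broker ids.

    live_broker_ids: the ids listed under /brokers/ids.
    partition_state_json: the JSON stored in
        /brokers/topics/<topic>/partitions/<n>/state.
    """
    state = json.loads(partition_state_json)
    live = set(live_broker_ids)
    # Preserve ISR order so the output is easy to compare with ZooKeeper.
    return [broker_id for broker_id in state["isr"] if broker_id not in live]


# Values copied from the ZooKeeper output above.
live_brokers = [1, 2, 3]
state_json = (
    '{"controller_epoch":123,"leader":1,"version":1,'
    '"leader_epoch":42,"isr":[0,2,3,1]}'
)

print(dead_isr_members(live_brokers, state_json))  # [0]
```

For the state shown above, the only ISR member missing from the live brokers is broker 0.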

This is causing errors:
{code:java}
Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
org.apache.kafka.common.errors.ReplicaNotAvailableException: Replica with id 12 is not available on broker 17
{code}
This means that the {{isr-expiration}} task is effectively no longer working.

I have a suspicion that this was introduced by [this commit (line selected)|https://github.com/apache/kafka/commit/57baa4079d9fc14103411f790b9a025c9f2146a4#diff-5450baca03f57b9f2030f93a480e6969R856]

Unfortunately, I haven't been able to reproduce this in isolation.

Any hints about how to reproduce (so I can write a patch) or mitigate the issue on a running cluster are welcome.

Generally, I assume the problem would be fixed by not throwing {{ReplicaNotAvailableException}} for a dead (i.e. non-existent) broker, and instead considering such a broker out-of-sync and removing it from the ISR.
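
The idea can be sketched as a toy model in Python. This is not Kafka's code: the function names, the {{last_caught_up_ms}} map, and the {{max_lag_ms}} parameter are all hypothetical stand-ins chosen for illustration. The only point is that an ISR member with no known replica state is treated as out-of-sync instead of raising an exception.

```python
def out_of_sync_replicas(isr, leader_id, last_caught_up_ms, now_ms, max_lag_ms):
    """Return the set of ISR members that should be considered out-of-sync.

    last_caught_up_ms maps replica id -> last time it was caught up.
    A replica id missing from that map (e.g. a dead broker) is treated as
    out-of-sync rather than causing an error.
    """
    lagging = set()
    for replica_id in isr:
        if replica_id == leader_id:
            continue  # the leader is always in sync with itself
        caught_up = last_caught_up_ms.get(replica_id)
        if caught_up is None or now_ms - caught_up > max_lag_ms:
            lagging.add(replica_id)
    return lagging


def shrink_isr(isr, leader_id, last_caught_up_ms, now_ms, max_lag_ms):
    """Return the ISR with out-of-sync members removed, order preserved."""
    lagging = out_of_sync_replicas(
        isr, leader_id, last_caught_up_ms, now_ms, max_lag_ms
    )
    return [replica_id for replica_id in isr if replica_id not in lagging]


# ISR and leader from the partition state above; timestamps are made up.
isr = [0, 2, 3, 1]
known = {1: 1000, 2: 990, 3: 995}  # broker 0 is absent: it no longer exists
print(shrink_isr(isr, leader_id=1, last_caught_up_ms=known,
                 now_ms=1000, max_lag_ms=30))  # [2, 3, 1]
```

With this behaviour the periodic task would complete and drop broker 0 from the ISR, rather than aborting on the first unknown replica.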

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)