Posted to jira@kafka.apache.org by "Jose Armando Garcia Sancio (Jira)" <ji...@apache.org> on 2020/11/20 20:10:00 UTC

[jira] [Commented] (KAFKA-9672) Dead brokers in ISR cause isr-expiration to fail with exception

    [ https://issues.apache.org/jira/browse/KAFKA-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236416#comment-17236416 ] 

Jose Armando Garcia Sancio commented on KAFKA-9672:
---------------------------------------------------

I was not able to reproduce this issue, but looking at the code and the trace of messages sent by the controller, this is what I think is happening.

Assuming that the initial partition assignment and state is:
{code:java}
Replicas: 0, 1, 2
ISR: 0, 1, 2
Leader: 0
LeaderEpoch: 1{code}
This state is replicated to all of the replicas (0, 1, 2) using LeaderAndIsr requests.
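
For reference, here is a minimal Java model of the per-partition state that these LeaderAndIsr requests carry. The {{PartitionState}} class below is purely illustrative (it is not Kafka's actual request schema); its fields just mirror the state dumps in this comment.
{code:java}
import java.util.List;

// Illustrative model only; not Kafka's LeaderAndIsrRequest classes.
public final class PartitionState {
    final List<Integer> replicas;          // full replica assignment
    final List<Integer> addingReplicas;    // replicas being added by a reassignment
    final List<Integer> removingReplicas;  // replicas being removed by a reassignment
    final List<Integer> isr;               // in-sync replicas
    final int leader;                      // broker id of the current leader
    final int leaderEpoch;                 // bumped on every leadership change

    PartitionState(List<Integer> replicas, List<Integer> addingReplicas,
                   List<Integer> removingReplicas, List<Integer> isr,
                   int leader, int leaderEpoch) {
        this.replicas = replicas;
        this.addingReplicas = addingReplicas;
        this.removingReplicas = removingReplicas;
        this.isr = isr;
        this.leader = leader;
        this.leaderEpoch = leaderEpoch;
    }

    // The initial state shown above.
    static PartitionState initial() {
        return new PartitionState(List.of(0, 1, 2), List.of(), List.of(),
                                  List.of(0, 1, 2), 0, 1);
    }
}
{code}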

When the user starts a reassignment that replaces replica 0 with replica 3, the controller bumps the leader epoch and updates the assignment info:
{code:java}
Replicas: 0, 1, 2, 3
Adding: 3
Removing: 0
ISR: 0, 1, 2
Leader: 0
LeaderEpoch: 2{code}
This state is replicated to all of the replicas (0, 1, 2, 3) using LeaderAndIsr requests.
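
As a side note, the Adding and Removing sets above are just the difference between the original and target assignments. A rough sketch of that arithmetic (not the controller's actual code):
{code:java}
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class ReassignmentMath {
    public static void main(String[] args) {
        List<Integer> original = List.of(0, 1, 2);
        List<Integer> target = List.of(1, 2, 3);   // replace 0 with 3

        Set<Integer> replicas = new LinkedHashSet<>(original);
        replicas.addAll(target);                   // [0, 1, 2, 3]

        Set<Integer> adding = new LinkedHashSet<>(target);
        adding.removeAll(original);                // [3]

        Set<Integer> removing = new LinkedHashSet<>(original);
        removing.removeAll(target);                // [0]

        System.out.println("Replicas: " + replicas
            + ", Adding: " + adding + ", Removing: " + removing);
    }
}
{code}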

The system roughly stays in this state until all of the target replicas have joined the ISR. Once all of the target replicas have joined the ISR, the controller wants to perform the following flow:

1 - The controller moves the leader if necessary (i.e. if the current leader is not in the new replica set) and stops the leader from letting "removing" replicas join the ISR.

The second requirement (stopping the leader from adding "removing" replicas to the ISR) is accomplished by bumping the leader epoch and sending the new leader epoch only to the target replicas (1, 2, 3). Unfortunately, due to how the controller is implemented, this is done by deleting the "removing" replicas from the in-memory state without modifying the ISR state. At this point we have the following ZK state:
{code:java}
Replicas: 0, 1, 2, 3
Adding:
Removing: 0
ISR: 0, 1, 2, 3
Leader: 1
LeaderEpoch: 3{code}
but the following LeaderAndIsr request is sent to replicas 1, 2, and 3:
{code:java}
Replicas: 1, 2, 3
Adding:
Removing:
ISR: 0, 1, 2, 3
Leader: 1
LeaderEpoch: 3{code}
This works because replica 0 will have a stale leader epoch, which means that its Fetch requests will be rejected by the (new) leader.
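
To make the fencing concrete, here is a toy version of the epoch check. The real logic lives in the broker's fetch handling, so treat this as an illustration only:
{code:java}
public final class EpochFencing {
    // A fetch is only accepted if the follower knows the current leader epoch.
    static boolean fetchAccepted(int epochInFetchRequest, int currentLeaderEpoch) {
        return epochInFetchRequest == currentLeaderEpoch;
    }

    public static void main(String[] args) {
        int currentLeaderEpoch = 3; // the new leader (broker 1) is on epoch 3

        // Replica 0 never received the new LeaderAndIsr request and is stuck on epoch 2.
        System.out.println(fetchAccepted(2, currentLeaderEpoch)); // false -> fenced
        // Replicas 2 and 3 received epoch 3 and keep fetching normally.
        System.out.println(fetchAccepted(3, currentLeaderEpoch)); // true
    }
}
{code}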

2 - The controller removes replica 0 from the ISR by updating ZK and sending the appropriate LeaderAndIsr requests.

3 - The controller removes replica 0 from the replica set by updating ZK and sending the appropriate LeaderAndIsr requests.
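
Steps 2 and 3 amount to two removals against the ZK state from step 1, roughly as follows (epoch bookkeeping is omitted, so this is illustrative only):
{code:java}
import java.util.ArrayList;
import java.util.List;

public final class FinishReassignment {
    public static void main(String[] args) {
        // ZK state after step 1 (see the dump above).
        List<Integer> replicas = new ArrayList<>(List.of(0, 1, 2, 3));
        List<Integer> isr = new ArrayList<>(List.of(0, 1, 2, 3));

        // Step 2: drop the removed replica from the ISR and persist to ZK.
        isr.remove(Integer.valueOf(0));
        System.out.println("ISR: " + isr);           // [1, 2, 3]

        // Step 3: drop it from the replica set as well and persist to ZK.
        replicas.remove(Integer.valueOf(0));
        System.out.println("Replicas: " + replicas); // [1, 2, 3]
    }
}
{code}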

 

Conclusion

If this flow executes to completion, everything is okay. The problem is what happens if steps 2 and 3 don't get to execute. I am unable to reproduce this with tests or by walking the code, but if steps 2 and 3 don't execute while the controller stays alive, there is a flow where the controller persists the following state to ZK:
{code:java}
Replicas: 1, 2, 3
Adding:
Removing:
ISR: 0, 1, 2, 3
Leader: 1
LeaderEpoch: 3{code}
This causes the reassignment flow to terminate with the system stuck in this state. The state is persisted at this line in the controller code:

https://github.com/apache/kafka/blob/43fd630d80332f2b3b3512a712100825a8417704/core/src/main/scala/kafka/controller/KafkaController.scala#L728
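
To illustrate why the stuck state matters, here is a rough model of the failure reported below. I am assuming the {{isr-expiration}} task has to resolve every ISR member to a replica the brokers still know about; this is a simplification, not the actual Partition/ReplicaManager code:
{code:java}
import java.util.List;
import java.util.Set;

public final class IsrExpirationModel {
    public static void main(String[] args) {
        Set<Integer> knownReplicas = Set.of(1, 2, 3); // replica set after the reassignment
        List<Integer> isr = List.of(0, 1, 2, 3);      // stuck ISR from the state above

        for (int replicaId : isr) {
            if (!knownReplicas.contains(replicaId)) {
                // The real task surfaces this as a ReplicaNotAvailableException and the
                // scheduled run dies, so the ISR is never shrunk.
                throw new IllegalStateException(
                    "Replica with id " + replicaId + " is not available");
            }
        }
    }
}
{code}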

> Dead brokers in ISR cause isr-expiration to fail with exception
> ---------------------------------------------------------------
>
>                 Key: KAFKA-9672
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9672
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.0, 2.4.1
>            Reporter: Ivan Yurchenko
>            Assignee: Jose Armando Garcia Sancio
>            Priority: Major
>
> We're running Kafka 2.4 and facing a pretty strange situation.
>  Let's say there were three brokers in the cluster 0, 1, and 2. Then:
>  1. Broker 3 was added.
>  2. Partitions were reassigned from broker 0 to broker 3.
>  3. Broker 0 was shut down (not gracefully) and removed from the cluster.
>  4. We see the following state in ZooKeeper:
> {code:java}
> ls /brokers/ids
> [1, 2, 3]
> get /brokers/topics/foo
> {"version":2,"partitions":{"0":[2,1,3]},"adding_replicas":{},"removing_replicas":{}}
> get /brokers/topics/foo/partitions/0/state
> {"controller_epoch":123,"leader":1,"version":1,"leader_epoch":42,"isr":[0,2,3,1]}
> {code}
> It means the dead broker 0 remains in the partition's ISR. A big share of the partitions in the cluster have this issue.
> This is actually causing errors:
> {code:java}
> Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.ReplicaNotAvailableException: Replica with id 12 is not available on broker 17
> {code}
> It means that, effectively, the {{isr-expiration}} task is not working any more.
> I have a suspicion that this was introduced by [this commit (line selected)|https://github.com/apache/kafka/commit/57baa4079d9fc14103411f790b9a025c9f2146a4#diff-5450baca03f57b9f2030f93a480e6969R856]
> Unfortunately, I haven't been able to reproduce this in isolation.
> Any hints about how to reproduce (so I can write a patch) or mitigate the issue on a running cluster are welcome.
> Generally, I assume that not throwing {{ReplicaNotAvailableException}} on a dead (i.e. non-existent) broker, but instead considering it out-of-sync and removing it from the ISR, should fix the problem.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)