Posted to jira@kafka.apache.org by "Jason Gustafson (Jira)" <ji...@apache.org> on 2020/02/03 17:16:01 UTC

[jira] [Updated] (KAFKA-9484) Unnecessary LeaderAndIsr update following reassignment completion

     [ https://issues.apache.org/jira/browse/KAFKA-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson updated KAFKA-9484:
-----------------------------------
    Description: 
Following the completion of the reassignment, the controller executes two steps: first, it elects a new leader (if needed) and sends a LeaderAndIsr update (in any case) with the new target replica set; second, it removes unneeded replicas from the replica set and sends another round of LeaderAndIsr updates. I doubt the first round of updates is needed in the case that the leader doesn't need changing. 

For example, suppose we have the following reassignment state: 

replicas=[1,2,3,4], adding=[4], removing=[1], isr=[1,2,3,4], leader=2, epoch=10

First the controller will bump the epoch with the target replica set, which will result in a round of LeaderAndIsr requests to the target replica set with the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3,4], leader=2, epoch=11 

Immediately following this, the controller will bump the epoch again and remove the unneeded replica. This will result in another round of LeaderAndIsr requests with the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[2,3,4], leader=2, epoch=12 
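
The two rounds above can be sketched as plain state transitions. This is only an illustrative model, not the controller's actual code: the `PartitionState` class and the two helper functions are hypothetical names, and the target replica set is assumed to be the current replica set minus the `removing` entries.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class PartitionState:
    # Hypothetical model of the fields shown in the example states.
    replicas: Tuple[int, ...]
    adding: Tuple[int, ...]
    removing: Tuple[int, ...]
    isr: Tuple[int, ...]
    leader: int
    epoch: int


def apply_target_replicas(state: PartitionState) -> PartitionState:
    """Round 1: switch to the target replica set and bump the epoch (ISR untouched)."""
    target = tuple(r for r in state.replicas if r not in state.removing)
    return replace(state, replicas=target, adding=(), removing=(),
                   epoch=state.epoch + 1)


def shrink_isr(state: PartitionState) -> PartitionState:
    """Round 2: drop ISR members outside the replica set and bump the epoch again."""
    isr = tuple(r for r in state.isr if r in state.replicas)
    return replace(state, isr=isr, epoch=state.epoch + 1)


start = PartitionState(replicas=(1, 2, 3, 4), adding=(4,), removing=(1,),
                       isr=(1, 2, 3, 4), leader=2, epoch=10)
after_round_one = apply_target_replicas(start)
after_round_two = shrink_isr(after_round_one)

print(after_round_one)
print(after_round_two)
```

Under these assumptions the sketch reproduces the epoch=11 and epoch=12 states listed above; the leader is 2 throughout, so the only net change between the two rounds is the ISR shrink.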

The first round of LeaderAndIsr updates puzzles me a bit. It is justified in the code with this comment: 

{code} 
B3. Send a LeaderAndIsr request with RS = TRS. This will prevent the leader from adding any replica in TRS - ORS back in the isr. 
{code} 
(I think the comment is backwards. It should be ORS (original replica set) - TRS (target replica set).) 
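
To make the direction of the set difference concrete, here is a small sketch using the example above. It assumes ORS is the current replica set minus the `adding` entries and TRS is the current replica set minus the `removing` entries; the variable names are illustrative only.

```python
# Example reassignment state from the description.
replicas = {1, 2, 3, 4}
adding = {4}
removing = {1}

# Assumption: the original and target sets are derived from the in-flight state.
ors = replicas - adding    # original replica set
trs = replicas - removing  # target replica set

trs_minus_ors = trs - ors  # replicas that are being added
ors_minus_trs = ors - trs  # replicas that are being removed

print("TRS - ORS:", sorted(trs_minus_ors))
print("ORS - TRS:", sorted(ors_minus_trs))
```

With these inputs, TRS - ORS contains only replica 4, which must stay in the ISR, while ORS - TRS contains only replica 1, which is the one the update would keep out.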

It sounds like we are trying to prevent a member of ORS from being added back to the ISR, but even if it did get added, it would be removed in the next step anyway. In the uncommon case that an ORS replica is out of sync, there does not seem to be any benefit to this first update since it is basically paying the cost of one write in order to save the speculative cost of one write. Additionally, it would be useful if the protocol could enforce the invariant that the ISR is always a subset of the replica set.
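
As a rough illustration of the invariant mentioned above, the check below tests whether an ISR is a subset of its replica set. The function name is hypothetical and the two inputs are the epoch=11 and epoch=12 states from the example.

```python
from typing import Iterable


def isr_within_replicas(replicas: Iterable[int], isr: Iterable[int]) -> bool:
    """Return True when every ISR member also belongs to the replica set."""
    return set(isr) <= set(replicas)


# State after the first round of updates (epoch=11).
intermediate_ok = isr_within_replicas(replicas=[2, 3, 4], isr=[1, 2, 3, 4])

# State after the second round of updates (epoch=12).
final_ok = isr_within_replicas(replicas=[2, 3, 4], isr=[2, 3, 4])

print("epoch 11 satisfies invariant:", intermediate_ok)
print("epoch 12 satisfies invariant:", final_ok)
```

The intermediate state fails the check because replica 1 is still in the ISR but no longer in the replica set, so it is the first round of updates that breaks the invariant.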

  was:
Following the completion of the reassignment, the controller executes two steps: first, it elects a new leader (if needed) and sends a LeaderAndIsr update (in any case) with the new target replica set; second, it removes unneeded replicas from the replica set and sends another round of LeaderAndIsr updates. I doubt the first round of updates is needed in the case that the leader doesn't need changing. 

For example, suppose we have the following reassignment state: 

replicas=[1,2,3,4], adding=[4], removing=[1], isr=[1,2,3,4], leader=2, epoch=10

First the controller will bump the epoch with the target replica set, which will result in a round of LeaderAndIsr requests to the target replica set with the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3,4], leader=2, epoch=11 

Immediately following this, the controller will bump the epoch again and remove the unneeded replica. This will result in another round of LeaderAndIsr requests with the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3], leader=2, epoch=12 

The first round of LeaderAndIsr updates puzzles me a bit. It is justified in the code with this comment: 

{code} 
B3. Send a LeaderAndIsr request with RS = TRS. This will prevent the leader from adding any replica in TRS - ORS back in the isr. 
{code} 
(I think the comment is backwards. It should be ORS (original replica set) - TRS (target replica set).) 

It sounds like we are trying to prevent a member of ORS from being added back to the ISR, but even if it did get added, it would be removed in the next step anyway. In the uncommon case that an ORS replica is out of sync, there does not seem to be any benefit to this first update since it is basically paying the cost of one write in order to save the speculative cost of one write. Additionally, it would be useful if the protocol could enforce the invariant that the ISR is always a subset of the replica set.


> Unnecessary LeaderAndIsr update following reassignment completion
> -----------------------------------------------------------------
>
>                 Key: KAFKA-9484
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9484
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>
> Following the completion of the reassignment, the controller executes two steps: first, it elects a new leader (if needed) and sends a LeaderAndIsr update (in any case) with the new target replica set; second, it removes unneeded replicas from the replica set and sends another round of LeaderAndIsr updates. I doubt the first round of updates is needed in the case that the leader doesn't need changing. 
> For example, suppose we have the following reassignment state: 
> replicas=[1,2,3,4], adding=[4], removing=[1], isr=[1,2,3,4], leader=2, epoch=10
> First the controller will bump the epoch with the target replica set, which will result in a round of LeaderAndIsr requests to the target replica set with the following state: 
> replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3,4], leader=2, epoch=11 
> Immediately following this, the controller will bump the epoch again and remove the unneeded replica. This will result in another round of LeaderAndIsr requests with the following state: 
> replicas=[2,3,4], adding=[], removing=[], isr=[2,3,4], leader=2, epoch=12 
> The first round of LeaderAndIsr updates puzzles me a bit. It is justified in the code with this comment: 
> {code} 
> B3. Send a LeaderAndIsr request with RS = TRS. This will prevent the leader from adding any replica in TRS - ORS back in the isr. 
> {code} 
> (I think the comment is backwards. It should be ORS (original replica set) - TRS (target replica set).) 
> It sounds like we are trying to prevent a member of ORS from being added back to the ISR, but even if it did get added, it would be removed in the next step anyway. In the uncommon case that an ORS replica is out of sync, there does not seem to be any benefit to this first update since it is basically paying the cost of one write in order to save the speculative cost of one write. Additionally, it would be useful if the protocol could enforce the invariant that the ISR is always a subset of the replica set.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)