Posted to dev@kafka.apache.org by "Guozhang Wang (Jira)" <ji...@apache.org> on 2020/10/15 04:42:00 UTC

[jira] [Created] (KAFKA-10614) Group coordinator onElection/onResignation should guard against leader epoch

Guozhang Wang created KAFKA-10614:
-------------------------------------

             Summary: Group coordinator onElection/onResignation should guard against leader epoch
                 Key: KAFKA-10614
                 URL: https://issues.apache.org/jira/browse/KAFKA-10614
             Project: Kafka
          Issue Type: Bug
          Components: core
            Reporter: Guozhang Wang


When a sequence of LeaderAndISR or StopReplica requests sent by different controllers causes the group coordinator to be elected or to resign, the resulting events may be re-ordered due to a race condition. For example:

1) A first LeaderAndISR request is received from the old controller, telling the broker to resign as the group coordinator.
2) A second LeaderAndISR request is received from the new controller, telling the broker to become the group coordinator.
3) Although the threads handling requests 1) and 2) are synchronized on the replica manager, their callback {{onLeadershipChange}} triggers {{onElection/onResignation}}, which schedule the loading / unloading on background threads that are not synchronized with each other.
4) As a result, {{onElection}} may be executed first by the background threads, followed by {{onResignation}}. The coordinator then does not recognize itself as the coordinator and hence responds to any coordinator request with {{NOT_COORDINATOR}}.
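
The effect of the reordering in steps 1) to 4) can be sketched with a small, deliberately simplified Java model. The class {{UnguardedCoordinator}} and its methods are hypothetical stand-ins, not Kafka's actual code; the sketch only assumes that loading marks a partition as owned and unloading clears it, with no check on the order in which the two jobs run.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for the coordinator's ownership state.
// It applies load / unload jobs in whatever order they happen to run.
final class UnguardedCoordinator {
    private final Set<Integer> ownedPartitions = new HashSet<>();

    // Models the background "load" job scheduled by onElection.
    synchronized void onElection(int partition) {
        ownedPartitions.add(partition);
    }

    // Models the background "unload" job scheduled by onResignation.
    synchronized void onResignation(int partition) {
        ownedPartitions.remove(partition);
    }

    synchronized boolean isCoordinatorFor(int partition) {
        return ownedPartitions.contains(partition);
    }

    public static void main(String[] args) {
        // Intended order: resign (old controller), then elect (new controller).
        UnguardedCoordinator inOrder = new UnguardedCoordinator();
        inOrder.onResignation(0);
        inOrder.onElection(0);
        System.out.println("in order:  " + inOrder.isCoordinatorFor(0));   // true

        // Reordered by the background threads: elect runs first, resign second.
        UnguardedCoordinator reordered = new UnguardedCoordinator();
        reordered.onElection(0);
        reordered.onResignation(0);
        System.out.println("reordered: " + reordered.isCoordinatorFor(0)); // false
    }
}
```

In the reordered case the broker is the leader of the partition but the model reports that it is not the coordinator, which corresponds to answering requests with {{NOT_COORDINATOR}}.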

Here are two proposals off the top of my head:

1) Let the scheduled load / unload functions keep the passed-in leader epoch, and also materialize the epoch in memory. Then, when executing the unloading, check against the leader epoch.

2) This may be a bit simpler: use a single background thread working on a FIFO queue of loading / unloading jobs. Since the callers are actually synchronized on the replica manager and their order is preserved, the enqueued loading / unloading jobs would be correctly ordered as well. In that case we would avoid the reordering.
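
Both proposals can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the actual fix: {{EpochGuardedCoordinator}}, {{isStale}} and {{runOnFifoQueue}} are hypothetical names, the guard is applied to both loading and unloading (the proposal only mentions unloading), and a job is treated as stale when its epoch is strictly lower than the last one applied. For proposal 2, the sketch relies on {{Executors.newSingleThreadExecutor()}}, which runs submitted tasks sequentially on one thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model; not Kafka's GroupCoordinator.
final class EpochGuardedCoordinator {
    // Proposal 1: last leader epoch applied per partition, kept in memory.
    private final Map<Integer, Integer> lastEpoch = new HashMap<>();
    private final Set<Integer> owned = new HashSet<>();

    // Returns false when the job carries an older epoch and is skipped.
    synchronized boolean onElection(int partition, int leaderEpoch) {
        if (isStale(partition, leaderEpoch)) return false;
        lastEpoch.put(partition, leaderEpoch);
        owned.add(partition);
        return true;
    }

    synchronized boolean onResignation(int partition, int leaderEpoch) {
        if (isStale(partition, leaderEpoch)) return false;
        lastEpoch.put(partition, leaderEpoch);
        owned.remove(partition);
        return true;
    }

    // Assumption: "stale" means strictly lower than the last applied epoch.
    private boolean isStale(int partition, int leaderEpoch) {
        Integer seen = lastEpoch.get(partition);
        return seen != null && leaderEpoch < seen;
    }

    synchronized boolean isCoordinatorFor(int partition) {
        return owned.contains(partition);
    }

    // Proposal 2: one background thread draining a FIFO queue of jobs.
    // Returns the jobs in the order in which they were executed.
    static List<String> runOnFifoQueue(List<String> jobs) {
        ExecutorService queue = Executors.newSingleThreadExecutor();
        List<String> executed = Collections.synchronizedList(new ArrayList<>());
        for (String job : jobs) {
            queue.execute(() -> executed.add(job));
        }
        queue.shutdown();
        try {
            if (!queue.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("jobs did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return executed;
    }

    public static void main(String[] args) {
        // Reordered execution: the election (epoch 6) runs before the
        // resignation (epoch 5); the stale resignation is skipped.
        EpochGuardedCoordinator c = new EpochGuardedCoordinator();
        c.onElection(0, 6);
        boolean applied = c.onResignation(0, 5);
        System.out.println("resignation applied: " + applied);            // false
        System.out.println("is coordinator: " + c.isCoordinatorFor(0));   // true

        // FIFO queue: execution order matches enqueue order.
        System.out.println(runOnFifoQueue(Arrays.asList("unload", "load"))); // [unload, load]
    }
}
```

The epoch guard tolerates out-of-order execution at the cost of tracking an epoch per partition, while the single-threaded queue avoids the reordering altogether but serializes all loading and unloading work.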


