You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Jason Gustafson (Jira)" <ji...@apache.org> on 2019/12/23 23:52:00 UTC
[jira] [Resolved] (KAFKA-9307) Transaction coordinator could be left in unknown state after ZK session timeout

     [ https://issues.apache.org/jira/browse/KAFKA-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson resolved KAFKA-9307.
------------------------------------
    Resolution: Fixed

> Transaction coordinator could be left in unknown state after ZK session timeout
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-9307
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9307
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1
>            Reporter: Dhruvil Shah
>            Assignee: Dhruvil Shah
>            Priority: Major
>             Fix For: 2.2.3, 2.3.2, 2.4.1
>
>
> We observed a case where the transaction coordinator could not load transaction state from __transaction-state topic partition. Clients would continue seeing COORDINATOR_LOAD_IN_PROGRESS exceptions until the broker hosting the coordinator is restarted.
> This is the sequence of events that leads to the issue:
>  # The broker is the leader of one (or more) transaction state topic partitions.
>  # The broker loses its ZK session due to a network issue.
>  # Broker reestablishes session with ZK, though there are still transient network issues.
>  # Broker is made follower of the transaction state topic partition it was leading earlier.
>  # During the become-follower transition, the broker loses its ZK session again.
>  # The become-follower transition for this broker fails in-between, leaving us in a partial leader / partial follower state for the transaction topic. This meant that we could not unload the transaction metadata. However, the broker successfully caches the leader epoch of associated with the LeaderAndIsrRequest.
>  # Later, when the ZK session is finally established successfully, the broker ignores the become-follower transition as the leader epoch was same as the one it had cached. This prevented the transaction metadata from being unloaded.
>  # Because this partition was a partial follower, we had setup replica fetchers. The partition continued to fetch from the leader until it was made part of the ISR.
>  # Once it was part of the ISR, preferred leader election kicked in and elected this broker as the leader.
>  # When processing the become-leader transition, the transaction state load operation failed as we already had transaction metadata loaded at a previous epoch.
>  # This meant that this partition was left in the "loading" state and we thus returned COORDINATOR_LOAD_IN_PROGRESS errors.
> Restarting the broker that hosts the transaction state coordinator is the only way to recover from this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)