Posted to jira@kafka.apache.org by "Jason Gustafson (Jira)" <ji...@apache.org> on 2022/11/17 00:42:00 UTC

[jira] [Created] (KAFKA-14397) Idempotent producer may bump epoch and reset sequence numbers prematurely

Jason Gustafson created KAFKA-14397:
---------------------------------------

             Summary: Idempotent producer may bump epoch and reset sequence numbers prematurely
                 Key: KAFKA-14397
                 URL: https://issues.apache.org/jira/browse/KAFKA-14397
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


Suppose that idempotence is enabled in the producer and we send the following single-record batches to a partition leader:
 * A: epoch=0, seq=0
 * B: epoch=0, seq=1
 * C: epoch=0, seq=2

The partition leader receives all 3 of these batches and commits them to the log. However, the connection is lost before the `Produce` responses are received by the client. Subsequent retries by the producer all fail to be delivered.

It is possible in this scenario for the first batch `A` to reach the delivery timeout before the subsequent batches. This triggers the following check: [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L642]. Depending on whether retries are exhausted, we may adjust sequence numbers.

The intuition behind this check is that if retries have not been exhausted, then we saw a fatal error and the batch could not have been written to the log. Hence we should bump the epoch and adjust the sequence numbers of the pending batches since they are presumed to be doomed to failure. So in this case, batches B and C might get reset with the bumped epoch:
 * B: epoch=1, seq=0
 * C: epoch=1, seq=1
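
The reset described above can be sketched in a few lines. This is only an illustration of the renumbering, not the client's actual code: `PendingBatch`, `SequenceResetSketch`, and `resetAfterFailure` are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a pending batch; not a Kafka client class. */
final class PendingBatch {
    final String name;
    final int epoch;
    final int seq;

    PendingBatch(String name, int epoch, int seq) {
        this.name = name;
        this.epoch = epoch;
        this.seq = seq;
    }

    @Override
    public String toString() {
        return name + ": epoch=" + epoch + ", seq=" + seq;
    }
}

final class SequenceResetSketch {
    /** Renumbers the batches that are still pending under a bumped epoch, restarting at sequence 0. */
    static List<PendingBatch> resetAfterFailure(List<PendingBatch> stillPending, int currentEpoch) {
        int bumpedEpoch = currentEpoch + 1;
        int nextSeq = 0;
        List<PendingBatch> reset = new ArrayList<>();
        for (PendingBatch batch : stillPending) {
            reset.add(new PendingBatch(batch.name, bumpedEpoch, nextSeq++));
        }
        return reset;
    }

    public static void main(String[] args) {
        // A has been failed by the delivery timeout; B and C are still pending.
        List<PendingBatch> pending = new ArrayList<>();
        pending.add(new PendingBatch("B", 0, 1));
        pending.add(new PendingBatch("C", 0, 2));
        for (PendingBatch batch : resetAfterFailure(pending, 0)) {
            System.out.println(batch);
        }
    }
}
```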

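This can result in duplicate records in the log: B and C were already committed under epoch 0, and the resent copies carry a new epoch and fresh sequence numbers, so the leader has no way to recognize them as duplicates.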
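
To see how the duplicates arise, here is a toy model of the leader's sequence validation for a single producer id. It is a simplification under stated assumptions, not broker code: `ToyPartitionLog` is a hypothetical class, and it assumes the leader accepts either the next sequence number in the current epoch or sequence 0 in a newer epoch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-partition sequence validation for one producer id; not broker code. */
final class ToyPartitionLog {
    private final List<String> records = new ArrayList<>();
    private int lastEpoch = -1;
    private int lastSeq = -1;

    /** Returns true if the record was appended, false if it was not (duplicate, out of order, or stale epoch). */
    boolean append(String payload, int epoch, int seq) {
        boolean startsNewEpoch = epoch > lastEpoch && seq == 0;
        boolean nextInSequence = epoch == lastEpoch && seq == lastSeq + 1;
        if (!startsNewEpoch && !nextInSequence) {
            return false;
        }
        records.add(payload);
        lastEpoch = epoch;
        lastSeq = seq;
        return true;
    }

    List<String> records() {
        return records;
    }

    public static void main(String[] args) {
        ToyPartitionLog log = new ToyPartitionLog();
        log.append("A", 0, 0);
        log.append("B", 0, 1);
        log.append("C", 0, 2);

        // A retry that keeps its original epoch and sequence is not appended again.
        System.out.println(log.append("B", 0, 1));

        // After the premature reset, B and C look like brand-new writes.
        System.out.println(log.append("B", 1, 0));
        System.out.println(log.append("C", 1, 1));
        System.out.println(log.records());
    }
}
```

In this model the final log contains B and C twice, once under each epoch.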

The root of the issue is that this logic does not account for expiration of the delivery timeout. When the delivery timeout is reached, the number of retries is still likely much lower than the max allowed number of retries (which is `Int.MaxValue` by default).
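
A rough sketch of why the check misfires at the delivery timeout follows. `RetryCheckSketch` and its methods are hypothetical and only paraphrase the description above; the attempt count is made up for illustration, and the 120000 ms value is the default `delivery.timeout.ms`.

```java
final class RetryCheckSketch {
    /** Paraphrase of the check: remaining retries are read as "the batch cannot be in the log". */
    static boolean shouldAdjustSequences(int attempts, int maxRetries) {
        return attempts < maxRetries;
    }

    /** True once the batch has been pending for at least the delivery timeout. */
    static boolean deliveryTimedOut(long createdMs, long nowMs, long deliveryTimeoutMs) {
        return nowMs - createdMs >= deliveryTimeoutMs;
    }

    public static void main(String[] args) {
        int maxRetries = Integer.MAX_VALUE; // default `retries`
        long deliveryTimeoutMs = 120_000L;  // default `delivery.timeout.ms`
        int attempts = 12;                  // illustrative: a handful of failed retries

        boolean expired = deliveryTimedOut(0L, 120_000L, deliveryTimeoutMs);
        boolean adjust = shouldAdjustSequences(attempts, maxRetries);

        // The batch expired by timeout, yet the retry check alone still says "adjust".
        System.out.println("expired=" + expired + ", adjust=" + adjust);
    }
}
```

The point of the sketch is that the retry count carries no information about whether the batch reached the log when expiration is driven by the timeout.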


