Posted to jira@kafka.apache.org by "Stanislav Kozlovski (Jira)" <ji...@apache.org> on 2020/07/23 15:05:00 UTC
[jira] [Created] (KAFKA-10301) RemoteReplicasMap can be empty in certain race conditions
Stanislav Kozlovski created KAFKA-10301:
-------------------------------------------
Summary: RemoteReplicasMap can be empty in certain race conditions
Key: KAFKA-10301
URL: https://issues.apache.org/jira/browse/KAFKA-10301
Project: Kafka
Issue Type: Bug
Reporter: Stanislav Kozlovski
Assignee: Stanislav Kozlovski
In Partition#updateAssignmentAndIsr, we previously updated `partition#remoteReplicasMap` by adding the new replicas to the map and then removing the old ones ([source](https://github.com/apache/kafka/blob/7f9187fe399f3f6b041ca302bede2b3e780491e7/core/src/main/scala/kafka/cluster/Partition.scala#L657)).
During a recent refactoring, we changed it to first clear the map and then add all the replicas to it ([source](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/cluster/Partition.scala#L663)). As a result, the map is briefly empty during every update.
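To make the difference between the two update orders concrete, here is a minimal Java sketch. It is not the Kafka code (which is Scala): the class, the method names, the `String` replica values, and the `probe` callback are all hypothetical. The `probe` runs between the two steps of each update and stands in for a concurrent reader that looks at the map at the worst possible moment.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the replica map update; names are illustrative only.
class ReplicaMapUpdateSketch {

    // Previous behavior: add the new replicas first, then drop the ones
    // that are no longer assigned. `probe` runs between the two steps.
    static void addThenRemove(Map<Integer, String> replicas,
                              Set<Integer> assignment, Runnable probe) {
        for (Integer id : assignment) {
            replicas.putIfAbsent(id, "replica-" + id);
        }
        probe.run();
        replicas.keySet().removeIf(id -> !assignment.contains(id));
    }

    // Refactored behavior: clear the map, then re-add every replica.
    // `probe` runs while the map is empty.
    static void clearThenAdd(Map<Integer, String> replicas,
                             Set<Integer> assignment, Runnable probe) {
        replicas.clear();
        probe.run();
        for (Integer id : assignment) {
            replicas.put(id, "replica-" + id);
        }
    }

    // Returns the replica ids that the probe saw in the middle of an update
    // from assignment {1, 2} to assignment {2, 3}.
    static Set<Integer> observe(boolean clearFirst) {
        Map<Integer, String> replicas = new ConcurrentHashMap<>();
        replicas.put(1, "replica-1");
        replicas.put(2, "replica-2");
        Set<Integer> newAssignment = Set.of(2, 3);

        Set<Integer> seen = new TreeSet<>();
        Runnable probe = () -> seen.addAll(replicas.keySet());
        if (clearFirst) {
            clearThenAdd(replicas, newAssignment, probe);
        } else {
            addThenRemove(replicas, newAssignment, probe);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("add-then-remove, mid-update view: " + observe(false));
        System.out.println("clear-then-add, mid-update view:  " + observe(true));
    }
}
```

In this sketch, replica 2 stays in both the old and new assignment. With add-then-remove it remains visible throughout the update, whereas with clear-then-add the mid-update view is empty, so a reader looking for replica 2 at that moment would not find it.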
While this is done under a write lock (`inWriteLock(leaderIsrUpdateLock)`), not all callers that access the map take a lock. Some examples:
- Partition#updateFollowerFetchState
- DelayedDeleteRecords#tryComplete
- Partition#getReplicaOrException - called in `checkEnoughReplicasReachOffset` without a lock, which itself is called by DelayedProduce. I think this can fail a `ReplicaManager#appendRecords` call.
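The reason the write lock does not help these callers can be sketched as follows. This is a simplified Java model under stated assumptions, not Kafka's implementation: the class, the `IllegalStateException`, the `String` replica values, and the latch-based pause are hypothetical. The writer is paused in the middle of a clear-then-add update while it holds the write lock; one reader skips the lock and one takes the read lock.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of a partition whose writer locks but whose reader may not.
class UnlockedReaderSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<Integer, String> remoteReplicas = new ConcurrentHashMap<>();

    UnlockedReaderSketch(Set<Integer> initial) {
        initial.forEach(id -> remoteReplicas.put(id, "replica-" + id));
    }

    // Writer: clear-then-add under the write lock. `midUpdate` runs while
    // the map is empty and the write lock is still held.
    void updateAssignment(Set<Integer> assignment, Runnable midUpdate) {
        lock.writeLock().lock();
        try {
            remoteReplicas.clear();
            midUpdate.run();
            assignment.forEach(id -> remoteReplicas.put(id, "replica-" + id));
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader that takes no lock: it can observe the transiently empty map.
    String getReplicaOrException(int id) {
        String replica = remoteReplicas.get(id);
        if (replica == null) {
            throw new IllegalStateException("Replica " + id + " is not available");
        }
        return replica;
    }

    // Reader that takes the read lock: it waits for the writer to finish.
    String getReplicaOrExceptionLocked(int id) {
        lock.readLock().lock();
        try {
            return getReplicaOrException(id);
        } finally {
            lock.readLock().unlock();
        }
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    // Returns { result of the unlocked read, result of the locked read }.
    static String[] demo() {
        UnlockedReaderSketch partition = new UnlockedReaderSketch(Set.of(1, 2));
        CountDownLatch cleared = new CountDownLatch(1);
        CountDownLatch resume = new CountDownLatch(1);

        // The assignment does not change, yet replica 1 disappears mid-update.
        Thread writer = new Thread(() -> partition.updateAssignment(Set.of(1, 2), () -> {
            cleared.countDown();
            await(resume);
        }));
        writer.start();
        await(cleared); // the writer now holds the write lock and the map is empty

        String unlocked;
        try {
            unlocked = partition.getReplicaOrException(1);
        } catch (IllegalStateException e) {
            unlocked = "failed: " + e.getMessage();
        }

        resume.countDown(); // let the writer refill the map and release the lock
        String locked = partition.getReplicaOrExceptionLocked(1);

        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new String[] {unlocked, locked};
    }

    public static void main(String[] args) {
        String[] results = demo();
        System.out.println("unlocked read: " + results[0]);
        System.out.println("locked read:   " + results[1]);
    }
}
```

In this model the unlocked read fails even though replica 1 is assigned both before and after the update, while the locked read blocks until the writer is done and then succeeds. Taking the read lock in every caller is one possible fix; restoring the add-then-remove order avoids the empty window without changing the callers.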
While we want to polish the code so that this sort of race condition becomes harder (or impossible) to introduce, it sounds safest to revert to the previous behavior given the timeline for the 2.6 release. Jira X tracks that.