You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Lucian Ilie (Jira)" <ji...@apache.org> on 2023/10/16 08:03:00 UTC

[jira] [Resolved] (KAFKA-15572) Race condition between log roll and log rename

     [ https://issues.apache.org/jira/browse/KAFKA-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lucian Ilie resolved KAFKA-15572.
---------------------------------
    Resolution: Fixed

> Race condition between log roll and log rename
> ----------------------------------------------
>
>                 Key: KAFKA-15572
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15572
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.0.0
>            Reporter: Lucian Ilie
>            Priority: Major
>         Attachments: kafka-alter-log-dir-nosuchfileexception.log
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> We are using a Kafka cluster with Zookeeper, deployed on top of Kubernetes, using banzaicloud/koperator.
> We have multiple disks per broker.
> We are using Cruise Control remove disk operation in order to aggregate multiple smaller disks into a single bigger disk. This CC operation is calling Kafka admin with alter replica log dirs operation.
> During this operation, *the flush operation fails apparently randomly with NoSuchFileException, while alterReplicaLogDirs is executed*. Attached a sample of logs for the exception and the previous operations taking place.
> Will further detail the cause of this issue.
> Say we have 3 brokers:
>  * broker 101 with disks /kafka-logs1/kafka, /kafka-logs2/kafka and a bigger disk /new-kafka-logs1/kafka
>  * broker 201 with same disks
>  * broker 301 with same disks
> When Cruise Control executes a remove disk operation, it calls Kafka "adminClient.alterReplicaLogDirs(replicaAssignment)" with such an assignment as to move all data from /kafka-logs1/kafka and /kafka-logs2/kafka to /new-kafka-logs1/kafka.
> During the alter log dir operation, future logs are created (to move data from e.g. "/kafka-logs1/kafka/topic-partition" to "/new-kafka-logs1/kafka/topic-partition.hash-future"), data is moved and finally the log dir will be renamed from "/new-kafka-logs1/kafka/topic-partition.hash-future" to "/new-kafka-logs1/kafka/topic-partition". This operation is started in [UnifiedLog.renameDir|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/UnifiedLog.scala#L713] and is locked using the [UnifiedLog lock|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/UnifiedLog.scala#L268]. The rename is then delegated to [LocalLog.renameDir|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/LocalLog.scala#L111-L113]. This is the 1st code part that is involved in the race condition.
> Meanwhile, log dirs can be rolled based on known conditions (e.g. getting full), which will call [UnifiedLog.roll|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/UnifiedLog.scala#L1547], which is locked using the [UnifiedLog lock|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/UnifiedLog.scala#L268]. However, the further delegation to UnifiedLog.flushUptoOffsetExclusive is not sharing that lock, since it is [done as a scheduled task in a separate thread|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/UnifiedLog.scala#L1547]. This means that further operations are [not locked at UnifiedLog level|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/UnifiedLog.scala#L268]. The operation is further delegated to [LocalLog.flush|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/LocalLog.scala#L177], which will also try to [flush the log dir|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/LocalLog.scala#L177]. This is the 2nd code part that is involved in the race condition.
> Since the log dir flush does not share the lock with the rename dir operation (as it is scheduled via the scheduler), the rename dir operation might succeed in moving the log dir on disk to "topic-partition", but the LocalLog._dir will remain set to "topic-partition.hash-future", and when the flush will attempt to flush the "topic-partition.hash-future" directory, it will throw NoSuchFileException: "topic-partition.hash-future". Basically, [this line|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/LocalLog.scala#L111] might succeed, and before [this other line|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/LocalLog.scala#L111] is executed, flush tries to [flush the future dir|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/LocalLog.scala#L177].
> We tested a fix with a patch on Kafka 3.4.1, on our clusters and it solved the issue by synchronizing the flush dir operation. Will reply with a link to a PR.
> Note that this bug replicates for every version since 3.0.0, caused by [this commit|https://github.com/apache/kafka/commit/db3e5e2c0de367ffcfe4078359d6d208ba722581#diff-eeafed82ed6a8600c397b108787fdf31e03191b0a192774a65c127d0d26edc44R2341] when flush dir was added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)