Posted to dev@kafka.apache.org by "Luke Chen (Jira)" <ji...@apache.org> on 2023/10/19 03:31:00 UTC
[jira] [Resolved] (KAFKA-15620) Duplicate remote log DELETE_SEGMENT metadata is generated when there are multiple leader epochs in the segment
[ https://issues.apache.org/jira/browse/KAFKA-15620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Chen resolved KAFKA-15620.
-------------------------------
Fix Version/s: 3.7.0
Resolution: Duplicate
> Duplicate remote log DELETE_SEGMENT metadata is generated when there are multiple leader epochs in the segment
> --------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-15620
> URL: https://issues.apache.org/jira/browse/KAFKA-15620
> Project: Kafka
> Issue Type: Bug
> Components: log
> Affects Versions: 3.6.0
> Reporter: Henry Cai
> Priority: Major
> Fix For: 3.7.0, 3.6.1
>
>
> Use the newly released 3.6.0, turn on tiered storage feature:
> {*}remote.log.storage.system.enable{*}=true
> 1. Set up topic tier5 with remote storage enabled. Added some data to the topic, and the data was copied to remote storage. After a few days, when the log segment was removed from remote storage due to log retention expiration, we noticed the following warnings in the server log:
> [2023-10-16 22:19:10,971] DEBUG Updating remote log segment metadata: [RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId
> {topicIdPartition=QelVeVmER5CkjrzIiF07PQ:tier5-0, id=YFNCaWjPQFSKCngQ1QcKpA}
> , customMetadata=Optional.empty, state=DELETE_SEGMENT_STARTED, eventTimestampMs=1697005926358, brokerId=1043}] (org.apache.kafka.server.log.remote.metadata.storage.RemoteLogMetadataCache)
> [2023-10-16 22:19:10,971] WARN Error occurred while updating the remote log segment. (org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore)
> org.apache.kafka.server.log.remote.storage.RemoteResourceNotFoundException: No remote log segment metadata found for :RemoteLogSegmentId{topicIdPartition=QelVeVmER5CkjrzIiF07PQ:tier5-0, id=YFNCaWjPQFSKCngQ1QcKpA}
> at org.apache.kafka.server.log.remote.metadata.storage.RemoteLogMetadataCache.updateRemoteLogSegmentMetadata(RemoteLogMetadataCache.java:161)
> at org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.handleRemoteLogSegmentMetadataUpdate(RemotePartitionMetadataStore.java:89)
> at org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataEventHandler.handleRemoteLogMetadata(RemotePartitionMetadataEventHandler.java:33)
> at org.apache.kafka.server.log.remote.metadata.storage.ConsumerTask.processConsumerRecord(ConsumerTask.java:157)
> at org.apache.kafka.server.log.remote.metadata.storage.ConsumerTask.run(ConsumerTask.java:133)
> at java.base/java.lang.Thread.run(Thread.java:829)
>
> 2. After some debugging, we realized the problem: *there are 2 sets of DELETE_SEGMENT_STARTED/FINISHED pairs* in the remote metadata topic for this segment. The DELETE_SEGMENT_FINISHED in the first set removes the segment from the metadata cache, which causes the above exception when the DELETE_SEGMENT_STARTED from the second set is processed.
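>
> For illustration only: a consumer-side cache could tolerate such duplicates by skipping delete metadata for ids it no longer tracks instead of raising RemoteResourceNotFoundException. This is a minimal sketch, not the actual RemoteLogMetadataCache API; the class and method names here are hypothetical:
> ```java
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical simplified metadata cache; not the real RemoteLogMetadataCache.
> public class TolerantMetadataCache {
>     private final Map<String, String> idToState = new HashMap<>();
>
>     public void addCopied(String segmentId) {
>         idToState.put(segmentId, "COPY_SEGMENT_FINISHED");
>     }
>
>     // Returns true if the update was applied. A DELETE_* event for an id
>     // already removed by an earlier DELETE_SEGMENT_FINISHED is skipped
>     // quietly instead of surfacing as an exception in the log.
>     public boolean update(String segmentId, String newState) {
>         if (!idToState.containsKey(segmentId)) {
>             return false; // duplicate delete metadata: segment already gone
>         }
>         if ("DELETE_SEGMENT_FINISHED".equals(newState)) {
>             idToState.remove(segmentId);
>         } else {
>             idToState.put(segmentId, newState);
>         }
>         return true;
>     }
>
>     public static void main(String[] args) {
>         TolerantMetadataCache cache = new TolerantMetadataCache();
>         cache.addCopied("YFNCaWjPQFSKCngQ1QcKpA");
>         // First DELETE_SEGMENT_STARTED/FINISHED pair removes the segment.
>         cache.update("YFNCaWjPQFSKCngQ1QcKpA", "DELETE_SEGMENT_STARTED");
>         cache.update("YFNCaWjPQFSKCngQ1QcKpA", "DELETE_SEGMENT_FINISHED");
>         // The second pair finds nothing to update and is ignored, not thrown.
>         System.out.println(cache.update("YFNCaWjPQFSKCngQ1QcKpA", "DELETE_SEGMENT_STARTED")); // false
>     }
> }
> ```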
>
> 3. We traced the log to where log retention kicked in and saw *two delete log segment events generated* at that time:
> ```
> [2023-10-10 23:32:05,929] INFO [RemoteLogManager=1043 partition=QelVeVmER5CkjrzIiF07PQ:tier5-0] About to delete remote log segment RemoteLogSegmentId{topicIdPartition=QelVeVmER5CkjrzIiF07PQ:tier5-0, id=YFNCaWjPQFSKCngQ1QcKpA} due to retention time 604800000ms breach based on the largest record timestamp in the segment (kafka.log.remote.RemoteLogManager$RLMTask)
> [2023-10-10 23:32:05,929] INFO [RemoteLogManager=1043 partition=QelVeVmER5CkjrzIiF07PQ:tier5-0] About to delete remote log segment RemoteLogSegmentId{topicIdPartition=QelVeVmER5CkjrzIiF07PQ:tier5-0, id=YFNCaWjPQFSKCngQ1QcKpA} due to retention time 604800000ms breach based on the largest record timestamp in the segment (kafka.log.remote.RemoteLogManager$RLMTask)
> ```
> 4. And dumped out the content of the original COPY_SEGMENT_STARTED for this segment (which triggers the generation of the later DELETE_SEGMENT metadata):
> [2023-10-16 22:19:10,894] DEBUG Adding to in-progress state: [RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId
> {topicIdPartition=QelVeVmER5CkjrzIiF07PQ:tier5-0, id=YFNCaWjPQFSKCngQ1QcKpA}
> , startOffset=6387830, endOffset=9578660, brokerId=1043, maxTimestampMs=1696401123036, eventTimestampMs=1696401127062, segmentLeaderEpochs={5=6387830, 6=6721329}, segmentSizeInBytes=511987531, customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}] (org.apache.kafka.server.log.remote.metadata.storage.RemoteLogMetadataCache)
>
> You can see there are 2 leader epochs in this segment: *segmentLeaderEpochs={5=6387830, 6=6721329}*
>
> 5. From the remote log retention code (https://github.com/apache/kafka/blob/3.6.0/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L987):
> It buckets segments by leader epoch first and then loops through the epochs.
> I am not sure whether it should generate one or two DELETE_SEGMENT events for this COPY_SEGMENT_STARTED segment. If it does need to generate two DELETE_SEGMENT metadata records, the consumer task needs to handle that duplicate metadata gracefully (without throwing exceptions in the log).
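>
> To illustrate the first option: the duplicate could be avoided by deduplicating on the segment id before emitting DELETE_SEGMENT events. This is a sketch under the assumption that a segment spanning several epochs appears in every matching epoch bucket; the names are hypothetical, not the actual RemoteLogManager code:
> ```java
> import java.util.*;
>
> // Hypothetical sketch, not the actual RemoteLogManager retention code.
> public class SegmentDedupSketch {
>     // Epoch -> segment ids in that epoch's bucket. A segment covering
>     // epochs {5, 6} appears in both buckets, so an epoch-by-epoch loop
>     // would visit it twice without deduplication.
>     static List<String> segmentsToDelete(Map<Integer, List<String>> idsByEpoch) {
>         Set<String> seen = new LinkedHashSet<>(); // dedup across epoch buckets
>         for (List<String> bucket : idsByEpoch.values()) {
>             seen.addAll(bucket); // second occurrence of the same id is a no-op
>         }
>         return new ArrayList<>(seen);
>     }
>
>     public static void main(String[] args) {
>         Map<Integer, List<String>> byEpoch = new TreeMap<>();
>         byEpoch.put(5, List.of("YFNCaWjPQFSKCngQ1QcKpA"));
>         byEpoch.put(6, List.of("YFNCaWjPQFSKCngQ1QcKpA")); // same segment, second epoch
>         // Only one DELETE_SEGMENT would be emitted per segment id.
>         System.out.println(segmentsToDelete(byEpoch)); // [YFNCaWjPQFSKCngQ1QcKpA]
>     }
> }
> ```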
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)