Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2020/10/06 00:55:09 UTC

[GitHub] [kafka] gardnervickers commented on pull request #7929: KAFKA-9393: DeleteRecords may cause extreme lock contention for large partition directories

gardnervickers commented on pull request #7929:
URL: https://github.com/apache/kafka/pull/7929#issuecomment-703966954


   @junrao I was able to get some more time to pick this back up; apologies for the long gap since my last update.
   
   In the most recent commit, I've moved away from tracking producer state snapshot files on the segment and instead maintain an in-memory cache using the ProducerStateManager.
   
   There were a few reasons for doing this. First, it was awkward to track the producer state snapshot file we emit during clean shutdown, since it does not really have an associated segment file. Second, the producer state truncation/restore logic would need to move into the `Log`, since a complete view of all producer state snapshot files is required; this is further complicated by the way we handle corrupt snapshot files. Lastly, because we use a "fake" `ProducerStateManager` during segment recovery, we would have to duplicate much of the logic that exists today to handle truncation/loading outside of the segment lifecycle.
   
   Instead, I opted for an approach similar to the way the `Log` manages segments, using a `ConcurrentNavigableMap` to track snapshot files in memory. As a result, the logic for truncation and restore remains largely the same, but instead of scanning the log directory on every operation we query the in-memory map. Deletions are handled the same way as segment deletions: the snapshot file is deleted asynchronously along with the segment file. Because we scan the log directory at startup for "stray" snapshot files, it is unnecessary to rename snapshot files pending deletion with the `.deleted` suffix.
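   To illustrate the idea, here is a minimal sketch of such an in-memory snapshot cache. All names (`SnapshotFile`, `SnapshotCache`, the method names) are hypothetical stand-ins rather than the PR's actual classes, and it assumes Scala 2.13 collection converters:
   
   ```scala
   import java.io.File
   import java.util.concurrent.ConcurrentSkipListMap
   import scala.jdk.CollectionConverters._
   
   // Hypothetical wrapper for a producer state snapshot file on disk.
   case class SnapshotFile(file: File, offset: Long)
   
   class SnapshotCache {
     // Keyed by snapshot offset, mirroring how Log keeps segments in a
     // ConcurrentNavigableMap keyed by base offset.
     private val snapshots = new ConcurrentSkipListMap[Long, SnapshotFile]()
   
     // Populate the cache by scanning the log directory once (e.g. at startup),
     // so later truncate/restore operations consult memory instead of disk.
     def loadFromDir(dir: File): Unit = {
       val files = Option(dir.listFiles((_, name) => name.endsWith(".snapshot")))
         .getOrElse(Array.empty[File])
       files.foreach { f =>
         val offset = f.getName.stripSuffix(".snapshot").toLong
         snapshots.put(offset, SnapshotFile(f, offset))
       }
     }
   
     // Latest snapshot at or below the given offset, used when restoring state.
     def latestSnapshotAtOrBefore(offset: Long): Option[SnapshotFile] =
       Option(snapshots.floorEntry(offset)).map(_.getValue)
   
     // Drop entries strictly below the given offset; the files themselves are
     // deleted asynchronously alongside the corresponding segment files.
     def removeBefore(offset: Long): Seq[SnapshotFile] = {
       val head = snapshots.headMap(offset, false)
       val removed = head.values.asScala.toList
       head.clear()
       removed
     }
   }
   ```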
   
   This approach has two downsides which I think are relatively minor.
   
   1. When a broker shuts down cleanly and emits a snapshot file, that snapshot file is considered "stray" on the next broker startup. While we clean up all "stray" snapshot files except the most recent one, the most recent file sticks around until the following broker restart, at which point it is deleted.
   2. Because we construct a temporary `ProducerStateManager` during segment recovery, and it may delete/create some snapshot files, we need to re-scan the log directory for snapshot files after segment loading has completed but before we load producer state. This ensures that the in-memory mapping for the "real" `ProducerStateManager` is up to date (see the sketch after this list).
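   To make that ordering concrete, here is a minimal sketch building on the hypothetical `SnapshotCache` above; `recoverSegments` and `loadProducerState` are illustrative stand-ins, not the PR's actual methods:
   
   ```scala
   import java.io.File
   
   object LogLoadingSketch {
     // Stand-ins for the real recovery and producer-state-loading steps.
     def recoverSegments(dir: File): Unit = ()              // may create/delete snapshot files
     def loadProducerState(cache: SnapshotCache): Unit = ()
   
     def loadLog(dir: File, producerStateCache: SnapshotCache): Unit = {
       // Segment recovery can run with a temporary ProducerStateManager that
       // creates or deletes snapshot files as a side effect.
       recoverSegments(dir)
   
       // Re-scan the log directory so the "real" manager's in-memory map
       // matches whatever recovery left on disk...
       producerStateCache.loadFromDir(dir)
   
       // ...before producer state is loaded from those snapshots.
       loadProducerState(producerStateCache)
     }
   }
   ```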
   
   Snapshot file deletion is triggered via `Log.deleteSegmentFiles` when segments are deleted due to retention. When swapping log segments into the log (as with compaction), there is the additional constraint that we do not want to delete snapshot files based purely on the base offset of the deleted segment file. To handle this case, we check whether the segment file being deleted as part of the swap has a counterpart new segment being swapped in; if it does, we do not delete the snapshot file.
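   As a rough illustration of that check (again with hypothetical names; the real change lives in the `Log` segment-swap path):
   
   ```scala
   object SnapshotRetentionSketch {
     // Hypothetical minimal segment representation; only the base offset matters here.
     case class Segment(baseOffset: Long)
   
     // Offsets whose snapshot files can be deleted along with the old segments.
     // If a new segment with the same base offset is being swapped in (e.g. after
     // compaction), that offset is still live, so its snapshot file is kept.
     def snapshotOffsetsToDelete(deletedSegments: Seq[Segment],
                                 swappedInSegments: Seq[Segment]): Seq[Long] = {
       val retained = swappedInSegments.map(_.baseOffset).toSet
       deletedSegments.map(_.baseOffset).filterNot(retained.contains)
     }
   }
   ```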


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org