Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/09/19 14:46:00 UTC
[jira] [Created] (HUDI-4878) Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
sivabalan narayanan created HUDI-4878:
-----------------------------------------
Summary: Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
Key: HUDI-4878
URL: https://issues.apache.org/jira/browse/HUDI-4878
Project: Apache Hudi
Issue Type: Improvement
Components: cleaning
Reporter: sivabalan narayanan
Cleaning based on LATEST_FILE_VERSIONS can be improved further, since incremental cleaning is not enabled for it. Let's see if we can improve this.
context from author:
Currently, incremental cleaning runs for both the KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS policies. It does not run for KEEP_LATEST_FILE_VERSIONS.
This can lead to files not being cleaned (see the scenario below). This PR fixes the problem by enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.
Here is the scenario of the problem:
Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):
{{partition-A/
commit-0.parquet
commit-1.parquet
commit-2.parquet
partition-B/
commit-3.parquet}}
Say we use KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3. The cleaner will remove commit-0.parquet so that only the 3 latest commits (commit-1, commit-2, commit-3) are kept.
For the next cleaning, incremental cleaning will trigger, and it won't consider partition-A/ until a new commit changes it. If no later commit changes partition-A, then commit-1.parquet will stay forever, even though later commits to other partitions push it outside the retained window and the cleaner should remove it.
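The scenario above can be sketched with a small toy model. This is not Hudi's cleaner: the function `clean_keep_latest_commits`, the dict-of-lists file layout, and the rule "drop every file written before the earliest retained commit" are simplifying assumptions made only to mirror the description in this message.

```python
def clean_keep_latest_commits(files, timeline, commits_retained, partitions):
    """Toy KEEP_LATEST_COMMITS (hypothetical helper, not a Hudi API).

    In each scanned partition, remove files written before the earliest
    of the last `commits_retained` commits on the timeline.
    """
    earliest_retained = sorted(timeline)[-commits_retained]
    removed = []
    for partition in partitions:
        removed += [(partition, c) for c in files[partition] if c < earliest_retained]
        files[partition] = [c for c in files[partition] if c >= earliest_retained]
    return removed


# Integers stand in for commit-N.parquet.
kc_files = {"partition-A": [0, 1, 2], "partition-B": [3]}
kc_timeline = [0, 1, 2, 3]

# First clean: full partition scan, commit-0 is removed.
kc_first = clean_keep_latest_commits(kc_files, kc_timeline, 3, list(kc_files))

# A new commit touches only partition-B.
kc_files["partition-B"].append(4)
kc_timeline.append(4)

# Incremental clean: only the partition changed since the last clean is scanned.
kc_incremental = clean_keep_latest_commits(kc_files, kc_timeline, 3, ["partition-B"])
kc_partition_a_after_incremental = list(kc_files["partition-A"])

# A full scan at the same point would have found the stale file.
kc_full = clean_keep_latest_commits(kc_files, kc_timeline, 3, list(kc_files))

print(kc_first, kc_incremental, kc_partition_a_after_incremental, kc_full)
```

In this model the incremental pass removes nothing and leaves commit-1 in partition-A, while a full scan at the same point removes it, which is the gap described above.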
Now, in the case of KEEP_LATEST_FILE_VERSIONS, the cleaner will keep only commit-2.parquet in partition-A. It then makes sense that incremental cleaning won't consider partition-A until it is changed, because only one file version is left there.
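The same kind of toy model illustrates why skipping unchanged partitions is harmless under KEEP_LATEST_FILE_VERSIONS. Again, this is only a sketch under assumptions: `clean_keep_latest_file_versions` is a hypothetical helper, each partition is treated as a single file group, and one retained version is assumed.

```python
import copy


def clean_keep_latest_file_versions(files, versions_retained, partitions):
    """Toy KEEP_LATEST_FILE_VERSIONS (hypothetical helper, not a Hudi API).

    In each scanned partition, keep only the newest `versions_retained`
    file versions. The result does not depend on the global timeline.
    """
    removed = []
    for partition in partitions:
        versions = sorted(files[partition])
        cut = len(versions) - versions_retained
        if cut > 0:
            removed += [(partition, c) for c in versions[:cut]]
            files[partition] = versions[cut:]
    return removed


fv_files = {"partition-A": [0, 1, 2], "partition-B": [3]}

# First clean: full partition scan, only commit-2 survives in partition-A.
fv_first = clean_keep_latest_file_versions(fv_files, 1, list(fv_files))

# A new commit touches only partition-B.
fv_files["partition-B"].append(4)

# Compare an incremental pass (changed partition only) with a full scan.
fv_incremental_state = copy.deepcopy(fv_files)
fv_full_state = copy.deepcopy(fv_files)
fv_incremental = clean_keep_latest_file_versions(fv_incremental_state, 1, ["partition-B"])
fv_full = clean_keep_latest_file_versions(fv_full_state, 1, list(fv_full_state))

print(fv_incremental_state == fv_full_state)
```

Because what is kept in a partition depends only on that partition's own versions, the incremental pass and the full scan end in the same state in this model.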
This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS.
Hope this is clear enough