Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/09/19 14:46:00 UTC

[jira] [Created] (HUDI-4878) Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS

sivabalan narayanan created HUDI-4878:
-----------------------------------------

             Summary: Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
                 Key: HUDI-4878
                 URL: https://issues.apache.org/jira/browse/HUDI-4878
             Project: Apache Hudi
          Issue Type: Improvement
          Components: cleaning
            Reporter: sivabalan narayanan


Cleaning based on LATEST_FILE_VERSIONS can be improved further, since incremental clean is not enabled for it. Let's see if we can improve it.

Context from the author:
Currently, incremental cleaning runs for both the KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS policies. It does not run when the policy is KEEP_LATEST_FILE_VERSIONS.

This can leave files uncleaned. This PR fixes the problem by enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.

Here is the scenario of the problem:

Say we have 3 committed files in partition-A, we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):

    partition-A/
        commit-0.parquet
        commit-1.parquet
        commit-2.parquet
    partition-B/
        commit-3.parquet

Say the policy is KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3. The cleaner will remove commit-0.parquet so that 3 commits are kept.
On the next cleaning, incremental cleaning kicks in and won't consider partition-A/ until a new commit changes it. If no later commit changes partition-A, commit-1.parquet will stay forever, even though the cleaner should remove it.
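
The KEEP_LATEST_COMMITS scenario above can be sketched as a toy model. This is not Hudi's cleaner code: the function name, the integer commit ids, the extra commit 4, and the rule "drop files older than the earliest retained commit, but never a partition's newest file" are all assumptions made for illustration.

```python
import copy

def clean_keep_latest_commits(partitions, timeline, retained, scan):
    """Toy KEEP_LATEST_COMMITS (hypothetical helper, not Hudi's API).

    In each scanned partition, remove files written before the earliest
    retained commit, but never the newest file of that partition.
    """
    earliest_retained = sorted(timeline)[-retained:][0]
    removed = []
    for name in scan:
        commits = partitions[name]
        newest = max(commits)
        stale = [c for c in commits if c < earliest_retained and c != newest]
        partitions[name] = [c for c in commits if c not in stale]
        removed += [(name, c) for c in stale]
    return removed

# State from the scenario: commit ids stand in for commit-N.parquet.
partitions = {"partition-A": [0, 1, 2], "partition-B": [3]}
timeline = [0, 1, 2, 3]

# First clean: full partition scan.
first = clean_keep_latest_commits(partitions, timeline, 3, scan=list(partitions))

# Assumed follow-up: a later commit (4) touches only partition-B.
partitions["partition-B"].append(4)
timeline.append(4)

# Second clean is incremental: only partitions changed since the last clean.
second = clean_keep_latest_commits(partitions, timeline, 3, scan=["partition-B"])

# For comparison, what a full scan would remove at the same point in time.
full_scan = clean_keep_latest_commits(
    copy.deepcopy(partitions), timeline, 3, scan=list(partitions)
)

print(first, second, partitions["partition-A"], full_scan)
```

In this model the incremental pass never revisits partition-A, so the file from commit 1 survives, while a full scan at the same point would remove it.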

With KEEP_LATEST_FILE_VERSIONS, on the other hand, the cleaner keeps only commit-2.parquet. It then makes sense that incremental cleaning won't consider partition-A until it changes, because only one commit is left there.
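
The same toy setup can be used to check the KEEP_LATEST_FILE_VERSIONS case. Again, this is only a sketch under assumptions: the helper name, the integer commit ids, a retention of one version, and the extra commit 4 in partition-B are hypothetical and do not come from Hudi's code.

```python
import copy

def clean_keep_latest_file_versions(partitions, versions_retained, scan):
    """Toy KEEP_LATEST_FILE_VERSIONS (hypothetical helper, not Hudi's API).

    In each scanned partition, keep only the newest `versions_retained`
    files; `versions_retained` is assumed to be a positive integer.
    """
    removed = []
    for name in scan:
        commits = sorted(partitions[name])
        stale = commits[:-versions_retained]
        partitions[name] = commits[-versions_retained:]
        removed += [(name, c) for c in stale]
    return removed

partitions = {"partition-A": [0, 1, 2], "partition-B": [3]}

# First clean: full partition scan, keeping one version per partition.
first = clean_keep_latest_file_versions(partitions, 1, scan=list(partitions))

# Assumed follow-up: a later commit (4) touches only partition-B.
partitions["partition-B"].append(4)

# Compare an incremental pass (changed partitions only) with a full scan.
incremental = copy.deepcopy(partitions)
full = copy.deepcopy(partitions)
removed_incremental = clean_keep_latest_file_versions(
    incremental, 1, scan=["partition-B"]
)
removed_full = clean_keep_latest_file_versions(full, 1, scan=list(full))

print(first, removed_incremental, removed_full, incremental == full)
```

Here the incremental pass and the full scan end in the same state, because an untouched partition already holds only its retained version and has nothing left to clean.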

This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS.

Hope this is clear enough.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)