Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/09/19 14:47:00 UTC

[jira] [Assigned] (HUDI-4878) Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS

     [ https://issues.apache.org/jira/browse/HUDI-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-4878:
-----------------------------------------

    Assignee: nicolas paris

> Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
> ----------------------------------------------------------------
>
>                 Key: HUDI-4878
>                 URL: https://issues.apache.org/jira/browse/HUDI-4878
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: cleaning
>            Reporter: sivabalan narayanan
>            Assignee: nicolas paris
>            Priority: Major
>
> Cleaning based on LATEST_FILE_VERSIONS can be improved further, since incremental cleaning is not enabled for it. Let's see if we can improve this.
>  
> Context from the author:
>  
>  
> Currently, incremental cleaning runs for the KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS
> policies. It does not run for KEEP_LATEST_FILE_VERSIONS.
> This can lead to files not being cleaned. This PR fixes the problem by enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.
> Here is a scenario that shows the problem:
> Say we have 3 committed files in partition-A, then add a new commit in partition-B, and trigger cleaning for the first time (full partition scan):
>  {{partition-A/
> commit-0.parquet
> commit-1.parquet
> commit-2.parquet
> partition-B/
> commit-3.parquet}}
> With KEEP_LATEST_COMMITS and CLEANER_COMMITS_RETAINED=3, the cleaner removes commit-0.parquet to keep 3 commits.
> On the next cleaning, incremental cleaning kicks in and won't consider partition-A/ until a new commit changes it. If no later commit changes partition-A, commit-1.parquet will stay forever, although the cleaner should remove it.
> With KEEP_LATEST_FILE_VERSIONS, on the other hand, the cleaner keeps only commit-2.parquet. It then makes sense for incremental cleaning not to consider partition-A until it changes, because only one commit remains there.
> This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS.
> Hope this is clear enough.
>  
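
The scenario above can be reproduced with a small toy model. This is only a sketch under simplifying assumptions, not Hudi's cleaner: the functions `clean` and `run`, the integer commit numbers, and the rule "drop files older than the earliest retained commit" are all hypothetical stand-ins chosen to mirror the description, and KEEP_LATEST_FILE_VERSIONS is assumed to retain a single version.

```python
from typing import Dict, Iterable, List

# Toy table: partition name -> commit numbers of the files it holds.
Table = Dict[str, List[int]]


def clean(table: Table, commits: List[int], policy: str,
          partitions: Iterable[str]) -> None:
    """Clean only the given partitions, in place (simplified policies)."""
    for p in partitions:
        files = sorted(table[p])
        if policy == "KEEP_LATEST_COMMITS":
            # Assumed CLEANER_COMMITS_RETAINED=3: keep files written by the
            # 3 latest commits, and always keep the newest file.
            earliest_retained = sorted(commits)[-3]
            table[p] = [c for c in files
                        if c >= earliest_retained or c == files[-1]]
        elif policy == "KEEP_LATEST_FILE_VERSIONS":
            # Assumed: retain only the single latest file version.
            table[p] = files[-1:]
        else:
            raise ValueError(f"unknown policy: {policy}")


def run(policy: str, incremental: bool) -> Table:
    table: Table = {"partition-A": [0, 1, 2], "partition-B": [3]}
    commits = [0, 1, 2, 3]

    # First cleaning: always a full partition scan.
    clean(table, commits, policy, list(table))

    # A new commit touches partition-B only.
    table["partition-B"].append(4)
    commits.append(4)

    # Second cleaning: incremental mode scans only partitions changed
    # since the previous clean; otherwise scan everything.
    touched = ["partition-B"]
    clean(table, commits, policy, touched if incremental else list(table))
    return table


if __name__ == "__main__":
    for policy in ("KEEP_LATEST_COMMITS", "KEEP_LATEST_FILE_VERSIONS"):
        for incremental in (True, False):
            print(policy, "incremental" if incremental else "full",
                  run(policy, incremental))
```

In this model, KEEP_LATEST_COMMITS with incremental cleaning leaves commit 1 behind in partition-A, while a full scan removes it. KEEP_LATEST_FILE_VERSIONS gives the same result for both scan modes, which is the property the description relies on.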



--
This message was sent by Atlassian Jira
(v8.20.10#820010)