Posted to commits@hudi.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/09/04 01:16:00 UTC

[jira] [Updated] (HUDI-4773) Adding filter mode to Clustering to filter for recent files

     [ https://issues.apache.org/jira/browse/HUDI-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4773:
---------------------------------
    Labels: pull-request-available  (was: )

> Adding filter mode to Clustering to filter for recent files
> -----------------------------------------------------------
>
>                 Key: HUDI-4773
>                 URL: https://issues.apache.org/jira/browse/HUDI-4773
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: clustering
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available
>
> We have a partition-aware clustering strategy and a recent-partitions-based strategy for clustering. These work well when partitioning is based on dates, but what if partitioning is based on some other arbitrary field?
>  
> So, we might need another clustering filter strategy that considers only those file groups that were touched in the last N commits.
> For example, if a user configures clustering to run every 5 commits, each clustering run will consider only the file groups touched in the last 5 commits. This avoids repeatedly clustering file groups that have already been clustered, and clustering will be very fast because only the delta file groups are considered.
>  
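The "last N commits" filter described in the issue can be illustrated with a minimal sketch. This is not Hudi code: the `Commit` class, its fields, and the function name are hypothetical stand-ins, and the sketch assumes each completed commit already exposes the set of file group IDs it wrote to.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for one completed commit on the timeline."""
    instant_time: str                     # fixed-width timestamp, so string order == time order
    touched_file_groups: FrozenSet[str]   # hypothetical file group IDs written by this commit


def file_groups_touched_in_last_n_commits(commits: List[Commit], n: int) -> Set[str]:
    """Return the union of file groups touched by the n most recent commits."""
    if n <= 0:
        return set()
    # Order by instant time and keep only the newest n commits
    # (or all of them if fewer than n exist).
    recent = sorted(commits, key=lambda c: c.instant_time)[-n:]
    touched: Set[str] = set()
    for commit in recent:
        touched |= commit.touched_file_groups
    return touched


# Usage: three commits, clustering considers only the last two.
timeline = [
    Commit("001", frozenset({"fg-a", "fg-b"})),
    Commit("002", frozenset({"fg-c"})),
    Commit("003", frozenset({"fg-a"})),
]
candidates = file_groups_touched_in_last_n_commits(timeline, 2)
```

Here `fg-b` was last written in commit `001`, outside the two-commit window, so it is left out of the clustering candidates while `fg-a` and `fg-c` are kept.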



--
This message was sent by Atlassian Jira
(v8.20.10#820010)