Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/06/09 21:19:00 UTC

[jira] [Assigned] (HUDI-4216) Add support for infinite retention of data files with archival enabled

     [ https://issues.apache.org/jira/browse/HUDI-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-4216:
-----------------------------------------

    Assignee: sivabalan narayanan

> Add support for infinite retention of data files with archival enabled 
> -----------------------------------------------------------------------
>
>                 Key: HUDI-4216
>                 URL: https://issues.apache.org/jira/browse/HUDI-4216
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: archiving
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 0.12.0
>
>
> We can support infinite retention with Hudi (with archival enabled). This would be a good use case for those who want to query a Hudi table as of any time in the past. 
>  
> How to achieve: 
> - Disable cleaner completely. 
> - Enable archival as usual. 
> - Enable the metadata table so that file listing can scale well. 
> Users can then query Hudi with "as.of.timestamp" set to any timestamp in the past. 
>  
> With this, we can let users retain all data for a year or even longer and still query any snapshot in the past. Obviously this comes with additional storage cost, but if users are willing to bear the cost, we should be able to support them. 
>  
> Disabling the cleaner: 
>   option("hoodie.clean.automatic","false").
>   option("hoodie.clean.async","true").
>  
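
The steps above could be collected into a single set of writer options. This is only a minimal sketch: the two cleaner keys are the ones quoted in this issue, `hoodie.metadata.enable` is assumed to be the switch for the metadata table, and the helper name `infinite_retention_options` is hypothetical. Archival is left at its defaults, since the issue says to enable it as usual.

```python
def infinite_retention_options(base=None):
    """Return writer options with the cleaner disabled, as described in this issue."""
    options = dict(base or {})
    options.update({
        # Cleaner settings copied verbatim from the issue text.
        "hoodie.clean.automatic": "false",
        "hoodie.clean.async": "true",
        # Assumption: this key enables the metadata table for scalable file listing.
        "hoodie.metadata.enable": "true",
    })
    return options
```

With a Spark DataFrame writer, each key/value pair would be passed through `.option(key, value)`.
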
> Things to fix:
> Replaced file groups could become active file groups again once the archiver removes the corresponding replace commit from the active timeline. For example, if clustering replaced FG_1 and FG_2, HoodieTableFileSystemView will load all file groups and then filter out the replaced ones. FG_1 and FG_2 will be deduced as replaced if it finds a replace commit pertaining to FG_1 and FG_2 in the active timeline. 
> In the regular flow, the cleaner will clean those file groups, and the timeline files may not matter after that. But here, since the cleaner is completely disabled, we need to fix this. 
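
The problem described above can be illustrated with a toy model. This is not the actual HoodieTableFileSystemView code; the function name, the instant times, and FG_3 (standing in for the clustering output) are all hypothetical. It only shows how filtering replaced file groups by looking at the active timeline stops working once the replace commit has been archived.

```python
def visible_file_groups(all_file_groups, replace_commits, active_instants):
    """Toy filter: hide file groups replaced by a commit still in the active timeline."""
    replaced = set()
    for instant, file_groups in replace_commits.items():
        # Only replace commits present in the active timeline are considered.
        if instant in active_instants:
            replaced.update(file_groups)
    return [fg for fg in all_file_groups if fg not in replaced]


# Hypothetical state: clustering at instant "003" replaced FG_1 and FG_2 with FG_3.
all_file_groups = ["FG_1", "FG_2", "FG_3"]
replace_commits = {"003": {"FG_1", "FG_2"}}

# While "003" is in the active timeline, the replaced file groups are hidden.
before_archival = visible_file_groups(
    all_file_groups, replace_commits, {"001", "002", "003"})

# After "003" is archived and the cleaner never ran, they show up again.
after_archival = visible_file_groups(
    all_file_groups, replace_commits, {"004", "005"})
```

In this model `before_archival` contains only FG_3, while `after_archival` contains all three file groups, which is the behaviour the issue asks to fix.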



--
This message was sent by Atlassian Jira
(v8.20.7#820007)