Posted to commits@hudi.apache.org by "Rajesh Mahindra (Jira)" <ji...@apache.org> on 2022/02/01 04:31:00 UTC

[jira] [Updated] (HUDI-2458) Relax compaction in metadata being fenced based on inflight requests in data table

     [ https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Mahindra updated HUDI-2458:
----------------------------------
    Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Relax compaction in metadata being fenced based on inflight requests in data table
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-2458
>                 URL: https://issues.apache.org/jira/browse/HUDI-2458
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: sivabalan narayanan
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> Relax compaction in metadata being fenced based on inflight requests in data table.
> Compaction in the metadata table is triggered only if there are no inflight requests in the data table. This can cause a liveness problem: in very large deployments, either compaction or clustering could always be in progress. So we should see how we can relax this constraint.
>  
> Proposal to remove this dependency:
> With the recent addition of the spurious deletes config, we can actually remove this dependency.
> As of now, we have 3 interlinked nuances:
>  - Compaction in the metadata table may not kick in if there are any inflight operations in the data table.
>  - A rollback being applied to the metadata table depends on the last compaction instant in the metadata table. We might even throw an exception if the instant being rolled back is earlier than (<) the latest metadata compaction instant time.
>  - Archival in the data table is fenced by the latest compaction in the metadata table.
>  
> So, if the data timeline has any dangling inflight operation (say someone tried clustering, killed it midway, and never attempted it again), metadata compaction will never kick in again. I still need to check what archival does with such inflight operations in the data table when it tries to archive nearby commits.
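
To make the three gates and the dangling-inflight case concrete, here is a minimal sketch. It is not Hudi's actual code: the class and method names are hypothetical, instants are modeled as plain timestamp strings compared lexicographically, and the exact archival bound (strictly earlier than the latest metadata compaction) is an assumption, since the description only says archival is "fenced".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the current fencing rules; not Hudi's real classes or API.
class MetadataFencingSketch {

    // Gate 1: metadata compaction runs only when the data table has no inflight instants.
    static boolean canCompactMetadata(List<String> inflightDataInstants) {
        return inflightDataInstants.isEmpty();
    }

    // Gate 2: a rollback is accepted only if the instant being rolled back is not
    // earlier than the latest metadata compaction instant. The description says an
    // exception might be thrown otherwise; a boolean stands in for that here.
    static boolean canApplyRollback(String instantToRollBack, String latestMetadataCompaction) {
        return instantToRollBack.compareTo(latestMetadataCompaction) >= 0;
    }

    // Gate 3: archival in the data table is fenced by the latest metadata compaction.
    // Assumption: only instants strictly earlier than that compaction may be archived.
    static boolean canArchive(String dataInstant, String latestMetadataCompaction) {
        return dataInstant.compareTo(latestMetadataCompaction) < 0;
    }

    public static void main(String[] args) {
        // A clustering attempt that was killed midway and never retried.
        List<String> dangling = Arrays.asList("20220101000000");
        System.out.println("compaction allowed with dangling inflight: "
                + canCompactMetadata(dangling));
        System.out.println("compaction allowed with clean timeline: "
                + canCompactMetadata(Collections.<String>emptyList()));
    }
}
```

As long as the dangling instant stays on the data timeline, gate 1 never opens, so the latest metadata compaction instant never advances and gate 3 stops making progress as well.
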
>  
> So, with the spurious deletes support we added recently, all of this can be much simplified.
> Whenever we want to apply a rollback commit, we don't need to take different actions based on whether the commit being rolled back is already committed to the metadata table. Just go ahead and apply the rollback; merging of the metadata payload records will take care of it. If the commit was already synced, the final merged payload should not have spurious deletes. If the commit being rolled back was never committed to the metadata table, the final merged payload may have some spurious deletes, which we can ignore.
> With this, compaction in the metadata table does not need any dependency on inflight operations in the data table.
> And we can loosen the dependency of archival in the data table on metadata table compaction as well.
> So, in summary, all 3 dependencies listed above become moot if we go with this approach: archival in the data table does not depend on metadata table compaction; a rollback being applied to the metadata table does not care about the last metadata table compaction; and compaction in the metadata table can proceed even if there are inflight operations in the data table.
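
A rough sketch of why an unconditional rollback is safe once spurious deletes are ignored. This is a simplified model, not Hudi's metadata payload implementation: `FileListingUpdate` and `mergeInOrder` are hypothetical names, and a partition's file listing is reduced to a set of file names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of merging file-listing records for one partition.
class SpuriousDeleteMergeSketch {

    // One metadata record: files added and files deleted by a single instant.
    static final class FileListingUpdate {
        final Set<String> added;
        final Set<String> deleted;

        FileListingUpdate(List<String> added, List<String> deleted) {
            this.added = new TreeSet<>(added);
            this.deleted = new TreeSet<>(deleted);
        }
    }

    // Merge records in timeline order. Deleting a file that was never listed is a
    // no-op, which is how this sketch models "ignoring" a spurious delete.
    static Set<String> mergeInOrder(List<FileListingUpdate> updates) {
        Set<String> live = new TreeSet<>();
        for (FileListingUpdate update : updates) {
            live.addAll(update.added);
            live.removeAll(update.deleted);
        }
        return live;
    }

    public static void main(String[] args) {
        List<String> none = Collections.<String>emptyList();
        FileListingUpdate commit1 = new FileListingUpdate(Arrays.asList("f1.parquet"), none);
        FileListingUpdate commit2 = new FileListingUpdate(Arrays.asList("f2.parquet"), none);
        FileListingUpdate rollbackOfCommit2 = new FileListingUpdate(none, Arrays.asList("f2.parquet"));

        // Case A: commit2 had been synced to the metadata table before the rollback.
        System.out.println(mergeInOrder(Arrays.asList(commit1, commit2, rollbackOfCommit2)));
        // Case B: commit2 never reached the metadata table; its delete is spurious.
        System.out.println(mergeInOrder(Arrays.asList(commit1, rollbackOfCommit2)));
    }
}
```

Under these assumptions both cases end with the same live file set, so the rollback path no longer needs to know whether the rolled-back commit was synced, or where the last metadata compaction sits.
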
>  
> In particular, our logic for applying rollback metadata to the metadata table will become a lot simpler and easier to reason about.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)