
Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2021/10/06 16:17:00 UTC

[jira] [Resolved] (HUDI-2159) Supporting Clustering and Metadata Table together

     [ https://issues.apache.org/jira/browse/HUDI-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan resolved HUDI-2159.
---------------------------------------
    Resolution: Fixed

> Supporting Clustering and Metadata Table together
> -------------------------------------------------
>
>                 Key: HUDI-2159
>                 URL: https://issues.apache.org/jira/browse/HUDI-2159
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Blocker
>             Fix For: 0.10.0
>
>
> I am testing clustering support for a metadata-enabled table and found a few issues.
> *Setup*
> Pipeline 1: Ingestion pipeline with Metadata Table enabled. Runs every 30 mins. 
> Pipeline 2: Clustering pipeline with long running jobs (3-4 hours)
> Pipeline 3: Another clustering pipeline with long running jobs (3-4 hours)
>  
> *Issue #1: Parallel commits on Metadata Table*
> Assume the clustering pipeline is completing T5.replacecommit and the ingestion pipeline is completing T10.commit. The Metadata Table will be synced up to an instant earlier than T5 (say T4), since it only syncs in completion order.
> Now both the pipelines will call syncMetadataTable() which will do the following:
>  # Find all un-synced instants from dataset (T5, T6 ... T10)
>  # Read each instant and perform a deltacommit on the Metadata Table with the same timestamp as instant.
> There is a chance that two processes perform a deltacommit at T5 on the Metadata Table and one will fail (instant file already exists). This raises an exception that is detected as a pipeline failure, leading to false-positive alerts.
>  
> *Issue #2: No archiving/rollback support for failed clustering operations*
> If a clustering operation fails, it leaves behind a T5.replacecommit.inflight. There is no automated way to roll back or archive these. Since clustering is generally a long running operation and may be run through multiple pipelines at the same time, automated rollback of left-over inflights does not work, as we cannot be sure that the process is dead.
> Metadata Table sync only works in completion order. So if T5.replacecommit.inflight is left over, the Metadata Table will not sync beyond T5, causing a large number of LogBlocks to pile up. These have to be merged in memory, which degrades performance.
>  
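
The collision described in Issue #1 can be illustrated with a small, self-contained Python sketch. It is not Hudi code: the directory layout and the names try_deltacommit and sync_metadata_table are hypothetical, and the two "pipelines" are run one after the other rather than truly in parallel. It only shows that an exclusive file create lets exactly one writer win each instant, and that one possible reaction (an assumption here, not necessarily the fix adopted for this issue) is to treat "already exists" as a skip instead of a pipeline failure.

```python
import os
import tempfile


def try_deltacommit(timeline_dir: str, instant: str) -> bool:
    """Create the instant file exclusively; return False if it already exists."""
    path = os.path.join(timeline_dir, f"{instant}.deltacommit")
    try:
        # O_EXCL makes the create fail if another writer already made the file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def sync_metadata_table(timeline_dir: str, unsynced: list) -> list:
    """Apply each un-synced instant; skip those another writer already committed."""
    applied = []
    for instant in unsynced:
        if try_deltacommit(timeline_dir, instant):
            applied.append(instant)
    return applied


# Hypothetical stand-in for the Metadata Table timeline directory.
timeline_dir = tempfile.mkdtemp()
unsynced = ["T5", "T6", "T7", "T8", "T9", "T10"]

# Both pipelines see the same un-synced instants and try to apply them.
first = sync_metadata_table(timeline_dir, unsynced)
second = sync_metadata_table(timeline_dir, unsynced)
```

In this sketch the first caller applies every instant and the second applies none, without either raising an exception.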

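The blocking behaviour described in Issue #2 can be modelled in a few lines of Python. This is a simplified illustration under stated assumptions, not Hudi's implementation: the timeline is a list of (instant time, state) pairs with integer instant times, and syncable_instants is a hypothetical helper that stops at the first instant that is not completed.

```python
def syncable_instants(timeline, last_synced):
    """Return completed instants after last_synced, stopping at the first
    instant that is still inflight."""
    result = []
    for instant, state in sorted(timeline):
        if instant <= last_synced:
            continue
        if state != "completed":
            # An inflight instant blocks everything that comes after it.
            break
        result.append(instant)
    return result


# T5 is a left-over inflight replacecommit; later commits have completed.
timeline = [
    (4, "completed"),
    (5, "inflight"),
    (6, "completed"),
    (7, "completed"),
    (10, "completed"),
]
blocked = syncable_instants(timeline, last_synced=4)

# Once T5 is resolved (modelled here as completed), the sync can catch up.
resolved = [(t, "completed") for t, _ in timeline]
unblocked = syncable_instants(resolved, last_synced=4)
```

While T5 stays inflight nothing after T4 is eligible for sync, which is why the un-synced work accumulates until the inflight instant is dealt with.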


--
This message was sent by Atlassian Jira
(v8.3.4#803005)