Posted to commits@hudi.apache.org by "Prashant Wason (Jira)" <ji...@apache.org> on 2020/10/08 00:09:00 UTC

[jira] [Commented] (HUDI-1325) Handle the corner case with syncing completed compaction from data timeline to metadata timeline.

    [ https://issues.apache.org/jira/browse/HUDI-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209938#comment-17209938 ] 

Prashant Wason commented on HUDI-1325:
--------------------------------------

[~vinoth] One way to handle this is to merge the changes from the compaction instant in memory.

So on the reader side, we read:
1. From the HFile base file
2. From the log file blocks up to the last-sync instant
3. The in-memory changes
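
The three read steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Hudi's actual reader: the function names, the dict and tuple shapes, the use of None to mark a deleted file, and the plain string comparison of instant times are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical shapes: a listing maps file name -> size; None marks a deletion.
Changes = Dict[str, Optional[int]]
LogBlock = Tuple[str, Changes]  # (instant time, changes recorded at that instant)


def apply_changes(listing: Dict[str, int], changes: Changes) -> None:
    """Apply additions/updates and deletions to the listing in place."""
    for name, size in changes.items():
        if size is None:
            listing.pop(name, None)
        else:
            listing[name] = size


def merged_file_listing(base: Dict[str, int],
                        log_blocks: List[LogBlock],
                        last_sync_instant: str,
                        in_memory_changes: Changes) -> Dict[str, int]:
    # 1. Start from the contents of the base file.
    merged = dict(base)
    # 2. Replay log blocks up to and including the last-sync instant.
    #    Assumes instant times are strings that sort chronologically.
    for instant, changes in log_blocks:
        if instant <= last_sync_instant:
            apply_changes(merged, changes)
    # 3. Overlay the not-yet-synced changes from the compaction instant.
    apply_changes(merged, in_memory_changes)
    return merged
```

The in-memory overlay is applied last so that the base file produced by the completed compaction is visible to the reader even though it has not been written to the metadata table yet.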

Since this issue only arises for async compaction and async clean operations, it should not add too much overhead. Also, this is probably not required for async clean, as queries should not care about files that may be async-cleaned anyway.

> Handle the corner case with syncing completed compaction from data timeline to metadata timeline. 
> --------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-1325
>                 URL: https://issues.apache.org/jira/browse/HUDI-1325
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: Prashant Wason
>            Priority: Major
>
> Here is a corner case with syncing a completed compaction from the data timeline to the metadata timeline. Consider the following sequence of events:
> t0: writer schedules compaction at time instant c
> t1: Compactor starts processing c's plan
> t2: compaction finishes with c.commit published on the data timeline (not yet synced to metadata timeline)
> t3: Next round of writing, writer opens metadata table, which adds the base file produced in c.commit to metadata table.
> Any queries running between t2 and t3 cannot rely on metadata, since the new base file will not yet be present in the metadata table. The timeline will indicate that the compaction completed, and the latest file slice will be computed as simply the logs written to the file groups since compaction. This will lead to incorrect results.
> If we consider the writer alone, we may be okay, since we first sync the metadata table before we do anything for the delta commit at t3. But in general, for queries, we should advise enabling metadata-table-based listings only after all writers, cleaners, and compactors have been enabled to use metadata and have been successfully using it to publish new/deleted files directly to the metadata table. In short, queries cannot rely on the metadata table while the syncing mechanism is the main thing that keeps the data and metadata timelines together.
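
The stale window between t2 and t3 described in the issue can be illustrated with a toy model. The file names, the (name, kind, base instant) record shape, and the latest_file_slice helper below are all hypothetical and heavily simplified. They are not Hudi APIs.

```python
from typing import List, Optional, Tuple

# Hypothetical file record: (file name, kind, base instant); kind is "base" or "log".
FileRecord = Tuple[str, str, str]


def latest_file_slice(listing: List[FileRecord],
                      instant: str) -> Tuple[Optional[str], List[str]]:
    """Return (base file, log files) of the slice whose base instant is `instant`."""
    base = next((name for name, kind, ts in listing
                 if kind == "base" and ts == instant), None)
    logs = [name for name, kind, ts in listing
            if kind == "log" and ts == instant]
    return base, logs


# State between t2 and t3: c.commit is on the data timeline, but the base
# file it produced has not been synced to the metadata table yet.
file_system = [
    ("fg1_c.base", "base", "c"),
    ("fg1_c.log.1", "log", "c"),
]
metadata_table = [f for f in file_system if f[0] != "fg1_c.base"]

correct_slice = latest_file_slice(file_system, "c")
stale_slice = latest_file_slice(metadata_table, "c")
```

In this model, a listing served from the metadata table yields a slice with log files but no base file, which is the incorrect result the issue warns about, while a direct file system listing yields the complete slice.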



--
This message was sent by Atlassian Jira
(v8.3.4#803005)