Posted to commits@hudi.apache.org by "Prashant Wason (Jira)" <ji...@apache.org> on 2021/08/07 05:43:00 UTC

[jira] [Updated] (HUDI-2285) Metadata Table Synchronous Design

     [ https://issues.apache.org/jira/browse/HUDI-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-2285:
---------------------------------
    Description: 
h2. *Motivation*

HUDI Metadata Table version 1 (0.7 release) supports the file-listing optimization. We intend to add support for additional information - a record-level index (UUID) and column indexes (column range indexes) - to the metadata table. This requires re-architecting the table design for large scale (50 billion+ records), for synchronous operations, and to reduce reader-side overhead.
 # Limit the amount of syncing required on the reader side
 ## Syncing on the reader side may negate the benefits of the secondary index
 ## Not syncing on the reader side simplifies the design and reduces testing
 # Allow moving to a multi-writer design with operations running in separate pipelines
 ## E.g. Clustering / Clean / Backfills in separate pipelines
 # Ease of debugging
 # Scale - should be able to handle very large inserts - millions of keys, thousands of data files written

 
h3. *Writer Side*

The lifecycle of a HUDI operation is listed below. The example shows a COMMIT operation, but the same steps apply to all operation types. A simplified sketch of this flow appears at the end of this section.
 # SparkHoodieWriteClient.commit(...) is called by the ingestion process at time t1
 # Create the requested instant t1.commit.requested
 # Create the inflight instant t1.commit.inflight
 # Perform the write of the RDD into the dataset and create the HoodieCommitMetadata
 # Call HoodieMetadataTableWriter.update(CommitMetadata, t1, WriteStatus)
 ## This performs a delta-commit into the HUDI Metadata Table, updating the file listing, record-level index (future) and column indexes (future) together from the data collected in the WriteStatus.
 ## This delta-commit completes before the commit started on the dataset completes.
 ## This creates t1.deltacommit on the Metadata Table.
 ## Since the Metadata Table has inline clean and inline compaction, those additional operations may also take place at this time.
 # Complete the commit by creating t1.commit

Inline compaction will only compact those log blocks which can be deemed readable as per the reader-side algorithm described in the next section.
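Below is a minimal, hypothetical sketch of this ordering - the metadata table deltacommit completing before the dataset commit is finalized. The interfaces and stub types are simplified stand-ins for illustration only, not the actual Hudi classes.

{code:java}
import java.util.List;

// Placeholder stand-ins for HoodieCommitMetadata and WriteStatus.
class CommitMetadataStub {}
class WriteStatusStub {}

// Simplified view of the dataset timeline transitions for one commit.
interface DatasetTimeline {
  void createRequested(String instant);                              // t1.commit.requested
  void createInflight(String instant);                               // t1.commit.inflight
  void createCompleted(String instant, CommitMetadataStub metadata); // t1.commit
}

// Simplified view of the metadata table writer.
interface MetadataTableWriter {
  // Applies the commit metadata as a deltacommit at the same instant time;
  // inline clean / inline compaction of the metadata table may also run here.
  void update(CommitMetadataStub metadata, String instant, List<WriteStatusStub> statuses);
}

class SynchronousCommitFlow {
  private final DatasetTimeline dataset;
  private final MetadataTableWriter metadataWriter;

  SynchronousCommitFlow(DatasetTimeline dataset, MetadataTableWriter metadataWriter) {
    this.dataset = dataset;
    this.metadataWriter = metadataWriter;
  }

  void commit(String t1, CommitMetadataStub metadata, List<WriteStatusStub> statuses) {
    dataset.createRequested(t1);                    // step 2: t1.commit.requested
    dataset.createInflight(t1);                     // step 3: t1.commit.inflight
    // step 4: the data files for this commit were written before this point
    metadataWriter.update(metadata, t1, statuses);  // step 5: t1.deltacommit completes first
    dataset.createCompleted(t1, metadata);          // step 6: t1.commit is created last
  }
}
{code}

Because the deltacommit on the metadata table is created before t1.commit, a reader that sees a completed instant on the dataset can rely on the corresponding metadata already being present.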
h3. *Reader Side*
 # List the dataset to find all completed instants - e.g. t1.commit, t2.commit … t10.commit
 ## Since these instants are completed, their related metadata has already been written to the metadata table as part of the respective deltacommits - t1.deltacommit, t2.deltacommit … t10.deltacommit
 # Find the last completed instant on the dataset - t10.commit
 # Open the FileSlice on the metadata partition with the following constraints (a simplified sketch of these checks follows the list):
 ## Any base file with a timestamp > t10 cannot be used
 ## Any log block whose timestamp is not in the list of completed instants (#1 above) cannot be used
 # Only in ingestion-failure cases may the latest base file (created by compaction) or some log blocks have to be ignored. In the success case, this process should not add any extra overhead beyond listing the dataset.
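A minimal sketch of these validity checks, assuming simplified stand-in types for base files and log blocks rather than the actual FileSlice structures (Hudi instant times are fixed-width timestamp strings, so lexicographic comparison matches chronological order):

{code:java}
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class MetadataReaderView {
  // Stand-ins; only the instant time matters for the validity checks.
  // (Java records used for brevity.)
  record BaseFile(String instantTime) {}
  record LogBlock(String instantTime) {}

  // A base file is usable only if it was created at or before the last
  // completed instant on the dataset (t10 in the example above).
  static boolean isBaseFileUsable(BaseFile baseFile, String lastCompletedInstant) {
    return baseFile.instantTime().compareTo(lastCompletedInstant) <= 0;
  }

  // A log block is usable only if its timestamp belongs to an instant that
  // completed on the dataset timeline.
  static List<LogBlock> usableLogBlocks(List<LogBlock> blocks, Set<String> completedInstants) {
    return blocks.stream()
        .filter(b -> completedInstants.contains(b.instantTime()))
        .collect(Collectors.toList());
  }
}
{code}

With checks like these, a failed ingestion (whose instant never completes on the dataset) simply leaves its log blocks unread; the metadata table does not need to be rolled back before the next read.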

 
h3. *Multi Write Support*

Since each operation on the metadata table writes to the same files (the file-listing partition has a single FileSlice), we can only allow single-writer access to the metadata table. For this, the Transaction Manager is used to lock the table before any updates.

In essence, each multi-writer operation will contend for the same lock to write updates to the metadata table before the operation completes. This may not even be an issue in reality as the operations will complete at different times and the metadata table operations should be fast.
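A minimal illustration of this single-writer contention pattern, using a plain in-process lock as a stand-in for the Transaction Manager / lock provider (for separate pipelines this would typically be an external lock):

{code:java}
import java.util.concurrent.locks.ReentrantLock;

class MetadataTableTransaction {
  // Every writer (ingestion, clean, clustering, backfill, ...) contends for
  // this one lock before updating the metadata table.
  private static final ReentrantLock METADATA_TABLE_LOCK = new ReentrantLock();

  static void updateMetadataTable(Runnable applyDeltaCommit) {
    METADATA_TABLE_LOCK.lock();
    try {
      applyDeltaCommit.run();   // write the deltacommit to the metadata table
    } finally {
      METADATA_TABLE_LOCK.unlock();
    }
  }
}
{code}

Usage, continuing the writer-side sketch above: {{MetadataTableTransaction.updateMetadataTable(() -> metadataWriter.update(metadata, t1, statuses))}}.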

> Metadata Table Synchronous Design
> ---------------------------------
>
>                 Key: HUDI-2285
>                 URL: https://issues.apache.org/jira/browse/HUDI-2285
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)