Posted to commits@hudi.apache.org by "Prashant Wason (Jira)" <ji...@apache.org> on 2023/04/18 19:46:00 UTC

[jira] [Created] (HUDI-6098) Initial commit in MDT should use bulk insert for performance

Prashant Wason created HUDI-6098:
------------------------------------

             Summary: Initial commit in MDT should use bulk insert for performance
                 Key: HUDI-6098
                 URL: https://issues.apache.org/jira/browse/HUDI-6098
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason


The initial commit into the MDT writes a very large number of records. With indexes like the record index (to be committed), the number of written records is on the order of the total number of records in the dataset itself (could be in the billions).

If we use upsertPrepped to initialize the indexes, then:
 # The initial commit will write data into log files
 # Due to the large amount of data, the write will be split into a very large number of log blocks
 # Performance of lookups from the MDT will suffer greatly until a compaction is run
 # Compaction will read back all the log data and write it into base files (HFiles), doubling the read/write IO

By directly writing the initial commit into base files using the bulkInsertPrepped API, we can avoid all the issues listed above.
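The IO saving above can be sketched with rough arithmetic. This is only an illustration with hypothetical sizes, not a measurement of Hudi itself:

```python
# Hypothetical dataset size, for illustration only.
record_bytes = 100
num_records = 1_000_000_000
payload = record_bytes * num_records  # total bytes of index records

# Path 1: upsertPrepped writes log files, then compaction reads the
# logs back and rewrites them into base files (HFiles).
upsert_write_io = payload               # initial write into log blocks
compaction_io = payload + payload       # read logs + write base files
total_upsert_path = upsert_write_io + compaction_io

# Path 2: bulkInsertPrepped writes base files (HFiles) directly.
total_bulk_insert_path = payload

print(total_upsert_path // total_bulk_insert_path)  # prints 3
```

Under these assumptions the upsert-then-compact path moves roughly 3x the bytes of the direct bulk-insert path, and lookups pay the log-block merge cost until the compaction completes.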

This is a critical requirement for large scale indexes like record index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)