Posted to commits@hudi.apache.org by "Prashant Wason (Jira)" <ji...@apache.org> on 2023/04/18 19:46:00 UTC
[jira] [Created] (HUDI-6098) Initial commit in MDT should use bulk insert for performance
Prashant Wason created HUDI-6098:
------------------------------------
Summary: Initial commit in MDT should use bulk insert for performance
Key: HUDI-6098
URL: https://issues.apache.org/jira/browse/HUDI-6098
Project: Apache Hudi
Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason
The initial commit into the MDT writes a very large number of records. With indexes like the record index (to be committed), the number of written records is on the order of the total number of records in the dataset itself (potentially billions).
If we use upsertPrepped to initialize the indexes, then:
# The initial commit writes all data into log files
# Due to the large amount of data, the write is split into a very large number of log blocks
# Lookup performance from the MDT suffers greatly until a compaction is run
# Compaction reads all the log data and writes it into base files (HFiles), doubling the read/write IO
By writing the initial commit directly into base files using the bulkInsertPrepped API, we avoid all of the issues listed above.
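The write-amplification argument above can be made concrete with a toy model. This is a minimal sketch, not Hudi code: the method names and record counts below are illustrative assumptions that only demonstrate the doubling of write IO when the upsertPrepped path is followed by compaction.

```java
public class MdtInitIoModel {
    // Illustrative assumption: a record-index initial commit writes roughly
    // one MDT record per dataset record.
    static final long TOTAL_RECORDS = 1_000_000_000L;

    // Path 1 (upsertPrepped): the initial commit lands in log files, and a
    // later compaction re-writes every record into base files (HFiles).
    static long upsertPreppedWriteIo(long records) {
        long logWrites = records;        // initial commit into log blocks
        long compactionWrites = records; // compaction re-writes the same data
        return logWrites + compactionWrites;
    }

    // Path 2 (bulkInsertPrepped): records are written directly to base files once.
    static long bulkInsertPreppedWriteIo(long records) {
        return records;
    }

    public static void main(String[] args) {
        long upsertIo = upsertPreppedWriteIo(TOTAL_RECORDS);
        long bulkIo = bulkInsertPreppedWriteIo(TOTAL_RECORDS);
        System.out.println("upsertPrepped write IO:     " + upsertIo);
        System.out.println("bulkInsertPrepped write IO: " + bulkIo);
        // The upsert path writes every record twice before reads are fast again.
        assert upsertIo == 2 * bulkIo;
    }
}
```

Until compaction runs, the upsert path also pays on the read side: every MDT lookup must merge the many log blocks, which is the lookup-performance penalty described above.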
This is a critical requirement for large scale indexes like record index.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)