Posted to commits@hudi.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/04/18 19:52:00 UTC
[jira] [Updated] (HUDI-6098) Initial commit in MDT should use bulk insert for performance
[ https://issues.apache.org/jira/browse/HUDI-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-6098:
---------------------------------
Labels: pull-request-available (was: )
> Initial commit in MDT should use bulk insert for performance
> ------------------------------------------------------------
>
> Key: HUDI-6098
> URL: https://issues.apache.org/jira/browse/HUDI-6098
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available
>
> The initial commit into the MDT writes a very large number of records. With indexes like the record index (to be committed), the number of written records is on the order of the total number of records in the dataset itself (potentially billions).
> If we use upsertPrepped to initialize the indexes, then:
> # The initial commit will write all data into log files.
> # Due to the large volume of data, the write will be split into a very large number of log blocks.
> # Lookup performance from the MDT will suffer greatly until a compaction is run.
> # Compaction will read back all the log data and rewrite it into base files (HFiles), doubling the read/write IO.
> By writing the initial commit directly into base files using the bulkInsertPrepped API, we can avoid all of the issues listed above.
> This is a critical requirement for large-scale indexes like the record index.
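The write-amplification argument above can be sketched with a back-of-the-envelope IO model. This is a hypothetical illustration only, not Hudi code: the class and method names below are invented for this sketch and are not part of the Hudi API.

```java
/**
 * Hypothetical IO model (illustrative, not Hudi code) contrasting the two
 * MDT initialization paths described in the issue.
 */
public class MdtInitIoModel {

    /**
     * upsertPrepped path: the initial commit lands in log files, then a
     * compaction must read every log block back and rewrite it into base
     * files (HFiles), roughly tripling total IO relative to the payload.
     */
    static long logThenCompactIoBytes(long recordBytes) {
        long writeLogs = recordBytes;        // initial commit writes log blocks
        long compactionRead = recordBytes;   // compaction reads all log data back
        long compactionWrite = recordBytes;  // compaction rewrites it as base files
        return writeLogs + compactionRead + compactionWrite;
    }

    /** bulkInsertPrepped path: records go straight into base files, once. */
    static long bulkInsertIoBytes(long recordBytes) {
        return recordBytes; // single write, no compaction required
    }

    public static void main(String[] args) {
        long payload = 1L << 40; // e.g. a ~1 TiB record-index payload
        System.out.println("upsertPrepped + compaction IO bytes: "
                + logThenCompactIoBytes(payload));
        System.out.println("bulkInsertPrepped IO bytes:          "
                + bulkInsertIoBytes(payload));
    }
}
```

Under this model, the bulk-insert path also avoids the degraded read performance between the initial commit and the first compaction, since no log-block merging is needed at lookup time.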
--
This message was sent by Atlassian Jira
(v8.20.10#820010)