Posted to commits@hudi.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/04/18 19:52:00 UTC

[jira] [Updated] (HUDI-6098) Initial commit in MDT should use bulk insert for performance

     [ https://issues.apache.org/jira/browse/HUDI-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6098:
---------------------------------
    Labels: pull-request-available  (was: )

> Initial commit in MDT should use bulk insert for performance
> ------------------------------------------------------------
>
>                 Key: HUDI-6098
>                 URL: https://issues.apache.org/jira/browse/HUDI-6098
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>
> The initial commit into the MDT writes a very large number of records. With indexes like the record index (to be committed), the number of written records is on the order of the total number of records in the dataset itself, which could be in the billions.
> If we use upsertPrepped to initialize the indexes, then:
>  # The initial commit will write its data into log files.
>  # Due to the large volume of data, the write will be split into a very large number of log blocks.
>  # Lookup performance from the MDT will suffer greatly until a compaction is run.
>  # Compaction will rewrite all the log data into base files (HFiles), doubling the read/write IO.
> By writing the initial commit directly into base files using the bulkInsertPrepped API, as sketched below, we can avoid all of the issues listed above.
> This is a critical requirement for large-scale indexes like the record index.
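> A minimal sketch of the proposed change, assuming a SparkRDDWriteClient that is already configured against the metadata table (the helper name and the record preparation are illustrative, not Hudi's actual MDT bootstrap code):
> {code:java}
> import org.apache.hudi.client.SparkRDDWriteClient;
> import org.apache.hudi.common.model.HoodieRecord;
> import org.apache.hudi.common.util.Option;
> import org.apache.hudi.metadata.HoodieMetadataPayload;
> import org.apache.spark.api.java.JavaRDD;
>
> class MdtBootstrapSketch {
>   // Illustrative helper: write the already-prepared (file-group-tagged)
>   // records for the MDT's initial commit.
>   static void writeInitialCommit(SparkRDDWriteClient<HoodieMetadataPayload> writeClient,
>                                  JavaRDD<HoodieRecord<HoodieMetadataPayload>> preppedRecords,
>                                  String instantTime) {
>     // Current approach: upsertPrepped routes all records through log files,
>     // requiring a later compaction to produce the HFile base files:
>     // writeClient.upsertPreppedRecords(preppedRecords, instantTime);
>
>     // Proposed approach: write base files directly; Option.empty() falls
>     // back to the default bulk-insert partitioner.
>     writeClient.bulkInsertPreppedRecords(preppedRecords, instantTime, Option.empty());
>   }
> }
> {code}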



--
This message was sent by Atlassian Jira
(v8.20.10#820010)