Posted to commits@hudi.apache.org by "JoshuaZhuCN (via GitHub)" <gi...@apache.org> on 2023/01/29 07:47:52 UTC

[GitHub] [hudi] JoshuaZhuCN commented on issue #7602: [SUPPORT] When does the Spark engine's bulk insert mode support bucket index

JoshuaZhuCN commented on issue #7602:
URL: https://github.com/apache/hudi/issues/7602#issuecomment-1407590148

   > W/ bucket index, what perf issue are you seeing? From what I know, there may not be any small file handling even w/ "insert" as the operation type if you are using a bucket index. So, it should be pretty close to bulk_insert. I mean, even if we add bucket index support to bulk_insert, it will perform similar to how insert works today.
   > 
   > Essentially, we take the hash of the record key and find the file group to insert into, and this goes into the merge handle where we merge incoming records w/ the existing file group.
   
   @nsivabalan If records are written with the bucket index in insert mode, log files are generated first, and parquet base files are produced only after compaction is triggered. This differs from the other index types, where inserts write parquet base files directly and only updates and deletes generate log files.
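
   The routing idea described above (hash the record key, take it modulo the bucket count to pick a fixed file group) can be sketched as follows. This is a hypothetical illustration of the general technique, not Hudi's actual `BucketIdentifier` code; the class and method names here are invented for the example.

   ```java
   // Sketch of hash-based bucket routing: a record key always maps to the
   // same bucket id, so updates/deletes can find their file group without
   // a separate index lookup. Names are illustrative, not Hudi internals.
   public class BucketRouting {

       // Map a record key to one of numBuckets stable bucket ids.
       static int bucketIdFor(String recordKey, int numBuckets) {
           // Mask the sign bit so the modulo result is always non-negative.
           return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
       }

       public static void main(String[] args) {
           int numBuckets = 8;
           // The same key lands in the same bucket on every write.
           int first = bucketIdFor("order-1001", numBuckets);
           int second = bucketIdFor("order-1001", numBuckets);
           System.out.println(first == second);   // prints true
       }
   }
   ```

   Because the bucket count is fixed at table creation, this routing is deterministic across writers, which is why insert and bulk_insert would behave similarly under a bucket index.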


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org