Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/04 04:10:01 UTC

[GitHub] [hudi] nsivabalan commented on issue #2338: [SUPPORT] MOR table found duplicate and process so slowly

nsivabalan commented on issue #2338:
URL: https://github.com/apache/hudi/issues/2338#issuecomment-753746016

@so-lazy : I am looping in @bvaradar to help you out here. In the meantime, here is some context on Global_Bloom. Hudi has two kinds of indexes, regular and global. With a regular index (regular bloom, for example), record keys are unique only within a partition, so the same record key can exist in different partitions. Within a partition, Hudi takes care of updating records based on their record keys and serves you only the latest snapshot for each record key of interest.
With the global versions, by contrast, record keys are unique across the entire dataset. In other words, the same record key cannot exist in different partitions. So, if you insert a record, rec_1, into partition1 and later try to insert the same record (rec_1) into a different partition, say partition2, Hudi by default will update the record in partition1. There is a config you can set, though, in which case Hudi will instead delete rec_1 from partition1 and insert it into partition2.
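
To make the rec_1 example concrete, here is a toy model in plain Python. It is not Hudi code: `ToyTable`, `upsert`, `partitions_of` and the `update_partition_path` flag are hypothetical names made up for illustration, and the sketch only mimics how an upsert gets routed under each kind of index.

```python
from typing import Dict, Optional, Tuple


class ToyTable:
    """Toy model of upsert routing. Hypothetical, not Hudi's implementation."""

    def __init__(self, global_index: bool, update_partition_path: bool = False):
        self.global_index = global_index
        # Only meaningful for a global index: move the record to the new
        # partition instead of updating it in place in the old one.
        self.update_partition_path = update_partition_path
        # (partition, record_key) -> value
        self.rows: Dict[Tuple[str, str], str] = {}

    def _find_partition(self, key: str) -> Optional[str]:
        # A global lookup has to consider every partition in the table.
        for partition, existing_key in self.rows:
            if existing_key == key:
                return partition
        return None

    def upsert(self, partition: str, key: str, value: str) -> None:
        if self.global_index:
            existing = self._find_partition(key)
            if existing is not None and existing != partition:
                if self.update_partition_path:
                    # Delete from the old partition, insert into the new one.
                    del self.rows[(existing, key)]
                else:
                    # Default: update the record where it already lives.
                    partition = existing
        self.rows[(partition, key)] = value

    def partitions_of(self, key: str) -> list:
        return sorted(p for p, k in self.rows if k == key)


# Regular index: rec_1 ends up in both partitions.
regular = ToyTable(global_index=False)
regular.upsert("partition1", "rec_1", "v1")
regular.upsert("partition2", "rec_1", "v2")

# Global index, default behaviour: rec_1 is updated in partition1.
glob_default = ToyTable(global_index=True)
glob_default.upsert("partition1", "rec_1", "v1")
glob_default.upsert("partition2", "rec_1", "v2")

# Global index with the move flag set: rec_1 lives only in partition2.
glob_move = ToyTable(global_index=True, update_partition_path=True)
glob_move.upsert("partition1", "rec_1", "v1")
glob_move.upsert("partition2", "rec_1", "v2")
```

In the regular case `partitions_of("rec_1")` returns both partitions, which is the kind of cross-partition duplicate a regular index allows. In the two global cases the key exists in exactly one partition.
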
This is the major difference between the regular and global versions of the index. Because a global index has to look up every incoming record against all partitions, it is known to be less performant than a regular index. So, unless you have this requirement, I would suggest using a regular index (BLOOM, for example).
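
For reference, a rough sketch of how the index choice could be expressed as writer options. The helper `hudi_index_options` is hypothetical. The keys `hoodie.index.type` and `hoodie.bloom.index.update.partition.path` are what I believe the relevant Hudi configs are called, but please verify the names and defaults against the configuration docs for your Hudi version.

```python
def hudi_index_options(use_global: bool, move_on_partition_change: bool = False) -> dict:
    """Build index-related writer options (hypothetical helper, not part of Hudi)."""
    if not use_global:
        # Regular index: record keys are unique per partition only.
        return {"hoodie.index.type": "BLOOM"}
    return {
        # Global index: record keys are unique across the whole table.
        "hoodie.index.type": "GLOBAL_BLOOM",
        # Assumed key for "delete from the old partition, insert into the new one".
        "hoodie.bloom.index.update.partition.path": str(move_on_partition_change).lower(),
    }


regular_opts = hudi_index_options(use_global=False)
global_move_opts = hudi_index_options(use_global=True, move_on_partition_change=True)
```

Assuming a PySpark writer, these would be passed along with the rest of your Hudi options, e.g. `df.write.format("hudi").options(**regular_opts)`.
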
