Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/18 17:43:12 UTC

[GitHub] [hudi] tjtoll commented on issue #4873: Processing time very Slow Updating records into Hudi Dataset(MOR) using AWS Glue

tjtoll commented on issue #4873:
URL: https://github.com/apache/hudi/issues/4873#issuecomment-1072645191


   > Since you have a complex record key, I suspect the range pruning w/ bloom is not effective. Bloom filters are effective only if your record keys have some timestamp characteristics, so that we can trim a few file groups using just the min and max record-key values stored in them.
   > 
   > So I would recommend trying out the "SIMPLE" index instead; for random or large updates, it might work out better. Do give [this](https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/) blog a read to understand the index types in Hudi. Also, you can check out the configs for the simple index [here](https://hudi.apache.org/docs/next/configurations#hoodiesimpleindexparallelism).
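   
   For concreteness, here is a minimal sketch of what the suggested SIMPLE-index configs might look like in a Glue (PySpark) job; the table name, record key, partition field, and S3 path are hypothetical placeholders:
   
   ```python
   # Hedged sketch: Hudi writer options switching the index type from the
   # default BLOOM to SIMPLE. All field names and paths are hypothetical.
   hudi_options = {
       "hoodie.table.name": "my_table",                       # hypothetical
       "hoodie.datasource.write.recordkey.field": "id",       # hypothetical complex/random key
       "hoodie.datasource.write.partitionpath.field": "dt",   # hypothetical date partition
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.operation": "upsert",
       # SIMPLE index joins incoming records against existing base files
       # instead of relying on bloom filters + min/max range pruning:
       "hoodie.index.type": "SIMPLE",
       # Parallelism of that lookup join:
       "hoodie.simple.index.parallelism": "200",
   }
   
   df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/hudi/my_table")
   ```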
   
   Is it only the record key that needs the timestamp characteristics, or does the partitioning matter as well? For example, if I have a random record key but my partitions are by date, is BLOOM still beneficial? 
   
   Also, on tables where I do have an incrementing record key, why doesn't Hudi sort the records before writing them? The files it writes have huge, overlapping ranges of record keys.
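   
   For what it's worth, here is a minimal sketch of how an initial load could be made to produce tight key ranges, assuming the hoodie.bulkinsert.sort.mode config applies to this pipeline (it governs bulk_insert, not upsert, so it may not explain the overlapping files above):
   
   ```python
   # Hedged sketch: on a bulk_insert, ask Hudi to globally sort records so
   # each written file covers a tight, non-overlapping range of record keys.
   # Reuses the hypothetical hudi_options from the sketch above.
   bulk_options = dict(hudi_options)
   bulk_options.update({
       "hoodie.datasource.write.operation": "bulk_insert",
       "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",  # sort by partition path + record key
   })
   
   df.write.format("hudi").options(**bulk_options).mode("overwrite").save("s3://my-bucket/hudi/my_table")
   ```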

