Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/19 01:52:52 UTC

[GitHub] [hudi] vinothchandar commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

vinothchandar commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711466156


   >However, what's the recommended approach in terms of indexing here? I see various features are available out of the box.
   
   I've been meaning to write a blog post that walks through the options here. Interested in being a reviewer? That would also help you understand this better :)
   
   In short, you can pick an index based on your workload. (We intend to make dynamic_bloom the default from 0.7.0 onward, which should help with this issue.)
   
   - If your records have ordered keys (e.g. a timestamp prefix), then the Bloom index with range pruning will do an excellent job. It can quickly prune out a large number of files and use bloom filters only for the rest.
   - If your record keys have no ordering (e.g. UUIDs), but mostly the recent partitions are updated, with a long tail of updates/deletes to older partitions, then the Bloom index will still be faster. In that case it is better to turn off range pruning, since it does not help and just incurs the cost of checking.
   - If your update patterns are totally random, i.e. each commit affects almost every file, you can use SIMPLE_INDEX (which joins against the entire table) or, if you have HBase, the HBASE index. You can also write your own index; it's pluggable. We are working on adding a record-level index natively within Hudi, hopefully in the next major release.
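
As a rough sketch, the three cases above could be mapped to writer options like this. The helper `index_options` and its workload labels are hypothetical names made up for illustration. The option keys and values (`hoodie.index.type`, `hoodie.bloom.index.prune.by.ranges`, `hoodie.bloom.index.filter.type`, `SIMPLE`, `DYNAMIC_V0`) are my recollection of the Hudi configuration reference, not something stated in this thread, so please verify them against the Hudi version you run.

```python
# Hypothetical helper: maps the workload shapes described above to Hudi
# write options. Option keys/values are assumptions recalled from the Hudi
# configuration docs; check them against your Hudi version before use.

def index_options(workload: str, dynamic_bloom: bool = False) -> dict:
    """Return a dict of Hudi index options for a (made-up) workload label."""
    if workload == "ordered_keys":
        # Ordered keys (e.g. timestamp prefix): bloom index + range pruning.
        opts = {
            "hoodie.index.type": "BLOOM",
            "hoodie.bloom.index.prune.by.ranges": "true",
        }
    elif workload == "unordered_keys_recent_updates":
        # Unordered keys (e.g. UUIDs), mostly recent partitions updated:
        # bloom index, but skip range pruning since it cannot help.
        opts = {
            "hoodie.index.type": "BLOOM",
            "hoodie.bloom.index.prune.by.ranges": "false",
        }
    elif workload == "random_updates":
        # Every commit touches almost every file: join against the table.
        return {"hoodie.index.type": "SIMPLE"}
    elif workload == "random_updates_with_hbase":
        # Same pattern, with an external HBase cluster available.
        # (HBase connection settings are omitted here.)
        return {"hoodie.index.type": "HBASE"}
    else:
        raise ValueError(f"unknown workload: {workload!r}")

    if dynamic_bloom:
        # Assumed value for the dynamic bloom filter mentioned above.
        opts["hoodie.bloom.index.filter.type"] = "DYNAMIC_V0"
    return opts
```

With PySpark, the result could then be passed to the writer along with your other Hudi options, e.g. `df.write.format("hudi").options(**index_options("ordered_keys"))`.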
   
   
   

