Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/11/02 17:47:31 UTC

[GitHub] [incubator-hudi] vinothchandar commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

vinothchandar commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-549065995
 
 
   @nsivabalan Here is the problem as I see it, w.r.t. bounding the size. Currently we have a low default of 60K entries, which comes out to reading ~400 KB from the parquet footer; not too shabby an overhead. My understanding is that the parquet footers are all read at once, and even query engines would read the footer. So if we don't bound the size of the dynamic bloom filter to, say, 1 MB or so, queries can pay a penalty? (I don't know how big this would be or if it's okay.) But we won't offer the user the choice to make tradeoffs. IIUC we need our own impl of dynamic bloom if we were to limit the size, correct? How doable is that?
   
   >I am trying to find ways to test the FP ratio. Not sure how would you test that.
   The way I have done it in the past is to generate a lot of keys and hold them in two lists: `added` and `notAdded`. I add the ones from `added` to the bloom filter and then check for false positives using the `notAdded` list. The % of `notAdded` keys that get a hit is your FP ratio. For this impl, we need to ensure that the FP ratio remains the same even as you increase the size of the added/notAdded lists.
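
A minimal sketch of that added/notAdded check, in Python for brevity. `SimpleBloomFilter` and `measure_fp_ratio` are hypothetical names for illustration; this is a toy fixed-size filter, not Hudi's implementation or the dynamic filter from the PR. It also uses a 1% target error rate, since measuring something like `10^-9` empirically would need billions of lookups.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Toy fixed-size Bloom filter (illustrative only, not Hudi's class)."""

    def __init__(self, expected_entries, error_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_entries * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(
            self.num_bits / expected_entries * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit hash values.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def measure_fp_ratio(bloom, num_keys):
    """Add `num_keys` keys, probe with `num_keys` disjoint keys, return hit fraction."""
    added = [f"added-{i}" for i in range(num_keys)]
    not_added = [f"not-added-{i}" for i in range(num_keys)]
    for key in added:
        bloom.add(key)
    # A Bloom filter must never report a false negative.
    assert all(bloom.might_contain(key) for key in added)
    hits = sum(1 for key in not_added if bloom.might_contain(key))
    return hits / len(not_added)


if __name__ == "__main__":
    target = 0.01
    for n in (1_000, 10_000, 50_000):
        ratio = measure_fp_ratio(SimpleBloomFilter(n, target), n)
        print(f"n={n}: measured FP ratio {ratio:.4f} (target {target})")
```

Here the filter is re-sized for each `n`; for a dynamic filter, the same loop would instead keep the initial configuration fixed and check that the measured ratio stays near the target as `n` grows.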
   
   Other small points:
   - if we don't have a way to configure the bloom type to use, we should add one
   - We should consider if the default error rate here should be `10^-9`. Relaxing it will also help reduce the size. We already have techniques like range pruning to reduce the number of comparisons. Assuming even a large 100M entries inserted into a partition, if the bloom filter had `10^-8`, it might be enough to prevent false positives, right? I guess this will drop the storage needed considerably?
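
To put rough numbers on the size question, here is a back-of-the-envelope sketch using the textbook sizing formula for a standard Bloom filter, `m = -n * ln(p) / (ln 2)^2` bits. The helper name `bloom_size_bytes` is hypothetical, and the result is the raw bit array only; any serialization or encoding overhead in the parquet footer is not modeled.

```python
import math


def bloom_size_bytes(num_entries, error_rate):
    """Raw size in bytes of an optimally sized standard Bloom filter."""
    bits = math.ceil(-num_entries * math.log(error_rate) / math.log(2) ** 2)
    return math.ceil(bits / 8)


if __name__ == "__main__":
    for rate in (1e-9, 1e-8):
        size = bloom_size_bytes(60_000, rate)
        print(f"error rate {rate:g}: ~{size / 1024:.0f} KiB raw for 60K entries")
```

Under this formula the size scales with `log(1/p)`, so going from `10^-9` to `10^-8` saves one ninth (about 11%) of the bits for the same number of entries. At `10^-8`, 100M lookups of absent keys would yield about one false positive in expectation.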
   
   
   
   
   
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services