Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2019/10/26 03:56:00 UTC

[jira] [Comment Edited] (HUDI-106) Dynamically tune bloom filter entries

    [ https://issues.apache.org/jira/browse/HUDI-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960255#comment-16960255 ] 

sivabalan narayanan edited comment on HUDI-106 at 10/26/19 3:55 AM:
--------------------------------------------------------------------

Sure, Vinoth. I have yet to get my hands dirty with the code base, but I have a naive question.

Based on my reading of the concepts, during compaction we know the total number of entries for a given file group. In that case, why can't we create a regular bloom filter of the right size rather than using a hard-coded (or config-based) value? I'm wondering whether DynamicBF is really necessary here. Or is this mainly meant for the delta logs and not Parquet?
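
A minimal sketch of the sizing idea above, assuming the standard Bloom filter formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The function names and the numbers are hypothetical illustrations, not Hudi APIs or Hudi defaults.

```python
import math


def bloom_parameters(num_entries, target_fpp):
    """Return (num_bits, num_hashes) for a standard Bloom filter.

    Hypothetical helper: sizes the filter from a known entry count,
    e.g. the record count available at compaction time.
    """
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    if not 0.0 < target_fpp < 1.0:
        raise ValueError("target_fpp must be in (0, 1)")
    # m = -n * ln(p) / (ln 2)^2
    num_bits = math.ceil(-num_entries * math.log(target_fpp) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest whole hash function
    num_hashes = max(1, round(num_bits / num_entries * math.log(2)))
    return num_bits, num_hashes


def expected_fpp(num_bits, num_hashes, num_entries):
    """Approximate false positive probability after num_entries inserts."""
    return (1.0 - math.exp(-num_hashes * num_entries / num_bits)) ** num_hashes


if __name__ == "__main__":
    # Filter sized for the actual entry count of a file.
    bits, hashes = bloom_parameters(1_000_000, 0.01)
    print(bits, hashes, expected_fpp(bits, hashes, 1_000_000))

    # Filter sized from a fixed config value that is too small for the file.
    small_bits, small_hashes = bloom_parameters(60_000, 0.01)
    print(expected_fpp(small_bits, small_hashes, 1_000_000))
```

The second case shows why a fixed configured entry count hurts: once far more keys are inserted than the filter was sized for, the false positive rate approaches 1 and the filter stops pruning anything.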


was (Author: shivnarayan):
sure vinoth. I am yet to starting my hands dirty with code base. But a naive question. 

Based on my reading of concepts, during compaction we know the total number of entries for a given file group. So in that case, why can't we create a regular bloom filter with the right size rather than using hard coded(or config based) value. Wondering is DynamicBF is really necessary here. Or this is mainly catered towards the delta logs and not parquet? 

> Dynamically tune bloom filter entries
> -------------------------------------
>
>                 Key: HUDI-106
>                 URL: https://issues.apache.org/jira/browse/HUDI-106
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Vinoth Chandar
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: realtime-data-lakes
>             Fix For: 0.5.1
>
>
> Bloom filter tuning is currently driven by configuration, which can be cumbersome to adjust per dataset to obtain good indexing performance. Let's add support for Dynamic Bloom Filters, which can automatically achieve a configured false positive ratio depending on the number of entries. 
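
For readers unfamiliar with the idea, here is a toy sketch of one common way a dynamic Bloom filter can work: keep a list of fixed-size filters and start a new one whenever the current one has absorbed the number of keys it was sized for. This is an assumption about the general technique, not the implementation proposed in this issue; the class names `SimpleBloom` and `GrowingBloom` are hypothetical.

```python
import hashlib
import math


class SimpleBloom:
    """Fixed-size Bloom filter sized for `capacity` keys at rate `fpp`."""

    def __init__(self, capacity, fpp):
        self.num_bits = math.ceil(-capacity * math.log(fpp) / (math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class GrowingBloom:
    """Adds a fresh SimpleBloom once the current one reaches its capacity."""

    def __init__(self, capacity_per_filter, fpp):
        self.capacity = capacity_per_filter
        self.fpp = fpp
        self.filters = [SimpleBloom(capacity_per_filter, fpp)]

    def add(self, key):
        if self.filters[-1].count >= self.capacity:
            self.filters.append(SimpleBloom(self.capacity, self.fpp))
        self.filters[-1].add(key)

    def __contains__(self, key):
        # A key may be present if any of the underlying filters reports it.
        return any(key in f for f in self.filters)


if __name__ == "__main__":
    bloom = GrowingBloom(capacity_per_filter=100, fpp=0.01)
    for i in range(250):
        bloom.add(f"record-key-{i}")
    print(len(bloom.filters), "record-key-42" in bloom)
```

Note the trade-off in this sketch: each underlying filter stays near its target rate, but a lookup consults every filter, so with s filters the overall false positive rate is roughly 1 - (1 - p)^s rather than exactly p. It degrades gradually instead of collapsing as an undersized fixed filter would.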


