Posted to dev@kylin.apache.org by "Zhiting Guo (Jira)" <ji...@apache.org> on 2023/07/18 01:35:00 UTC
[jira] [Created] (KYLIN-5640) Support to automatically adjust the Bloom Filter based on data distribution
Zhiting Guo created KYLIN-5640:
----------------------------------
Summary: Support to automatically adjust the Bloom Filter based on data distribution
Key: KYLIN-5640
URL: https://issues.apache.org/jira/browse/KYLIN-5640
Project: Kylin
Issue Type: Improvement
Components: Query Engine
Affects Versions: 5.0-alpha
Reporter: Zhiting Guo
Fix For: 5.0-alpha
h3. Why are the changes needed?
Currently, a Bloom filter is built from a user-specified NDV (number of distinct values). In most scenarios, however, the actual number of distinct values is not known in advance.
If the Bloom filter can be sized automatically according to the data, the file size can be reduced and read efficiency can be improved.
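To illustrate why the NDV setting matters for file size, here is a minimal sketch that uses the textbook sizing formula for a classic Bloom filter, m = -n * ln(p) / (ln 2)^2. The class name {{BloomSizing}}, the method {{optimalBits}} and the 1% false-positive rate are assumptions made for this example. The block-split variant has its own sizing rule, which this sketch does not reproduce.

```java
public class BloomSizing {
    // Textbook bit count for a classic Bloom filter holding `ndv` distinct
    // values at false-positive probability `fpp` (illustrative only).
    static long optimalBits(long ndv, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-ndv * Math.log(fpp) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        double fpp = 0.01; // assumed target false-positive rate
        // An NDV guess that is 1000x too high costs roughly 1000x the space.
        for (long ndv : new long[] {1_000L, 1_000_000L}) {
            System.out.printf("ndv=%d -> about %d bytes%n", ndv, optimalBits(ndv, fpp) / 8);
        }
    }
}
```

Because the size grows linearly with the NDV, an overestimated NDV wastes space in every file, while an underestimated one raises the false-positive rate.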
h3. What changes were proposed in this pull request?
{{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances of different sizes as candidates and inserts each value into all candidates at the same time. The largest Bloom filter is used as an approximate distinct-value counter, and candidates that cannot accommodate the observed number of distinct values are removed during data insertion.
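The idea described above can be sketched roughly as follows. This is not the code from the pull request: {{DynamicBloomSketch}}, {{Candidate}}, the 10-bits-per-value sizing, the 7 hash functions and the mixing function are all assumptions chosen for illustration, and a plain {{BitSet}}-backed filter stands in for {{BlockSplitBloomFilter}}.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class DynamicBloomSketch {
    /** Minimal classic Bloom filter, a hypothetical stand-in for BlockSplitBloomFilter. */
    static final class Candidate {
        static final int HASHES = 7;   // assumed number of hash functions
        final BitSet bits;
        final int numBits;
        final long capacity;           // distinct values this candidate is sized for

        Candidate(long capacity) {
            this.capacity = capacity;
            this.numBits = (int) (capacity * 10); // assumed ~10 bits per value
            this.bits = new BitSet(numBits);
        }

        int index(long hash, int i) {
            int h1 = (int) hash;
            int h2 = (int) (hash >>> 32) | 1;     // odd step for double hashing
            return Math.floorMod(h1 + i * h2, numBits);
        }

        // Returns true if at least one bit was newly set, i.e. the value
        // was definitely not inserted before.
        boolean insert(long hash) {
            boolean changed = false;
            for (int i = 0; i < HASHES; i++) {
                int idx = index(hash, i);
                if (!bits.get(idx)) { bits.set(idx); changed = true; }
            }
            return changed;
        }

        boolean mightContain(long hash) {
            for (int i = 0; i < HASHES; i++) {
                if (!bits.get(index(hash, i))) return false;
            }
            return true;
        }
    }

    private final List<Candidate> candidates = new ArrayList<>(); // ascending by size
    private long distinctEstimate = 0;

    /** Capacities must be given in ascending order. */
    DynamicBloomSketch(long... capacities) {
        for (long c : capacities) candidates.add(new Candidate(c));
    }

    // 64-bit mixer (splitmix64 finalizer) used as the hash in this sketch.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void insert(long value) {
        long hash = mix(value);
        Candidate largest = candidates.get(candidates.size() - 1);
        for (Candidate c : candidates) {
            boolean isNew = c.insert(hash);
            // The largest candidate doubles as an approximate distinct counter.
            if (c == largest && isNew) distinctEstimate++;
        }
        // Drop candidates that are too small; the largest is always kept.
        while (candidates.size() > 1 && candidates.get(0).capacity < distinctEstimate) {
            candidates.remove(0);
        }
    }

    boolean mightContain(long value) { return best().mightContain(mix(value)); }
    Candidate best() { return candidates.get(0); }
    long distinctEstimate() { return distinctEstimate; }

    public static void main(String[] args) {
        DynamicBloomSketch sketch = new DynamicBloomSketch(100, 1_000, 10_000, 100_000);
        for (long v = 0; v < 5_000; v++) sketch.insert(v);
        System.out.println("estimated distinct values: " + sketch.distinctEstimate());
        System.out.println("chosen candidate capacity: " + sketch.best().capacity);
    }
}
```

Every surviving candidate has received all values since the first insertion, so the smallest one that remains can be written out without rebuilding. The counter can only undercount (a false positive in the largest filter hides a new value), which is why the largest candidate is the natural choice for counting.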