Posted to dev@parquet.apache.org by "Mars (Jira)" <ji...@apache.org> on 2023/05/12 02:50:00 UTC

[jira] [Updated] (PARQUET-2254) Build a BloomFilter with a more precise size

     [ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mars updated PARQUET-2254:
--------------------------
    Description: 
*Why are the changes needed?*
Currently, a bloom filter is built by specifying either the NDV (number of distinct values) or the maximum number of bytes up front. In practice, the number of distinct values is usually not known in advance.
If the BloomFilter could be sized automatically from the data, the file size could be reduced and read efficiency improved.

*What changes were proposed in this pull request?*
`AdaptiveBlockSplitBloomFilter` holds multiple `BlockSplitBloomFilter` instances as candidates and inserts each value into
 all candidates at the same time. When the filter is written out, the smallest candidate is chosen.
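
The candidate approach can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: `CandidateFilter` is a simplified plain bloom filter standing in for `BlockSplitBloomFilter`, and the class names, the halving of candidate sizes, the hash function, and the 2-bytes-per-value capacity rule are all assumptions made for this example. It borrows one idea from the earlier description of this issue, namely using the largest candidate as an approximate distinct-value counter.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Simplified stand-in for one candidate; NOT Parquet's BlockSplitBloomFilter. */
class CandidateFilter {
    final int numBytes;      // size of the bit array in bytes
    final long expectedNdv;  // distinct values this size is assumed to handle
    private final int numBits;
    private final BitSet bits;

    CandidateFilter(int numBytes, long expectedNdv) {
        this.numBytes = numBytes;
        this.expectedNdv = expectedNdv;
        this.numBits = numBytes * 8;
        this.bits = new BitSet(numBits);
    }

    /** Sets three bits derived from the hash; returns true if any bit was newly set. */
    boolean insert(long hash) {
        boolean changed = false;
        for (int i = 0; i < 3; i++) {
            long mixed = hash + i * 0x9E3779B97F4A7C15L;
            mixed ^= mixed >>> 32;
            int pos = (int) Long.remainderUnsigned(mixed, numBits);
            if (!bits.get(pos)) {
                bits.set(pos);
                changed = true;
            }
        }
        return changed;
    }
}

/** Hypothetical sketch of the adaptive idea: fill all candidates, keep the smallest that fits. */
class AdaptiveBloomSketch {
    // Assumed sizing rule for this sketch only: 2 bytes (16 bits) per distinct value.
    private static final int BYTES_PER_VALUE = 2;
    private static final int MIN_BYTES = 32;

    private final List<CandidateFilter> candidates = new ArrayList<>(); // largest first
    private long approxDistinct = 0;

    AdaptiveBloomSketch(int maxBytes, int numCandidates) {
        if (maxBytes < MIN_BYTES || numCandidates < 1) {
            throw new IllegalArgumentException("need maxBytes >= 32 and at least one candidate");
        }
        // Candidate sizes halve from maxBytes downward, never exceeding maxBytes.
        int size = maxBytes;
        for (int i = 0; i < numCandidates && size >= MIN_BYTES; i++, size /= 2) {
            candidates.add(new CandidateFilter(size, size / BYTES_PER_VALUE));
        }
    }

    void insert(String value) {
        long hash = hash64(value);
        // The largest candidate doubles as an approximate distinct-value counter.
        if (candidates.get(0).insert(hash)) {
            approxDistinct++;
        }
        for (int i = 1; i < candidates.size(); i++) {
            candidates.get(i).insert(hash);
        }
    }

    long approxDistinct() {
        return approxDistinct;
    }

    /** Returns the smallest candidate whose assumed capacity covers the estimate, else the largest. */
    CandidateFilter choose() {
        CandidateFilter best = candidates.get(0);
        for (CandidateFilter c : candidates) {
            if (c.expectedNdv >= approxDistinct && c.numBytes < best.numBytes) {
                best = c;
            }
        }
        return best;
    }

    /** FNV-1a over the UTF-8 bytes, followed by a 64-bit finalizer to spread the bits. */
    static long hash64(String value) {
        long h = 0xcbf29ce484222325L;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }
}
```

With `new AdaptiveBloomSketch(1024, 5)` the sketch creates candidates of 1024, 512, 256, 128 and 64 bytes. Every value goes into all five, and `choose()` picks the smallest one whose assumed capacity covers the estimated distinct count, falling back to the largest when none does. The actual implementation may size, prune, and select candidates differently.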


*Does this PR introduce any user-facing change?*
Adds new configuration options:
`parquet.bloom.filter.adaptive.enabled`: default false. Whether to enable writing an adaptive bloom filter.  
If true, the bloom filter is generated with the optimal bit size for the actual number of distinct values in the data. If false, this feature has no effect.
Note that the bloom filter will never exceed the size set by `parquet.bloom.filter.max.bytes` (if that value is 
set too small, the generated bloom filter will not be efficient).

`parquet.bloom.filter.candidates.number`: default 5. The number of candidate bloom filters populated at the same time.  
When `parquet.bloom.filter.adaptive.enabled` is true, values are inserted into multiple candidate bloom filters 
at the same time; at the end, the candidate with the optimal bit size is selected and written to the file.
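
As a rough illustration of the two keys and their stated defaults, the sketch below reads them from a plain `java.util.Properties` object. The `AdaptiveBloomSettings` class is hypothetical and exists only for this example; Parquet's own writer reads these settings through its own configuration mechanism, which is not shown here.

```java
import java.util.Properties;

/** Hypothetical holder for the two new settings; not part of Parquet. */
class AdaptiveBloomSettings {
    static final String ADAPTIVE_ENABLED = "parquet.bloom.filter.adaptive.enabled";
    static final String CANDIDATES_NUMBER = "parquet.bloom.filter.candidates.number";

    final boolean adaptiveEnabled;
    final int candidatesNumber;

    AdaptiveBloomSettings(Properties props) {
        // Defaults follow the description: disabled, with 5 candidates.
        adaptiveEnabled = Boolean.parseBoolean(props.getProperty(ADAPTIVE_ENABLED, "false"));
        candidatesNumber = Integer.parseInt(props.getProperty(CANDIDATES_NUMBER, "5"));
    }
}
```

An empty `Properties` object yields the defaults (adaptive mode off, 5 candidates); setting `parquet.bloom.filter.adaptive.enabled` to `true` switches the adaptive behavior on.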

 

  was:
h3. Why are the changes needed?

Currently, a bloom filter is built by specifying the NDV (number of distinct values) up front. In practice, the number of distinct values is usually not known in advance.
If the BloomFilter could be sized automatically from the data, the file size could be reduced and read efficiency improved.
h3. What changes were proposed in this pull request?

{{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances as candidates and inserts each value into all candidates at the same time. The largest bloom filter serves as an approximate distinct-value counter, and candidates that can no longer hold the data are removed during insertion.


> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)