Posted to dev@parquet.apache.org by "Gang Wu (Jira)" <ji...@apache.org> on 2023/03/13 15:12:00 UTC

[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

    [ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699694#comment-17699694 ] 

Gang Wu commented on PARQUET-2256:
----------------------------------

Apache ORC supports compressing its bloom filters, so it would be nice if we could do a similar thing here.
However, I think there is a prerequisite (or at least a highly related issue): https://issues.apache.org/jira/browse/PARQUET-2257
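
As a minimal sketch of what the write side could look like under such a proposal, the serialized bitset would simply be run through a codec before being written. The snippet below uses java.util.zip.Deflater as a stand-in for whatever CompressionCodec the new field would select; the class and method names are illustrative, not actual parquet-mr or ORC API.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public final class BloomFilterCompressor {
  /**
   * Compresses a serialized bloom filter bitset with DEFLATE.
   * Deflater stands in here for whichever codec the proposed
   * BloomFilterCompression field would select; a real implementation
   * would dispatch on the chosen CompressionCodec instead.
   */
  public static byte[] compress(byte[] bitset) {
    Deflater deflater = new Deflater();
    deflater.setInput(bitset);
    deflater.finish();

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    while (!deflater.finished()) {
      int n = deflater.deflate(buffer);
      out.write(buffer, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }
}
```

The read path would do the inverse (e.g. Inflater for DEFLATE), keyed off the compression value recorded for the bloom filter.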

> Adding Compression for BloomFilter
> ----------------------------------
>
>                 Key: PARQUET-2256
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2256
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>    Affects Versions: format-2.9.0
>            Reporter: Xuwei Fu
>            Priority: Major
>
> In current Parquet implementations, if the ndv (number of distinct values) is not set for the BloomFilter, most implementations will guess 1M as the ndv and use it together with the fpp to size the filter. So, if fpp is 0.01, the BloomFilter may grow to about 2M per column, which is really large. Should we support compression for the BloomFilter, like:
>  
> ```
> /**
>  * The compression used in the Bloom filter.
>  */
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
>   2: CompressionCodec COMPRESSION;  // proposed new field
> }
> ```
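
For a back-of-the-envelope check of the size quoted above: the classic sizing formula m = -n * ln(p) / (ln 2)^2 gives roughly 9.6 million bits (about 1.2 MiB) for ndv = 1M and fpp = 0.01, and rounding the bitset up to a power of two, as some block-split implementations do, lands at 2 MiB per column. A small illustrative sketch of that arithmetic (not the exact parquet-mr sizing code):

```java
public final class BloomFilterSizeEstimate {
  /** Classic sizing: m = -n * ln(p) / (ln 2)^2 bits for ndv n and false positive rate p. */
  static long optimalBits(long ndv, double fpp) {
    return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  /** Round up to the next power of two, as block-split filters commonly do. */
  static long roundUpToPowerOfTwo(long bits) {
    long v = 1;
    while (v < bits) {
      v <<= 1;
    }
    return v;
  }

  public static void main(String[] args) {
    long bits = optimalBits(1_000_000, 0.01);  // ~9.6 million bits, ~1.2 MiB
    long rounded = roundUpToPowerOfTwo(bits);  // 16,777,216 bits = 2 MiB
    System.out.printf("raw: %.2f MiB, rounded to power of two: %.2f MiB%n",
        bits / 8.0 / (1 << 20), rounded / 8.0 / (1 << 20));
  }
}
```

Note that a filter sized for 1M distinct values but holding far fewer is mostly zero bits, which is exactly the case where compressing the bitset pays off; a filter operating near its design load is close to incompressible.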



--
This message was sent by Atlassian Jira
(v8.20.10#820010)