Posted to dev@parquet.apache.org by "Xuwei Fu (Jira)" <ji...@apache.org> on 2023/03/13 14:31:00 UTC

[jira] [Created] (PARQUET-2256) Adding Compression for BloomFilter

Xuwei Fu created PARQUET-2256:
---------------------------------

             Summary: Adding Compression for BloomFilter
                 Key: PARQUET-2256
                 URL: https://issues.apache.org/jira/browse/PARQUET-2256
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
    Affects Versions: format-2.9.0
            Reporter: Xuwei Fu


In current Parquet implementations, if the ndv (number of distinct values) is not set for a BloomFilter, most implementations assume an ndv of 1M and size the filter from that value and the fpp (false positive probability). So, if fpp is 0.01, the BloomFilter size may grow to 2 MB for each column, which is really huge. Should we support compression for BloomFilter, like:

 

```
/**
 * The compression used in the Bloom filter.
 **/
struct Uncompressed {}
union BloomFilterCompression {
  1: Uncompressed UNCOMPRESSED;
+2: CompressionCodec COMPRESSION;
}

```
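
To make the 2 MB figure concrete, here is a rough sketch of the sizing arithmetic. It assumes the usual split-block sizing formula, m = -8 * ndv / ln(1 - fpp^(1/8)) bits, rounded up to the next power of two; the function names are hypothetical, and real implementations also clamp the result to minimum and maximum sizes, which is left out here.

```python
import math


def optimal_num_of_bits(ndv: int, fpp: float) -> int:
    """Estimate the bitset size (in bits) for a split-block Bloom filter.

    Assumption: m = -8 * ndv / ln(1 - fpp^(1/8)), rounded up to a power of two.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be in (0, 1)")
    raw_bits = math.ceil(-8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0)))
    # Round up to the next power of two using integer arithmetic.
    return 1 << (raw_bits - 1).bit_length()


def optimal_num_of_bytes(ndv: int, fpp: float) -> int:
    """Same estimate as above, expressed in bytes."""
    return optimal_num_of_bits(ndv, fpp) // 8


# Default guess of 1M distinct values with fpp = 0.01.
print(optimal_num_of_bytes(1_000_000, 0.01))  # 2097152
```

The raw estimate is roughly 9.7M bits; rounding up to 2^24 bits is what brings the filter to 2 MiB per column.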


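As a rough illustration of why compression could help, the sketch below builds a 2 MiB bitset that holds far fewer values than it was sized for and compresses it. Everything here is an assumption for illustration: the bit positions are random rather than following the real split-block layout, the value count is made up, and zlib merely stands in for whichever CompressionCodec would be chosen.

```python
import random
import zlib

FILTER_BYTES = 2 * 1024 * 1024  # filter sized for the guessed ndv of ~1M
ACTUAL_VALUES = 1_000           # hypothetical: the column holds far fewer values
BITS_PER_VALUE = 8              # assumption: each inserted value sets 8 bits


def build_sparse_filter(seed: int = 0) -> bytes:
    """Build an oversized, mostly empty bitset (not the real split-block layout)."""
    rng = random.Random(seed)
    bitset = bytearray(FILTER_BYTES)
    for _ in range(ACTUAL_VALUES * BITS_PER_VALUE):
        bit = rng.randrange(FILTER_BYTES * 8)
        bitset[bit >> 3] |= 1 << (bit & 7)
    return bytes(bitset)


raw = build_sparse_filter()
compressed = zlib.compress(raw, 6)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes")
```

The gain depends on how sparse the bitset is. A filter filled close to its design ndv has a large fraction of its bits set and would be expected to compress poorly, so the benefit is mostly for filters that were oversized by the default guess.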

--
This message was sent by Atlassian Jira
(v8.20.10#820010)