Posted to commits@parquet.apache.org by bl...@apache.org on 2019/08/26 23:27:37 UTC

[parquet-format] branch master updated: PARQUET-1630: Update Bloom filter format (#146)

This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 3fb10e0  PARQUET-1630: Update Bloom filter format (#146)
3fb10e0 is described below

commit 3fb10e00c2204bf1c6cc91e094c59e84cefcee33
Author: Chen, Junjie <ji...@tencent.com>
AuthorDate: Tue Aug 27 07:27:32 2019 +0800

    PARQUET-1630: Update Bloom filter format (#146)
---
 BloomFilter.md                        |  18 ++++++++++++++----
 doc/images/FileLayoutBloomFilter1.png | Bin 0 -> 44025 bytes
 doc/images/FileLayoutBloomFilter2.png | Bin 0 -> 34018 bytes
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/BloomFilter.md b/BloomFilter.md
index b8208c8..2fa24e9 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -264,10 +264,13 @@ false positive rates:
 |                       41   |  0.001 %                   |
 
 #### File Format
-The Bloom filter data of a column chunk, which contains the size of the filter in bytes, the
-algorithm, the hash function and the Bloom filter bitset, is stored near the footer. The Bloom
-filter data offset is stored in column chunk metadata. Here are Bloom filter definitions in
-thrift:
+
+Each multi-block Bloom filter applies to exactly one column chunk. The data of a multi-block
+Bloom filter consists of the Bloom filter header followed by the Bloom filter bitset. The Bloom filter
+header encodes the size of the bitset in bytes, which a reader uses to read the bitset.
+
+Here are the Bloom filter definitions in thrift:
+
 
 ```
 /** Block-based algorithm type annotation. **/
@@ -323,6 +326,13 @@ struct ColumnMetaData {
 
 ```
 
+The Bloom filters are grouped by row group, with the data for each column stored in the same order as the file schema.
+The Bloom filter data can be stored after all row groups and before the page indexes. The file layout looks like:
+ ![File Layout - Bloom filter footer](doc/images/FileLayoutBloomFilter2.png)
+
+Alternatively, it can be stored between row groups. In that case the file layout looks like:
+ ![File Layout - Bloom filter footer](doc/images/FileLayoutBloomFilter1.png)
+
 #### Encryption
 In the case of columns with sensitive data, the Bloom filter exposes a subset of sensitive
 information such as the presence of value. Therefore the Bloom filter of columns with sensitive
diff --git a/doc/images/FileLayoutBloomFilter1.png b/doc/images/FileLayoutBloomFilter1.png
new file mode 100644
index 0000000..3b21738
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter1.png differ
diff --git a/doc/images/FileLayoutBloomFilter2.png b/doc/images/FileLayoutBloomFilter2.png
new file mode 100755
index 0000000..6bbf770
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter2.png differ
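
Note on the layout described in the BloomFilter.md hunk above: the following is a minimal sketch, not part of the commit, of how a writer might assign file offsets when the Bloom filter data is stored after all row groups. The names `FilterBlob` and `assign_offsets` are hypothetical, and the header and bitset lengths are taken as given rather than derived from the Thrift encoding. It also assumes that a column chunk may have no Bloom filter at all, modelled here as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FilterBlob:
    """Sizes in bytes of one column chunk's Bloom filter (hypothetical type)."""
    header_len: int  # length of the serialized Bloom filter header
    bitset_len: int  # length of the bitset, as recorded in the header


def assign_offsets(
    start: int, row_groups: List[List[Optional[FilterBlob]]]
) -> Tuple[List[List[Optional[int]]], int]:
    """Lay filters out grouped by row group, columns in schema order.

    row_groups[i][j] is the filter for column j of row group i, or None
    when that column chunk has no filter (an assumption of this sketch).
    Returns the per-chunk start offsets and the offset just past the
    last filter.
    """
    pos = start
    offsets: List[List[Optional[int]]] = []
    for row_group in row_groups:
        group_offsets: List[Optional[int]] = []
        for blob in row_group:
            if blob is None:
                group_offsets.append(None)
                continue
            group_offsets.append(pos)
            # The header is immediately followed by the bitset.
            pos += blob.header_len + blob.bitset_len
        offsets.append(group_offsets)
    return offsets, pos


if __name__ == "__main__":
    groups = [
        [FilterBlob(20, 1024), None, FilterBlob(20, 32)],
        [FilterBlob(20, 1024), None, FilterBlob(20, 32)],
    ]
    print(assign_offsets(1000, groups))
```

Each offset computed this way is the kind of value a writer would record in the column chunk metadata; for the between-row-groups layout the same per-row-group loop would apply, with `pos` restarting after each row group's data.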