You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/09/26 02:02:00 UTC

[jira] [Commented] (PARQUET-319) Define the parquet bloom filter statistics in parquet format

    [ https://issues.apache.org/jira/browse/PARQUET-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628158#comment-16628158 ] 

ASF GitHub Bot commented on PARQUET-319:
----------------------------------------

winningsix closed pull request #28: PARQUET-319: Define the parquet bloom filter statistics in parquet format
URL: https://github.com/apache/parquet-format/pull/28
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.gitignore b/.gitignore
index cb047215..094f7bde 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,3 +3,5 @@ target
 dependency-reduced-pom.xml
 .classpath
 .project
+.idea
+*.iml
diff --git a/Statistics.md b/Statistics.md
new file mode 100644
index 00000000..f7e76782
--- /dev/null
+++ b/Statistics.md
@@ -0,0 +1,71 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+Parquet statistics definitions
+===
+This file contains the specification of all supported encodings.
+
+### Bloom filter statistics
+
+Supported Types: INT32, INT64, INT96, BINARY, FLOAT, DOUBLE, BYTE_ARRAY
+
+The bloom filter is stored per row group which is used to filter some blocks when reading the parquet file.
+
+#### Background for bloom filter
+In short, bloom filter maintains a bit set which is set to 1 for all bits. Once one entry added to the filter, it will use the hash functions to set several bits to 1. Using thge same hash functions,
+we can see if a new entry is already included checking all of the related bits value. If all of them are set to 1, it means this entry is already added, otherwise not.
+In this way we can use the bloom filter to skip some data in block level for parquet. From the Wikipedia, the definition of bloom filter is follown as:
+```An empty Bloom filter is a bit array of m bits, all set to 0.
+There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.```
+To construct a bloom filter, two parameters are required:
+* _m_: Number of bits
+* _k_: Number of hash function
+These two parameters could be approximated by two other parameters (The formulas are available in the [Wikipedia](https://en.wikipedia.org/wiki/Bloom_filter)):
+
+* _f_: abbreviation for false positive probability
+* _n_: the number of elements stored in bloom filter
+
+### The structure of bloom filter in parquet-format
+In the statistics, we have the following information for bloom filter.
+```
+list<i64> bitSet: the bit set storing the generated bit values using added entries
+i32 numBits; number of bit which is evaluated in the way discussed in the previous section
+i32 numHashFunctions; number of hash functions to generate the bit from the encountering entries
+```
+
+#### Implementation of the bloom filter
+Currently two bloom filter strategies are supported. One is using two 32-bit hash functions and another is using all 128 bits of Murmur3 128 when hashing. The first one is mentioned in "Less Hashing, Same Performance: Building a Better Bloom Filter" by Kirsch et.al. From abstract 'only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability' Lets split up 64-bit hashcode into two 32-bit hash codes and employ the technique mentioned in the above paper
+
+MurmurHash3 yields a 32-bit or 128-bit hash value. When using 128-bits, the x86 and x64 versions do not produce the same values, as the algorithms are optimized for their respective platforms according to [WIKIPEDIA](https://en.wikipedia.org/wiki/MurmurHash).
+
+#### Build a bloom filter
+To build a bloom filter, the number of expected entries number and FPP value are required.
+FPP means false positive probability and the expected entries number means the number of values stored in bloom filter.
+
+#### How BloomFilter works in Parquet
+The bloom filter is part of the statistics like Min/Max Statistics.
+For EQ operation, it will be checked if one entry is included in the row group.
+If the number isn't within the range or not include in the bloom filter, it will be dropped.
+
+#### Examples
+Assume that 12, 14, 16 and 17 in one block. And in case of executing the statement `SELECT * FROM tble_name WHERE id > 11 AND id < 17` on the table `tble_name` defined in the structure `(id Int, value Int)`.
+For Min/Max statistics, this block will not be skipped in the reader initial stage since it accepts the Statistics filter.
+With bloom filter enabled, this block will be dropped since the bit set stored in the bloom stats doesn't have all bits 16 required.
+
+
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 47812abc..d50d8fe7 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -208,6 +208,30 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+enum BloomFilterStrategy{
+  /**
+   * See "Less Hashing, Same Performance: Building a Better Bloom Filter" by Adam Kirsch and
+   * Michael Mitzenmacher. The paper argues that this trick doesn't significantly deteriorate the
+   * performance of a Bloom filter (yet only needs two 32bit hash functions).
+   */
+  MURMUR128_32 = 0;
+
+  /**
+   * This strategy uses all 128 bits of {@link Hashing#murmur3_128} when hashing. It looks
+   * different than the implementation in MURMUR128_MITZ_32 because we're avoiding the
+   * multiplication in the loop and doing a (much simpler) += hash2. We're also changing the
+   * index to a positive number by AND'ing with Long.MAX_VALUE instead of flipping the bits.
+   */
+  MURMUR128_64 = 1;
+}
+
+struct BloomFilter{
+    1: required binary bitSet;
+    2: required i32 numBits;
+    3: required i32 numHashFunctions;
+    4: required BloomFilterStrategy bloomFilterStrategy;
+}
+
 /**
  * Statistics per row group and per page
  * All fields are optional.
@@ -240,6 +264,7 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   7: optional BloomFilter bloom_filter;
 }
 
 /**


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Define the parquet bloom filter statistics in parquet format
> ------------------------------------------------------------
>
>                 Key: PARQUET-319
>                 URL: https://issues.apache.org/jira/browse/PARQUET-319
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-format
>            Reporter: Ferdinand Xu
>            Assignee: Ferdinand Xu
>            Priority: Major
>              Labels: pull-request-available
>
> As discussed in Parquet-41, we should define the bloom filter in binary level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)