You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/03/21 04:59:23 UTC

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1043: PARQUET-2260 Bloom filter size shouldn't be larger than maxBytes in the configuration

wgtmac commented on code in PR #1043:
URL: https://github.com/apache/parquet-mr/pull/1043#discussion_r1142882212


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -85,6 +85,22 @@ public class BlockSplitBloomFilter implements BloomFilter {
   private static final int[] SALT = {0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
     0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31};
 
+  /**
+   * Create a Bloom filter within the maximum bytes size. The bloom filter bytes size
+   * will be a largest power of 2 less than the maximum bytes size.
+   *
+   * @param maxBytes The max number of bytes for Bloom filter.
+   * @return
+   */
+  public static BlockSplitBloomFilter of(int maxBytes) {

Review Comment:
   Why adding a new function but not modify the constructor directly? It still has the risk if user uses the old constructor.



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -218,6 +234,19 @@ private void initBitset(int numBytes) {
     this.intBuffer = ByteBuffer.wrap(bitset).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer();
   }
 
+  /**
+   * Calculate the largest power of 2 in the range [lowerBound,upperBound].
+   */
+  public static int largestTwoPowerInRange(int lowerBound, int upperBound) {

Review Comment:
   ```suggestion
     private static int largestPowerOfTwoInRange(int lowerBound, int upperBound) {
   ```



##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -97,7 +97,7 @@ abstract class ColumnWriterBase implements ColumnWriter {
       int optimalNumOfBits = BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
       this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, maxBloomFilterSize);
     } else {
-      this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize);
+      this.bloomFilter = BlockSplitBloomFilter.of(maxBloomFilterSize);

Review Comment:
   Does this change solve the problem?



##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -97,7 +97,7 @@ abstract class ColumnWriterBase implements ColumnWriter {
       int optimalNumOfBits = BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
       this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, maxBloomFilterSize);
     } else {
-      this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize);
+      this.bloomFilter = BlockSplitBloomFilter.of(maxBloomFilterSize);

Review Comment:
   I found that there is a constructor doing the similar thing:
   ```java
     /**
      * Constructor of block-based Bloom filter.
      *
      * @param numBytes The number of bytes for Bloom filter bitset. The range of num_bytes should be within
      *                 [DEFAULT_MINIMUM_BYTES, maximumBytes], it will be rounded up/down
      *                 to lower/upper bound if num_bytes is out of range. It will also be rounded up to a power
      *                 of 2. It uses XXH64 as its default hash function.
      * @param maximumBytes The maximum bytes of the Bloom filter.
      */
     public BlockSplitBloomFilter(int numBytes, int maximumBytes) {
       this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64);
     }
   ```



##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -97,7 +97,7 @@ abstract class ColumnWriterBase implements ColumnWriter {
       int optimalNumOfBits = BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
       this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, maxBloomFilterSize);
     } else {
-      this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize);
+      this.bloomFilter = BlockSplitBloomFilter.of(maxBloomFilterSize);

Review Comment:
   ```suggestion
         this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize, maxBloomFilterSize);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org