You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2023/01/14 06:34:27 UTC

[GitHub] [parquet-mr] yabola opened a new pull request, #1020: PARQUET-2226 Support merge two Bloom Filters

yabola opened a new pull request, #1020:
URL: https://github.com/apache/parquet-mr/pull/1020

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

wgtmac commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591970


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -398,18 +398,21 @@ public long hash(Binary value) {
 
   @Override
   public void merge(BloomFilter otherBloomFilter) throws IOException {
-    Preconditions.checkArgument((otherBloomFilter.getAlgorithm() == getAlgorithm()),
-      "BloomFilter algorithm should be same");
-    Preconditions.checkArgument((otherBloomFilter.getHashStrategy() == getHashStrategy()),
-      "BloomFilter hashStrategy should be same");
-    Preconditions.checkArgument((otherBloomFilter.getBitsetSize() == getBitsetSize()),
-      "BloomFilter bitset size should be same");
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),

Review Comment:
   nit: you don't have to write String.format explicitly
   ```
       Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),
               "BloomFilters must have the same algorithm (%s != %s)",
                       getAlgorithm(), otherBloomFilter.getAlgorithm());
   ```



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),
+      String.format("BloomFilters must have the same algorithm (%s != %s)",
+        getAlgorithm(), otherBloomFilter.getAlgorithm()));
+    Preconditions.checkArgument((getHashStrategy() == otherBloomFilter.getHashStrategy()),
+      String.format("BloomFilters must have the same hashStrategy (%s != %s)",
+        getHashStrategy(), otherBloomFilter.getHashStrategy()));
+    Preconditions.checkArgument((getBitsetSize() == otherBloomFilter.getBitsetSize()),
+      String.format("BloomFilters must have the same size of bitsets (%s != %s)",
+        getBitsetSize(), otherBloomFilter.getBitsetSize()));
+    ByteArrayOutputStream otherOutputStream = new ByteArrayOutputStream();
+    otherBloomFilter.writeTo(otherOutputStream);
+    byte[] otherBits = otherOutputStream.toByteArray();

Review Comment:
   should we check the equality of bitset.length and otherBits.length?



##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   In this case, I'd suggest not use random values.



##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   Actually it does not matter what the data type is. We can simplify the test by writing two lists of hash values to create two BFs. Then we are pretty sure the test result of each value. What do you think?



##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,83 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeEmptyBloomFilter() throws IOException {

Review Comment:
   Probably we need a test to expect throwing exception when two BFs are not compatible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] gszadovszky merged pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

gszadovszky merged PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070637029


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),
+      String.format("BloomFilters must have the same algorithm (%s != %s)",
+        getAlgorithm(), otherBloomFilter.getAlgorithm()));
+    Preconditions.checkArgument((getHashStrategy() == otherBloomFilter.getHashStrategy()),
+      String.format("BloomFilters must have the same hashStrategy (%s != %s)",
+        getHashStrategy(), otherBloomFilter.getHashStrategy()));
+    Preconditions.checkArgument((getBitsetSize() == otherBloomFilter.getBitsetSize()),
+      String.format("BloomFilters must have the same size of bitsets (%s != %s)",
+        getBitsetSize(), otherBloomFilter.getBitsetSize()));
+    ByteArrayOutputStream otherOutputStream = new ByteArrayOutputStream();
+    otherBloomFilter.writeTo(otherOutputStream);
+    byte[] otherBits = otherOutputStream.toByteArray();

Review Comment:
   I had checked the `getBitsetSize` before, so we may don't need to checke the bitset.length ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070590290


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,21 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument((otherBloomFilter.getAlgorithm() == getAlgorithm()),

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383715909

   @wgtmac Thank you for your detailed review and @gszadovszky help.
   My jira id is miracle


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070590238


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();

Review Comment:
   I have splitted different types with different parameter, I am not sure if is suitable.



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java:
##########
@@ -176,4 +176,13 @@ public String toString() {
    * @return compress algorithm that the bloom filter apply
    */
   Compression getCompression();
+
+  /**
+   * Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying bits
+   *
+   * @param otherBloomFilter The Bloom filter to combine this Bloom filter with.
+   */
+  default void merge(BloomFilter otherBloomFilter) throws IOException {
+    throw new UnsupportedOperationException("Not supported merge operation.");

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070637510


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   > Actually it does not matter what the data type is. We can simplify the test by writing two lists of hash values to create two BFs. Then we are pretty sure the test result of each value. What do you think?
   
   Good idea, I removed the random number logic~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] gszadovszky commented on pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1382737603

   One more thing, @yabola. The compatibility tests fail because you have added a new method to a public interface. Even though this interface is not supposed to be implemented by our clients the current project structure does not allow filtering these classes out from the compat checks. The easiest way would be to add a default implementation to the interface which throws an exception (not implemented).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] shangxinli commented on pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

shangxinli commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383006013

   @chenjunjiedada Do you still have time to review this change? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] gszadovszky commented on pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383783806

   Sure. :)
   Please double-check the jira if I assigned it to the correct one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

wgtmac commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070853151


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),
+      "BloomFilters must have the same algorithm (%s != %s)",
+        getAlgorithm(), otherBloomFilter.getAlgorithm());
+    Preconditions.checkArgument((getHashStrategy() == otherBloomFilter.getHashStrategy()),
+      "BloomFilters must have the same hashStrategy (%s != %s)",
+        getHashStrategy(), otherBloomFilter.getHashStrategy());
+    Preconditions.checkArgument((getBitsetSize() == otherBloomFilter.getBitsetSize()),

Review Comment:
   Maybe we should put this check to the 1st place to make it fail fast. Usually parameters above do not change.



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {

Review Comment:
   I still think it is helpful to extract the check logic to a separate method like `boolean canMergeFrom(BloomFilter other)` because user has to check these parameters anyway before calling the `merge` method. Then `merge` implementation can simply call `canMergeFrom`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591022


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   If the BloomFilter to be merged is not empty, there is a small probability that the two BloomFilters will be inconsistent when judging whether there is a hash value (I added random value).
   
   But if the BloomFilter to be merged is empty in the beginning, the result from these two BloomFilter should be always the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

wgtmac commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070546933


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   Can you add some cases like below to have more coverage:
   - value that is tested false on the original BF but true on the other side, and vice versa?
   - value that is tested false on both sides.
   - value that is tested true on both sides.



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,21 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument((otherBloomFilter.getAlgorithm() == getAlgorithm()),

Review Comment:
   It would be more user-friendly to print the mismatched values on both sides.



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java:
##########
@@ -176,4 +176,13 @@ public String toString() {
    * @return compress algorithm that the bloom filter apply
    */
   Compression getCompression();
+
+  /**
+   * Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying bits
+   *
+   * @param otherBloomFilter The Bloom filter to combine this Bloom filter with.
+   */
+  default void merge(BloomFilter otherBloomFilter) throws IOException {
+    throw new UnsupportedOperationException("Not supported merge operation.");

Review Comment:
   ```suggestion
       throw new UnsupportedOperationException("Merge is not implemented.");
   ```



##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();

Review Comment:
   Better to split different types into separate test cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070637510


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   > Actually it does not matter what the data type is. We can simplify the test by writing two lists of hash values to create two BFs. Then we are pretty sure the test result of each value. What do you think?
   
   I removed the random number logic and change~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070267028


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java:
##########
@@ -176,4 +176,10 @@ public String toString() {
    * @return compress algorithm that the bloom filter apply
    */
   Compression getCompression();
+
+  /**
+   * Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data
+   * @param otherBloomFilter The Bloom filter to combine this Bloom filter with.
+   */
+  void putAll(BloomFilter otherBloomFilter) throws IOException;

Review Comment:
   > IMHO, user needs to know if two BFs are compatible to merge before calling this function. So an utility function to test compatibility of two BFs is also required. WDYT?
   
   I had added some check before merging bloomFilter. I see that guava does not seem to have additional utility function.If necessary, I can add an utility function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070267080


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,21 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void putAll(BloomFilter otherBloomFilter) throws IOException {

Review Comment:
   > Could you add some tests to verify if the merged BF is as expected?
   
   Sure, I will add some UT later



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591022


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   Because I added  random value, if the BloomFilter to be merged is not empty, there is a small probability that the two BloomFilters will be inconsistent when judging whether there is a hash value.
   
   But if the BloomFilter to be merged is empty in the beginning, the result from these two BloomFilter should be always the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591022


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   If the BloomFilter to be merged is not empty, there is a small probability that the two BloomFilters will be inconsistent when judging whether there is a hash value.
   
   But if the BloomFilter to be merged is empty in the beginning, the result from these two BloomFilter should be always the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070638035


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,83 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeEmptyBloomFilter() throws IOException {

Review Comment:
   I added a test for two BFs are not compatible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070637510


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   > Actually it does not matter what the data type is. We can simplify the test by writing two lists of hash values to create two BFs. Then we are pretty sure the test result of each value. What do you think?
   
   Good idea, I removed the random number logic and change~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] gszadovszky commented on pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383689874

   @yabola, what is your jira account? I'd like to assign the jira to you before closing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] gszadovszky commented on pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1382736990

   Thanks, @yabola for working on this and also to @wgtmac for reviewing. I do not have much experience with bloom filters so I will rely on your review. Ping me if you have a +1.
   @chenjunjiedada, any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070267028


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java:
##########
@@ -176,4 +176,10 @@ public String toString() {
    * @return compress algorithm that the bloom filter apply
    */
   Compression getCompression();
+
+  /**
+   * Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data
+   * @param otherBloomFilter The Bloom filter to combine this Bloom filter with.
+   */
+  void putAll(BloomFilter otherBloomFilter) throws IOException;

Review Comment:
   > IMHO, user needs to know if two BFs are compatible to merge before calling this function. So an utility function to test compatibility of two BFs is also required. WDYT?
   
   Thanks for your review. I had added some check before merging bloomFilter. I see that guava does not seem to have additional utility function.If necessary, I can add an utility function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383102223

   @wgtmac I had added unit test, please take a look~ thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591022


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   If the BloomFilter to be merged is not empty, there is a small probability that the two BloomFilters will be inconsistent when judging whether there is a hash value (I added random value).
   
   But if the BloomFilter to be merged is empty in the beginning, the result from these two BloomFilter should be always the same.
   
   I add two different test case, I am not sure if I need to add some more.



##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   If the BloomFilter to be merged is not empty, there is a small probability that the two BloomFilters will be inconsistent when judging whether there is a hash value (I added random value).
   
   But if the BloomFilter to be merged is empty in the beginning, the result from these two BloomFilter should be always the same.
   
   I add two different test cases, I am not sure if I need to add some more.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591022


##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   Because I added  random value, if the BloomFilter to be merged is not empty, there is a small probability that the two BloomFilters will be inconsistent when judging whether there is a hash value.
   But if the BloomFilter to be merged is empty in the beginning, the result from these two BloomFilter should be always the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070636680


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -398,18 +398,21 @@ public long hash(Binary value) {
 
   @Override
   public void merge(BloomFilter otherBloomFilter) throws IOException {
-    Preconditions.checkArgument((otherBloomFilter.getAlgorithm() == getAlgorithm()),
-      "BloomFilter algorithm should be same");
-    Preconditions.checkArgument((otherBloomFilter.getHashStrategy() == getHashStrategy()),
-      "BloomFilter hashStrategy should be same");
-    Preconditions.checkArgument((otherBloomFilter.getBitsetSize() == getBitsetSize()),
-      "BloomFilter bitset size should be same");
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),

Review Comment:
   Thanks, done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070907144


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {

Review Comment:
   I have added `canMergeFrom` method, please take a look if it is suitable~



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,24 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void merge(BloomFilter otherBloomFilter) throws IOException {
+    Preconditions.checkArgument(otherBloomFilter != null, "Cannot merge a null BloomFilter");
+    Preconditions.checkArgument((getAlgorithm() == otherBloomFilter.getAlgorithm()),
+      "BloomFilters must have the same algorithm (%s != %s)",
+        getAlgorithm(), otherBloomFilter.getAlgorithm());
+    Preconditions.checkArgument((getHashStrategy() == otherBloomFilter.getHashStrategy()),
+      "BloomFilters must have the same hashStrategy (%s != %s)",
+        getHashStrategy(), otherBloomFilter.getHashStrategy());
+    Preconditions.checkArgument((getBitsetSize() == otherBloomFilter.getBitsetSize()),

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] yabola commented on pull request #1020: PARQUET-2226 Support merge two Bloom Filters

Posted by GitBox <gi...@apache.org>.

yabola commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1382679860

   @shangxinli @gszadovszky Can you help take a look if it is suitable~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

Posted by GitBox <gi...@apache.org>.

wgtmac commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070233210


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -394,4 +395,21 @@ public long hash(float value) {
   public long hash(Binary value) {
     return hashFunction.hashBytes(value.getBytes());
   }
+
+  @Override
+  public void putAll(BloomFilter otherBloomFilter) throws IOException {

Review Comment:
   Could you add some tests to verify if the merged BF is as expected?



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java:
##########
@@ -176,4 +176,10 @@ public String toString() {
    * @return compress algorithm that the bloom filter apply
    */
   Compression getCompression();
+
+  /**
+   * Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data
+   * @param otherBloomFilter The Bloom filter to combine this Bloom filter with.
+   */
+  void putAll(BloomFilter otherBloomFilter) throws IOException;

Review Comment:
   IMHO, user needs to know if two BFs are compatible to merge before calling this function. So an utility function to test compatibility of two BFs is also required. WDYT?



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java:
##########
@@ -176,4 +176,10 @@ public String toString() {
    * @return compress algorithm that the bloom filter apply
    */
   Compression getCompression();
+
+  /**
+   * Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data
+   * @param otherBloomFilter The Bloom filter to combine this Bloom filter with.
+   */
+  void putAll(BloomFilter otherBloomFilter) throws IOException;

Review Comment:
   Rename to `merge`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org