You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/08/14 15:20:11 UTC

[GitHub] [parquet-mr] huaxingao opened a new pull request #923: [PARQUET-1968] FilterApi support In predicate

huaxingao opened a new pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-1968
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698937308



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java
##########
@@ -123,6 +124,46 @@ public boolean accept(Visitor visitor) {
     }
   }
 
+  abstract class SetInspector implements IncrementallyUpdatedFilterPredicate {

Review comment:
       Changed. Thanks!

##########
File path: pom.xml
##########
@@ -478,6 +478,7 @@
                 change to fix a integer overflow issue.
                 TODO: remove this after Parquet 1.13 release -->
               <exclude>org.apache.parquet.column.values.dictionary.DictionaryValuesWriter#dictionaryByteSize</exclude>
+              <exclude>org.apache.parquet.filter2.predicate.FilterPredicate</exclude>

Review comment:
       Changed. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-908849008


   @gszadovszky @shangxinli @dbtsai Thank you all very much for reviewing! I have changed the code to generate the visit methods for in/notIn and also added the default by throwing Exception. Will address the rest of the comments tomorrow or the day after tomorrow. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708849684



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -250,27 +250,16 @@ public Eq(Column<T> column, T value) {
     }
   }
 
-  // base class for In and NotIn
+  // base class for In and NotIn. In is used to filter data based on a list of values. NotIn is used to filter data that
+  // are not in the list of values.

Review comment:
       Changed.

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/filter2/bloomfilterlevel/BloomFilterImpl.java
##########
@@ -118,6 +120,45 @@ private ColumnChunkMetaData getColumnChunk(ColumnPath columnPath) {
     return BLOCK_MIGHT_MATCH;
   }
 
+  @Override
+  public <T extends Comparable<T>> Boolean visit(Operators.In<T> in) {
+    Set<T> values = in.getValues();
+
+    if (values.contains(null)) {
+      // the bloom filter bitset contains only non-null values so isn't helpful. this
+      // could check the column stats, but the StatisticsFilter is responsible
+      return BLOCK_MIGHT_MATCH;
+    }
+
+    Operators.Column<T> filterColumn = in.getColumn();
+    ColumnChunkMetaData meta = getColumnChunk(filterColumn.getColumnPath());
+    if (meta == null) {
+      // the column isn't in this file so all values are null, but the value
+      // must be non-null because of the above check.
+      return BLOCK_CANNOT_MATCH;
+    }
+
+    try {
+      BloomFilter bloomFilter = bloomFilterReader.readBloomFilter(meta);
+      if (bloomFilter != null) {
+        for (T value : values) {
+          if (bloomFilter.findHash(bloomFilter.hash(value))) {
+            return BLOCK_MIGHT_MATCH;
+          }
+        }
+        return BLOCK_CANNOT_MATCH;
+      }
+    } catch (RuntimeException e) {
+      LOG.warn(e.getMessage());
+    }

Review comment:
       Removed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-900018869


   also cc @chenjunjiedada


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] viirya commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699604019



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       Is it just enough to replace `str + "..."` to `str.append("...").toString`?

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       `str.substring(0, str.length() - 2)` is still `StringBuilder` operation. Seems fine?

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       Maybe we can replace line 273 with `StringBuilder` operation too?  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-899638598


   @gszadovszky @shangxinli @rdblue Could you please take a look at this PR when you have time? Thanks a lot!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-931369233


   @gszadovszky @shangxinli @viirya @dbtsai Thank you so much for all your help!!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699767598



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn

Review comment:
       Fixed. Thanks!

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);

Review comment:
       Removed.

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       Fixed. Thanks!

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;

Review comment:
       Yes, but just trying to follow the style at https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java#L150

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;
+      SetColumnFilterPredicate<?> that = (SetColumnFilterPredicate<?>) o;
+      return column.equals(that.column) && values.equals(that.values) && Objects.equals(toString, that.toString);

Review comment:
       Removed toString comparison




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-908849008


   @gszadovszky @shangxinli @dbtsai Thank you all very much for reviewing! I have changed the code to generate the visit methods for in/notIn and also added the default by throwing Exception. Will address the rest of the comments tomorrow or the day after tomorrow. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-931370971


   Thank you for your contribution, @huaxingao! Great work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r717696646



##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -18,25 +18,17 @@
  */
 package org.apache.parquet.column;
 
-import java.util.Iterator;
-import java.util.Set;
-
-import org.apache.parquet.io.api.Binary;
 import org.apache.parquet.schema.PrimitiveComparator;
 
 /**
- * This class calculates the max and min values of a Set.
+ * This class calculates the max and min values of an iterable collection.
  */
 public final class MinMax<T> {
-  private PrimitiveComparator comparator;
-  private Iterator<T> iterator;
   private T min = null;
   private T max = null;
 
-  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {
-    this.comparator = comparator;
-    this.iterator = iterator;
-    getMinAndMax();
+  public MinMax(PrimitiveComparator comparator, Iterable<T> iterable) {

Review comment:
       Fixed. Thanks!

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable<T> iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator<Integer>)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-            ((PrimitiveComparator<Binary>)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-             ((PrimitiveComparator<Double>)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-             ((PrimitiveComparator<Float>)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-            ((PrimitiveComparator<Long>)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] viirya commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699604019



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       Is it just enough to replace `str + "..."` to `str.append("...").toString`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698937308



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java
##########
@@ -123,6 +124,46 @@ public boolean accept(Visitor visitor) {
     }
   }
 
+  abstract class SetInspector implements IncrementallyUpdatedFilterPredicate {

Review comment:
       Changed. Thanks!

##########
File path: pom.xml
##########
@@ -478,6 +478,7 @@
                 change to fix a integer overflow issue.
                 TODO: remove this after Parquet 1.13 release -->
               <exclude>org.apache.parquet.column.values.dictionary.DictionaryValuesWriter#dictionaryByteSize</exclude>
+              <exclude>org.apache.parquet.filter2.predicate.FilterPredicate</exclude>

Review comment:
       Changed. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r695929108



##########
File path: pom.xml
##########
@@ -478,6 +478,7 @@
                 change to fix a integer overflow issue.
                 TODO: remove this after Parquet 1.13 release -->
               <exclude>org.apache.parquet.column.values.dictionary.DictionaryValuesWriter#dictionaryByteSize</exclude>
+              <exclude>org.apache.parquet.filter2.predicate.FilterPredicate</exclude>

Review comment:
       However, it is not what filter2 API is designed for, technically it is possible to the user to implement this interface and by adding new methods to it we really break our API.
   What do you think about adding default implementations by throwing `UnsupportedOperationException`? This way we do not need to add this class here.

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java
##########
@@ -123,6 +124,46 @@ public boolean accept(Visitor visitor) {
     }
   }
 
+  abstract class SetInspector implements IncrementallyUpdatedFilterPredicate {

Review comment:
       I know it is quite a mass to add new stuff into `IncrementallyUpdatedFilterPredicateBuilder` (via `IncrementallyUpdatedFilterPredicateGenerator`) but the current way you are checking the values one-by-one while you have a hashset. It could be faster if the related visit methods would be generated in `IncrementallyUpdatedFilterPredicateBuilder` just like for the other predicates and hash search algorithm would be used. What do you think?

##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
           pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
     }
 
+    @Override

Review comment:
       I am not sure how it effects performance in real life (e.g. how many values are in the set of the in/notIn predicate and how many pages do we have) but it can be done in smarter way. Column indexes are min/max values for the pages. If we sort the values in the set we can do two logarithmic searches (one for min then for max) to decide if a page might contain that value. If the column indexes themselves have "sorted" min/max values then we can be even faster. (See `BoundaryOrder` for more details.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699767560



##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
           pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
     }
 
+    @Override

Review comment:
       @gszadovszky I tried this: if the values in a page are <= the max value in the IN set, and >= the min value in the IN set, then the page might contain the values in the IN set. I am not sure if this is want you want so I only changed `In` for now. Please take a look. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r717696646



##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -18,25 +18,17 @@
  */
 package org.apache.parquet.column;
 
-import java.util.Iterator;
-import java.util.Set;
-
-import org.apache.parquet.io.api.Binary;
 import org.apache.parquet.schema.PrimitiveComparator;
 
 /**
- * This class calculates the max and min values of a Set.
+ * This class calculates the max and min values of an iterable collection.
  */
 public final class MinMax<T> {
-  private PrimitiveComparator comparator;
-  private Iterator<T> iterator;
   private T min = null;
   private T max = null;
 
-  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {
-    this.comparator = comparator;
-    this.iterator = iterator;
-    getMinAndMax();
+  public MinMax(PrimitiveComparator comparator, Iterable<T> iterable) {

Review comment:
       Fixed. Thanks!

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable<T> iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator<Integer>)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-            ((PrimitiveComparator<Binary>)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-             ((PrimitiveComparator<Double>)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-             ((PrimitiveComparator<Float>)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-            ((PrimitiveComparator<Long>)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r717354327



##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -18,25 +18,17 @@
  */
 package org.apache.parquet.column;
 
-import java.util.Iterator;
-import java.util.Set;
-
-import org.apache.parquet.io.api.Binary;
 import org.apache.parquet.schema.PrimitiveComparator;
 
 /**
- * This class calculates the max and min values of a Set.
+ * This class calculates the max and min values of an iterable collection.
  */
 public final class MinMax<T> {
-  private PrimitiveComparator comparator;
-  private Iterator<T> iterator;
   private T min = null;
   private T max = null;
 
-  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {
-    this.comparator = comparator;
-    this.iterator = iterator;
-    getMinAndMax();
+  public MinMax(PrimitiveComparator comparator, Iterable<T> iterable) {

Review comment:
       I think,  you should get warnings if you use a generic class without the generic type. `PrimitiveComparator<T>` or even `Comparator<T>` should work fine.

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable<T> iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator<Integer>)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-            ((PrimitiveComparator<Binary>)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-             ((PrimitiveComparator<Double>)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-             ((PrimitiveComparator<Float>)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-            ((PrimitiveComparator<Long>)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {

Review comment:
       nit: You may combine the two with an `&&`.

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable<T> iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator<Integer>)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-            ((PrimitiveComparator<Binary>)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-             ((PrimitiveComparator<Double>)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-             ((PrimitiveComparator<Float>)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-            ((PrimitiveComparator<Long>)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {
           max = element;
+        }
       }
       if (min == null) {
         min = element;
-      } else if (min != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator<Integer>)comparator).compare((Integer)min, (Integer)element) > 0) ||
-          (element instanceof Binary &&
-            ((PrimitiveComparator<Binary>)comparator).compare((Binary)min, (Binary)element) > 0) ||
-          (element instanceof Double &&
-            ((PrimitiveComparator<Double>)comparator).compare((Double)min, (Double)element) > 0) ||
-          (element instanceof Float &&
-             ((PrimitiveComparator<Float>)comparator).compare((Float)min, (Float)element) > 0) ||
-          (element instanceof Boolean &&
-            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)min, (Boolean)element) > 0) ||
-          (element instanceof Long &&
-            ((PrimitiveComparator<Long>)comparator).compare((Long)min, (Long)element) > 0))
+      } else if (element != null) {
+        if (comparator.compare(min, element) > 0) {

Review comment:
       See above




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] dbtsai commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
dbtsai commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r697021579



##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
           pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
     }
 
+    @Override

Review comment:
       +1




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r714379682



##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -293,8 +290,48 @@ boolean isNullPage(int pageIndex) {
 
     @Override
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
-      IntSet indexes = getMatchingIndexes(in);
-      return IndexIterator.filter(getPageCount(), indexes::contains);
+      Set<T> values = in.getValues();
+      TreeSet<T> myTreeSet = new TreeSet<>();
+      IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+      Iterator<T> it = values.iterator();
+      while(it.hasNext()) {
+        T value = it.next();
+        if (value != null) {
+          myTreeSet.add(value);
+        } else {
+          if (nullCounts == null) {
+            // Searching for nulls so if we don't have null related statistics we have to return all pages
+            return IndexIterator.all(getPageCount());
+          } else {
+            for (int i = 0; i < nullCounts.length; i++) {
+              if (nullCounts[i] > 0) {
+                matchingIndexes1.add(i);
+              }
+            }
+          }
+        }
+      }
+
+      IntSet matchingIndexes2 = new IntOpenHashSet();

Review comment:
       @gszadovszky Sorry for the delay. I did the following changes:
   1. Removed `TreeSet` to avoid sort the whole set, and added a method to get max and min.
   2. Updated `StatisticsFilter` to have the similar range comparison.
   3. I actually didn't change `notIn` because I don't think range checking works with `notIn`: if not in the range, `notIn` is true, but if in the range, it doesn't mean `notIn` is false.  For example, if we have `1, 2, 3, 6, 7 8, 9, 10` and the `notIn` predicate has set values `4, 5`, `4, 5` is in the range of 1 to 10, but `1, 2, 3, 6, 7 8, 9, 10` doesn't contain `4, 5`. In `StatisticsFilter` I simply return `BLOCK_MIGHT_MATCH;` for `notIn`. I probably should return `IndexIterator.all(getPageCount());` in `ColumnIndexBuilder` to be consistent with `StatisticsFilter`.

##########
File path: parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##########
@@ -1,14 +1,14 @@
-/* 

Review comment:
       @gszadovszky I added a test in `TestRecordLevelFilters.java` to test the new methods in the generated class `IncrementallyUpdatedFilterPredicateBuilder`. I have added tests in `TestColumnIndexFiltering` and `TestBloomFiltering` in my original changes. Do I need more tests in these two files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708850129



##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -293,8 +290,48 @@ boolean isNullPage(int pageIndex) {
 
     @Override
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
-      IntSet indexes = getMatchingIndexes(in);
-      return IndexIterator.filter(getPageCount(), indexes::contains);
+      Set<T> values = in.getValues();
+      TreeSet<T> myTreeSet = new TreeSet<>();
+      IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+      Iterator<T> it = values.iterator();
+      while(it.hasNext()) {
+        T value = it.next();
+        if (value != null) {
+          myTreeSet.add(value);
+        } else {
+          if (nullCounts == null) {
+            // Searching for nulls so if we don't have null related statistics we have to return all pages
+            return IndexIterator.all(getPageCount());
+          } else {
+            for (int i = 0; i < nullCounts.length; i++) {
+              if (nullCounts[i] > 0) {
+                matchingIndexes1.add(i);
+              }
+            }
+          }
+        }
+      }
+
+      IntSet matchingIndexes2 = new IntOpenHashSet();

Review comment:
       I will think more about what is a better way to implement this `visit(In<T> in)`. 

##########
File path: parquet-column/src/test/java/org/apache/parquet/internal/filter2/columnindex/TestColumnIndexFilter.java
##########
@@ -410,6 +410,7 @@ public void testFiltering() {
     Set<Integer> set5 = new HashSet<>();
     set5.add(7);
     set5.add(20);
+    System.out.println(in(intColumn("column5"), set5).toString());

Review comment:
       Sorry, removed.

##########
File path: parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##########
@@ -67,15 +67,18 @@ public void run() throws IOException {
     add("package org.apache.parquet.filter2.recordlevel;\n" +
         "\n" +
         "import java.util.List;\n" +
+        "import java.util.Set;\n" +
         "\n" +
         "import org.apache.parquet.hadoop.metadata.ColumnPath;\n" +
         "import org.apache.parquet.filter2.predicate.Operators.Eq;\n" +
         "import org.apache.parquet.filter2.predicate.Operators.Gt;\n" +
         "import org.apache.parquet.filter2.predicate.Operators.GtEq;\n" +
+      "import org.apache.parquet.filter2.predicate.Operators.In;\n" +

Review comment:
       fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r714956184



##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
##########
@@ -186,26 +186,36 @@ private boolean hasNulls(ColumnChunkMetaData column) {
       return BLOCK_MIGHT_MATCH;
     }
 
+    if (stats.isNumNullsSet()) {
+      if (stats.getNumNulls() == 0) {
+        if (values.contains(null) && values.size() == 1) return BLOCK_CANNOT_MATCH;
+      } else {
+        if (values.contains(null)) return BLOCK_MIGHT_MATCH;
+      }
+    }
+
     // drop if all the element in value < min || all the element in value > max
-    return allElementCanBeDropped(stats, values, meta);
+    if (stats.compareMinToValue(getMaxOrMin(true, values)) <= 0 &&
+      stats.compareMaxToValue(getMaxOrMin(false, values)) >= 0) {
+      return BLOCK_MIGHT_MATCH;
+    }
+    else {
+      return BLOCK_CANNOT_MATCH;
+    }
   }
 
-  private <T extends Comparable<T>> Boolean allElementCanBeDropped(Statistics<T> stats, Set<T> values, ColumnChunkMetaData meta) {
-    for (T value : values) {
-      if (value != null) {
-        if (stats.compareMinToValue(value) <= 0 && stats.compareMaxToValue(value) >= 0)
-          return false;
-      } else {
-        // numNulls is not set. We don't know anything about the nulls in this chunk
-        if (!stats.isNumNullsSet()) {
-          return false;
-        }
-        if (hasNulls(meta)) {
-          return false;
-        }
+  private <T extends Comparable<T>> T getMaxOrMin(boolean isMax, Set<T> values) {

Review comment:
       I don't really like to have a boolean flag for min/max. I also think it would be faster if the min/max values would be searched in the same iteration. Also, it would be nice if we wouldn't have to copy-paste this method twice. What do you think about the following design?
   
   Having a separate class e.g. `MinMax` that have two `T` fields: `min` and `max`. This class can be created by passing an `Iterable<T>` and a `PrimitiveComparator` arguments. At instantiation `min` and `max` would be initialized so directly after creating min and max can be retrieved.

##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -291,32 +291,29 @@ boolean isNullPage(int pageIndex) {
     @Override
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
       Set<T> values = in.getValues();
-      TreeSet<T> myTreeSet = new TreeSet<>();
-      IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+      IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
       Iterator<T> it = values.iterator();
       while(it.hasNext()) {
         T value = it.next();
-        if (value != null) {
-          myTreeSet.add(value);
-        } else {
+        if (value == null) {
           if (nullCounts == null) {
             // Searching for nulls so if we don't have null related statistics we have to return all pages
             return IndexIterator.all(getPageCount());
           } else {
             for (int i = 0; i < nullCounts.length; i++) {
               if (nullCounts[i] > 0) {
-                matchingIndexes1.add(i);
+                matchingIndexesForNull.add(i);
               }
             }
           }
         }
       }
 
-      IntSet matchingIndexes2 = new IntOpenHashSet();
-      IntSet matchingIndexes3 = new IntOpenHashSet();
+      IntSet matchingIndexesLessThanMax = new IntOpenHashSet();
+      IntSet matchingIndexesLargerThanMin = new IntOpenHashSet();

Review comment:
       I would suggest using `Greater` instead of `Larger`. That's the usual naming hence we have LT (LessThan) and GT (GreaterThan).

##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -287,6 +288,79 @@ boolean isNullPage(int pageIndex) {
           pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
     }
 
+    @Override
+    public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
+      Set<T> values = in.getValues();
+      IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
+      Iterator<T> it = values.iterator();
+      while(it.hasNext()) {
+        T value = it.next();
+        if (value == null) {
+          if (nullCounts == null) {
+            // Searching for nulls so if we don't have null related statistics we have to return all pages
+            return IndexIterator.all(getPageCount());
+          } else {
+            for (int i = 0; i < nullCounts.length; i++) {
+              if (nullCounts[i] > 0) {
+                matchingIndexesForNull.add(i);
+              }
+            }
+          }
+        }
+      }
+
+      IntSet matchingIndexesLessThanMax = new IntOpenHashSet();
+      IntSet matchingIndexesLargerThanMin = new IntOpenHashSet();
+
+      T min = getMaxOrMin(false, values);
+      T max = getMaxOrMin(true, values);
+
+      // We don't want to iterate through each of the values in the IN set to compare,
+      // because the size of the IN set might be very large. Instead, we want to only
+      // compare the max and min value of the IN set to see if the page might contain the
+      // values in the IN set.
+      // If the values in a page are <= the max value in the IN set,

Review comment:
       This way it sounds like all the value in a page shall be <= than max and >= than min and it is not true. What about `If there might be values in a page that are <= ...`?

##########
File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/TestRecordLevelFilters.java
##########
@@ -146,6 +147,33 @@ public void testAllFilter() throws Exception {
     assertEquals(new ArrayList<Group>(), found);
   }
 
+  @Test
+  public void testInFilter() throws Exception {
+    BinaryColumn name = binaryColumn("name");
+
+    HashSet<Binary> nameSet = new HashSet<>();
+    nameSet.add(Binary.fromString("thing2"));
+    nameSet.add(Binary.fromString("thing1"));
+    for (int i = 100; i < 200; i++) {
+      nameSet.add(Binary.fromString("p" + i));
+    }
+    FilterPredicate pred = in(name, nameSet);
+    List<Group> found = PhoneBookWriter.readFile(phonebookFile, FilterCompat.get(pred));
+
+    List<String> expectedNames = new ArrayList<>();
+    expectedNames.add("thing1");
+    expectedNames.add("thing2");
+    for (int i = 100; i < 200; i++) {
+      expectedNames.add("p" + i);
+    }
+
+    assertEquals(expectedNames.get(0), ((Group)(found.get(0))).getString("name", 0));
+    assertEquals(expectedNames.get(1), ((Group)(found.get(1))).getString("name", 0));
+    for (int i = 2; i < 102; i++) {
+      assertEquals(expectedNames.get(i), ((Group)(found.get(i))).getString("name", 0));
+    }

Review comment:
       This is not correct this way. You should validate that all the values returned by the reader fulfills the filter and there are no values left out. So, `"thing1"`, `"thing2"` and from `"p100"` to `"p199"` and nothing else.

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
##########
@@ -186,26 +186,36 @@ private boolean hasNulls(ColumnChunkMetaData column) {
       return BLOCK_MIGHT_MATCH;
     }
 
+    if (stats.isNumNullsSet()) {
+      if (stats.getNumNulls() == 0) {
+        if (values.contains(null) && values.size() == 1) return BLOCK_CANNOT_MATCH;
+      } else {
+        if (values.contains(null)) return BLOCK_MIGHT_MATCH;
+      }
+    }
+
     // drop if all the element in value < min || all the element in value > max
-    return allElementCanBeDropped(stats, values, meta);
+    if (stats.compareMinToValue(getMaxOrMin(true, values)) <= 0 &&
+      stats.compareMaxToValue(getMaxOrMin(false, values)) >= 0) {
+      return BLOCK_MIGHT_MATCH;
+    }
+    else {
+      return BLOCK_CANNOT_MATCH;
+    }
   }
 
-  private <T extends Comparable<T>> Boolean allElementCanBeDropped(Statistics<T> stats, Set<T> values, ColumnChunkMetaData meta) {
-    for (T value : values) {
-      if (value != null) {
-        if (stats.compareMinToValue(value) <= 0 && stats.compareMaxToValue(value) >= 0)
-          return false;
-      } else {
-        // numNulls is not set. We don't know anything about the nulls in this chunk
-        if (!stats.isNumNullsSet()) {
-          return false;
-        }
-        if (hasNulls(meta)) {
-          return false;
-        }
+  private <T extends Comparable<T>> T getMaxOrMin(boolean isMax, Set<T> values) {
+    T res = null;
+    for (T element : values) {
+      if (res == null) {
+        res = element;
+      } else if (isMax && res != null && element != null && res.compareTo(element) < 0) {

Review comment:
       Similarly to the other min/max search you need the proper `PrimitiveComparator` to compare. You can get it by calling `meta.getPrimitiveType().comparator()` in the `visit` method.

##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -326,12 +323,27 @@ boolean isNullPage(int pageIndex) {
       // and >= the min value in the IN set, then the page might contain
       // the values in the IN set.
       getBoundaryOrder().ltEq(createValueComparator(max))
-        .forEachRemaining((int index) -> matchingIndexes2.add(index));
+        .forEachRemaining((int index) -> matchingIndexesLessThanMax.add(index));
       getBoundaryOrder().gtEq(createValueComparator(min))
-        .forEachRemaining((int index) -> matchingIndexes3.add(index));
-      matchingIndexes2.retainAll(matchingIndexes3);
-      matchingIndexes2.addAll(matchingIndexes1);  // add the matching null pages
-      return IndexIterator.filter(getPageCount(), pageIndex -> matchingIndexes2.contains(pageIndex));
+        .forEachRemaining((int index) -> matchingIndexesLargerThanMin.add(index));
+      matchingIndexesLessThanMax.retainAll(matchingIndexesLargerThanMin);
+      IntSet matchingIndex = matchingIndexesLessThanMax;
+      matchingIndex.addAll(matchingIndexesForNull);  // add the matching null pages
+      return IndexIterator.filter(getPageCount(), pageIndex -> matchingIndex.contains(pageIndex));
+    }
+
+    private <T extends Comparable<T>> T getMaxOrMin(boolean isMax, Set<T> values) {
+      T res = null;
+      for (T element : values) {
+        if (res == null) {
+          res = element;
+        } else if (isMax && res != null && element != null && res.compareTo(element) < 0) {

Review comment:
       Sorry, just realized that the `compareTo` method of the value won't work properly in all cases. For example comparing two `Binary` values depends on the logical type. The comparison is different for decimals and strings. There are also different comparisons for signed and unsigned integers.
   So, you need to use the proper `PrimitiveComparator` instance for the type. Here in `ColumnIndexBase` you already have an instance for that: `comparator`.

##########
File path: parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##########
@@ -1,14 +1,14 @@
-/* 

Review comment:
       `TestColumnIndexFiltering` is another test class different from `TestColumnIndexFilter`. The first one is a higher level test. It would be nice if you could add some tests there as well.

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
##########
@@ -186,26 +186,36 @@ private boolean hasNulls(ColumnChunkMetaData column) {
       return BLOCK_MIGHT_MATCH;
     }
 
+    if (stats.isNumNullsSet()) {
+      if (stats.getNumNulls() == 0) {
+        if (values.contains(null) && values.size() == 1) return BLOCK_CANNOT_MATCH;
+      } else {
+        if (values.contains(null)) return BLOCK_MIGHT_MATCH;
+      }
+    }
+
     // drop if all the element in value < min || all the element in value > max
-    return allElementCanBeDropped(stats, values, meta);
+    if (stats.compareMinToValue(getMaxOrMin(true, values)) <= 0 &&
+      stats.compareMaxToValue(getMaxOrMin(false, values)) >= 0) {
+      return BLOCK_MIGHT_MATCH;
+    }
+    else {

Review comment:
       Please merge lines.

##########
File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/TestRecordLevelFilters.java
##########
@@ -146,6 +147,33 @@ public void testAllFilter() throws Exception {
     assertEquals(new ArrayList<Group>(), found);
   }
 
+  @Test
+  public void testInFilter() throws Exception {
+    BinaryColumn name = binaryColumn("name");
+
+    HashSet<Binary> nameSet = new HashSet<>();
+    nameSet.add(Binary.fromString("thing2"));
+    nameSet.add(Binary.fromString("thing1"));
+    for (int i = 100; i < 200; i++) {
+      nameSet.add(Binary.fromString("p" + i));
+    }

Review comment:
       I think you should also add some value to the set that are not in the file.

##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -293,8 +290,48 @@ boolean isNullPage(int pageIndex) {
 
     @Override
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
-      IntSet indexes = getMatchingIndexes(in);
-      return IndexIterator.filter(getPageCount(), indexes::contains);
+      Set<T> values = in.getValues();
+      TreeSet<T> myTreeSet = new TreeSet<>();
+      IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+      Iterator<T> it = values.iterator();
+      while(it.hasNext()) {
+        T value = it.next();
+        if (value != null) {
+          myTreeSet.add(value);
+        } else {
+          if (nullCounts == null) {
+            // Searching for nulls so if we don't have null related statistics we have to return all pages
+            return IndexIterator.all(getPageCount());
+          } else {
+            for (int i = 0; i < nullCounts.length; i++) {
+              if (nullCounts[i] > 0) {
+                matchingIndexes1.add(i);
+              }
+            }
+          }
+        }
+      }
+
+      IntSet matchingIndexes2 = new IntOpenHashSet();

Review comment:
       The return types of `StatisticsFilter` and `ColumnIndexBuilder` are different because the first one simply returns whether the related row group can be dropped or not while the second one returns the indexes of the matching pages.
   
   For `noIn` I agree. However there is an edge case you can figure out something when the min and max values are the same. In this case all the values shall be the same in the row group / page so you can say something in the case of there is only one element in `notIn`. It sounds a very rare scenario so I am fine returning "might match".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r716278171



##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -326,12 +323,27 @@ boolean isNullPage(int pageIndex) {
       // and >= the min value in the IN set, then the page might contain
       // the values in the IN set.
       getBoundaryOrder().ltEq(createValueComparator(max))
-        .forEachRemaining((int index) -> matchingIndexes2.add(index));
+        .forEachRemaining((int index) -> matchingIndexesLessThanMax.add(index));
       getBoundaryOrder().gtEq(createValueComparator(min))
-        .forEachRemaining((int index) -> matchingIndexes3.add(index));
-      matchingIndexes2.retainAll(matchingIndexes3);
-      matchingIndexes2.addAll(matchingIndexes1);  // add the matching null pages
-      return IndexIterator.filter(getPageCount(), pageIndex -> matchingIndexes2.contains(pageIndex));
+        .forEachRemaining((int index) -> matchingIndexesLargerThanMin.add(index));
+      matchingIndexesLessThanMax.retainAll(matchingIndexesLargerThanMin);
+      IntSet matchingIndex = matchingIndexesLessThanMax;
+      matchingIndex.addAll(matchingIndexesForNull);  // add the matching null pages
+      return IndexIterator.filter(getPageCount(), pageIndex -> matchingIndex.contains(pageIndex));
+    }
+
+    private <T extends Comparable<T>> T getMaxOrMin(boolean isMax, Set<T> values) {
+      T res = null;
+      for (T element : values) {
+        if (res == null) {
+          res = element;
+        } else if (isMax && res != null && element != null && res.compareTo(element) < 0) {

Review comment:
       I changed to `PrimitiveComparator`. I checked `instanceof` and then cast to use the `PrimitiveComparator` for each of the type. Not sure if this is the correct way. Please take a look.

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
##########
@@ -186,26 +186,36 @@ private boolean hasNulls(ColumnChunkMetaData column) {
       return BLOCK_MIGHT_MATCH;
     }
 
+    if (stats.isNumNullsSet()) {
+      if (stats.getNumNulls() == 0) {
+        if (values.contains(null) && values.size() == 1) return BLOCK_CANNOT_MATCH;
+      } else {
+        if (values.contains(null)) return BLOCK_MIGHT_MATCH;
+      }
+    }
+
     // drop if all the element in value < min || all the element in value > max
-    return allElementCanBeDropped(stats, values, meta);
+    if (stats.compareMinToValue(getMaxOrMin(true, values)) <= 0 &&
+      stats.compareMaxToValue(getMaxOrMin(false, values)) >= 0) {
+      return BLOCK_MIGHT_MATCH;
+    }
+    else {
+      return BLOCK_CANNOT_MATCH;
+    }
   }
 
-  private <T extends Comparable<T>> Boolean allElementCanBeDropped(Statistics<T> stats, Set<T> values, ColumnChunkMetaData meta) {
-    for (T value : values) {
-      if (value != null) {
-        if (stats.compareMinToValue(value) <= 0 && stats.compareMaxToValue(value) >= 0)
-          return false;
-      } else {
-        // numNulls is not set. We don't know anything about the nulls in this chunk
-        if (!stats.isNumNullsSet()) {
-          return false;
-        }
-        if (hasNulls(meta)) {
-          return false;
-        }
+  private <T extends Comparable<T>> T getMaxOrMin(boolean isMax, Set<T> values) {

Review comment:
       Done. Thanks!

##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -291,32 +291,29 @@ boolean isNullPage(int pageIndex) {
     @Override
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
       Set<T> values = in.getValues();
-      TreeSet<T> myTreeSet = new TreeSet<>();
-      IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+      IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
       Iterator<T> it = values.iterator();
       while(it.hasNext()) {
         T value = it.next();
-        if (value != null) {
-          myTreeSet.add(value);
-        } else {
+        if (value == null) {
           if (nullCounts == null) {
             // Searching for nulls so if we don't have null related statistics we have to return all pages
             return IndexIterator.all(getPageCount());
           } else {
             for (int i = 0; i < nullCounts.length; i++) {
               if (nullCounts[i] > 0) {
-                matchingIndexes1.add(i);
+                matchingIndexesForNull.add(i);
               }
             }
           }
         }
       }
 
-      IntSet matchingIndexes2 = new IntOpenHashSet();
-      IntSet matchingIndexes3 = new IntOpenHashSet();
+      IntSet matchingIndexesLessThanMax = new IntOpenHashSet();
+      IntSet matchingIndexesLargerThanMin = new IntOpenHashSet();

Review comment:
       Changed.

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
##########
@@ -186,26 +186,36 @@ private boolean hasNulls(ColumnChunkMetaData column) {
       return BLOCK_MIGHT_MATCH;
     }
 
+    if (stats.isNumNullsSet()) {
+      if (stats.getNumNulls() == 0) {
+        if (values.contains(null) && values.size() == 1) return BLOCK_CANNOT_MATCH;
+      } else {
+        if (values.contains(null)) return BLOCK_MIGHT_MATCH;
+      }
+    }
+
     // drop if all the element in value < min || all the element in value > max
-    return allElementCanBeDropped(stats, values, meta);
+    if (stats.compareMinToValue(getMaxOrMin(true, values)) <= 0 &&
+      stats.compareMaxToValue(getMaxOrMin(false, values)) >= 0) {
+      return BLOCK_MIGHT_MATCH;
+    }
+    else {

Review comment:
       Done. 

##########
File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/TestRecordLevelFilters.java
##########
@@ -146,6 +147,33 @@ public void testAllFilter() throws Exception {
     assertEquals(new ArrayList<Group>(), found);
   }
 
+  @Test
+  public void testInFilter() throws Exception {
+    BinaryColumn name = binaryColumn("name");
+
+    HashSet<Binary> nameSet = new HashSet<>();
+    nameSet.add(Binary.fromString("thing2"));
+    nameSet.add(Binary.fromString("thing1"));
+    for (int i = 100; i < 200; i++) {
+      nameSet.add(Binary.fromString("p" + i));
+    }

Review comment:
       Added. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] viirya commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699604499



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       `str.substring(0, str.length() - 2)` is still `StringBuilder` operation. Seems fine?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699767560



##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
           pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
     }
 
+    @Override

Review comment:
       @gszadovszky I tried this: if the values in a page are <= the max value in the IN set, and >= the min value in the IN set, then the page might contain the values in the IN set. I am not sure if this is want you want so I only changed `In` for now. Please take a look. Thanks!

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn

Review comment:
       Fixed. Thanks!

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);

Review comment:
       Removed.

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       Fixed. Thanks!

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;

Review comment:
       Yes, but just trying to follow the style at https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java#L150

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;
+      SetColumnFilterPredicate<?> that = (SetColumnFilterPredicate<?>) o;
+      return column.equals(that.column) && values.equals(that.values) && Objects.equals(toString, that.toString);

Review comment:
       Removed toString comparison




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r716278313



##########
File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/TestRecordLevelFilters.java
##########
@@ -146,6 +147,33 @@ public void testAllFilter() throws Exception {
     assertEquals(new ArrayList<Group>(), found);
   }
 
+  @Test
+  public void testInFilter() throws Exception {
+    BinaryColumn name = binaryColumn("name");
+
+    HashSet<Binary> nameSet = new HashSet<>();
+    nameSet.add(Binary.fromString("thing2"));
+    nameSet.add(Binary.fromString("thing1"));
+    for (int i = 100; i < 200; i++) {
+      nameSet.add(Binary.fromString("p" + i));
+    }
+    FilterPredicate pred = in(name, nameSet);
+    List<Group> found = PhoneBookWriter.readFile(phonebookFile, FilterCompat.get(pred));
+
+    List<String> expectedNames = new ArrayList<>();
+    expectedNames.add("thing1");
+    expectedNames.add("thing2");
+    for (int i = 100; i < 200; i++) {
+      expectedNames.add("p" + i);
+    }
+
+    assertEquals(expectedNames.get(0), ((Group)(found.get(0))).getString("name", 0));
+    assertEquals(expectedNames.get(1), ((Group)(found.get(1))).getString("name", 0));
+    for (int i = 2; i < 102; i++) {
+      assertEquals(expectedNames.get(i), ((Group)(found.get(i))).getString("name", 0));
+    }

Review comment:
       I added `assert(found.size() == 102)`. Since I have already checked that `found` contains `"thing1"`, `"thing2"` and from `"p100"` to `"p199"`, I think this assert size is sufficient to check if `found` doesn't contain anything else.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708260743



##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -293,8 +290,48 @@ boolean isNullPage(int pageIndex) {
 
     @Override
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
-      IntSet indexes = getMatchingIndexes(in);
-      return IndexIterator.filter(getPageCount(), indexes::contains);
+      Set<T> values = in.getValues();
+      TreeSet<T> myTreeSet = new TreeSet<>();
+      IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+      Iterator<T> it = values.iterator();
+      while(it.hasNext()) {
+        T value = it.next();
+        if (value != null) {
+          myTreeSet.add(value);
+        } else {
+          if (nullCounts == null) {
+            // Searching for nulls so if we don't have null related statistics we have to return all pages
+            return IndexIterator.all(getPageCount());
+          } else {
+            for (int i = 0; i < nullCounts.length; i++) {
+              if (nullCounts[i] > 0) {
+                matchingIndexes1.add(i);
+              }
+            }
+          }
+        }
+      }
+
+      IntSet matchingIndexes2 = new IntOpenHashSet();

Review comment:
       Please choose better naming for the final implementation.

##########
File path: parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##########
@@ -1,14 +1,14 @@
-/* 

Review comment:
       We have to validate In and NotIn for the record level filtering as well. All the tests here are validating higher levels only. (Record level filtering is when you actually read the data and check the values one by one to return only those values that are fulfill the filter. This is done by the class `IncrementallyUpdatedFilterPredicateBuilder` generated by this one.) 
   
   In would suggest checking and extending the test classes `TestRecordLevelFilters`, `TestColumnIndexFiltering` and `TestBloomFiltering`. 

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/filter2/bloomfilterlevel/BloomFilterImpl.java
##########
@@ -118,6 +120,45 @@ private ColumnChunkMetaData getColumnChunk(ColumnPath columnPath) {
     return BLOCK_MIGHT_MATCH;
   }
 
+  @Override
+  public <T extends Comparable<T>> Boolean visit(Operators.In<T> in) {
+    Set<T> values = in.getValues();
+
+    if (values.contains(null)) {
+      // the bloom filter bitset contains only non-null values so isn't helpful. this
+      // could check the column stats, but the StatisticsFilter is responsible
+      return BLOCK_MIGHT_MATCH;
+    }
+
+    Operators.Column<T> filterColumn = in.getColumn();
+    ColumnChunkMetaData meta = getColumnChunk(filterColumn.getColumnPath());
+    if (meta == null) {
+      // the column isn't in this file so all values are null, but the value
+      // must be non-null because of the above check.
+      return BLOCK_CANNOT_MATCH;
+    }
+
+    try {
+      BloomFilter bloomFilter = bloomFilterReader.readBloomFilter(meta);
+      if (bloomFilter != null) {
+        for (T value : values) {
+          if (bloomFilter.findHash(bloomFilter.hash(value))) {
+            return BLOCK_MIGHT_MATCH;
+          }
+        }
+        return BLOCK_CANNOT_MATCH;
+      }
+    } catch (RuntimeException e) {
+      LOG.warn(e.getMessage());
+    }

Review comment:
       Why is this necessary? Shouldn't we simply allow throwing though a RuntimeExeption from bloom filters?

##########
File path: parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##########
@@ -67,15 +67,18 @@ public void run() throws IOException {
     add("package org.apache.parquet.filter2.recordlevel;\n" +
         "\n" +
         "import java.util.List;\n" +
+        "import java.util.Set;\n" +
         "\n" +
         "import org.apache.parquet.hadoop.metadata.ColumnPath;\n" +
         "import org.apache.parquet.filter2.predicate.Operators.Eq;\n" +
         "import org.apache.parquet.filter2.predicate.Operators.Gt;\n" +
         "import org.apache.parquet.filter2.predicate.Operators.GtEq;\n" +
+      "import org.apache.parquet.filter2.predicate.Operators.In;\n" +

Review comment:
       Please correct indention.

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -250,27 +250,16 @@ public Eq(Column<T> column, T value) {
     }
   }
 
-  // base class for In and NotIn
+  // base class for In and NotIn. In is used to filter data based on a list of values. NotIn is used to filter data that
+  // are not in the list of values.

Review comment:
       Since you have added proper comments to a public class I would suggest using the javadoc style comment.

##########
File path: parquet-column/src/test/java/org/apache/parquet/internal/filter2/columnindex/TestColumnIndexFilter.java
##########
@@ -410,6 +410,7 @@ public void testFiltering() {
     Set<Integer> set5 = new HashSet<>();
     set5.add(7);
     set5.add(20);
+    System.out.println(in(intColumn("column5"), set5).toString());

Review comment:
       I guess this is not for the final code.

##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
           pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
     }
 
+    @Override

Review comment:
       I think the concept is good. You may do a similar way for `StatisticsFilter` so you do not need to check all the values in the set but the min/max values of it. In this case you might search for min/max values once inside `In` and `NotIn` instead of doing so for the two filters.
   I am not sure what is the best way to get the values of the set in the API. For example if it wasn't be a `Set` but a `List` or a vararg then you would be able to build up your own `TreeSet` by default. Or you may expect a `SortedSet` directly.
   In the other hand you do not need the whole set to be sorted. You only need the min and max and I think it is faster to search them than sorting the whole.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r716278127



##########
File path: parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##########
@@ -1,14 +1,14 @@
-/* 

Review comment:
       Added. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708974511



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -250,27 +250,16 @@ public Eq(Column<T> column, T value) {
     }
   }
 
-  // base class for In and NotIn
+  // base class for In and NotIn. In is used to filter data based on a list of values. NotIn is used to filter data that
+  // are not in the list of values.

Review comment:
       It is a nit but javadoc style comments in java starts with `/**`. If the simple `/*` one is used it won't be generated to the javadocs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
shangxinli commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698606499



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);

Review comment:
       I see you have a 'toString' to cache but do we see generally this is reused multiple times? If no, proactively converting to string will be a waste.  

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       Would it be possible to merge lines 272 and 273 into the above code of that building? the string? String operations sometimes consume a lot of memory like this. 

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;

Review comment:
       I guess you can just 'return this.getClass() == o.getClass()'

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;
+      SetColumnFilterPredicate<?> that = (SetColumnFilterPredicate<?>) o;
+      return column.equals(that.column) && values.equals(that.values) && Objects.equals(toString, that.toString);

Review comment:
       Is toString comparison still needed here? It seems toString have (values and class). You can just compare class here. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
shangxinli commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698598441



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn

Review comment:
       Have a better comment since it is public method  

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);

Review comment:
       I see you have a 'toString' to cache but do we see generally this is reused multiple times? If no, proactively converting to string will be a waste.  

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       Would it be possible to merge lines 272 and 273 into the above code of that building? the string? String operations sometimes consume a lot of memory like this. 

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;

Review comment:
       I guess you can just 'return this.getClass() == o.getClass()'

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;
+      SetColumnFilterPredicate<?> that = (SetColumnFilterPredicate<?>) o;
+      return column.equals(that.column) && values.equals(that.values) && Objects.equals(toString, that.toString);

Review comment:
       Is toString comparison still needed here? It seems toString have (values and class). You can just compare class here. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708850491



##########
File path: parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##########
@@ -1,14 +1,14 @@
-/* 

Review comment:
       I will think about how to achieve this. Thanks for the suggestion.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] viirya commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699605818



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
       Maybe we can replace line 273 with `StringBuilder` operation too?  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
shangxinli commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698598441



##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##########
@@ -247,6 +250,80 @@ public int hashCode() {
     }
   }
 
+  // base class for In and NotIn

Review comment:
       Have a better comment since it is public method  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r717354327



##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -18,25 +18,17 @@
  */
 package org.apache.parquet.column;
 
-import java.util.Iterator;
-import java.util.Set;
-
-import org.apache.parquet.io.api.Binary;
 import org.apache.parquet.schema.PrimitiveComparator;
 
 /**
- * This class calculates the max and min values of a Set.
+ * This class calculates the max and min values of an iterable collection.
  */
 public final class MinMax<T> {
-  private PrimitiveComparator comparator;
-  private Iterator<T> iterator;
   private T min = null;
   private T max = null;
 
-  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {
-    this.comparator = comparator;
-    this.iterator = iterator;
-    getMinAndMax();
+  public MinMax(PrimitiveComparator comparator, Iterable<T> iterable) {

Review comment:
       I think,  you should get warnings if you use a generic class without the generic type. `PrimitiveComparator<T>` or even `Comparator<T>` should work fine.

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable<T> iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator<Integer>)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-            ((PrimitiveComparator<Binary>)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-             ((PrimitiveComparator<Double>)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-             ((PrimitiveComparator<Float>)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-            ((PrimitiveComparator<Long>)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {

Review comment:
       nit: You may combine the two with an `&&`.

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable<T> iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator<Integer>)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-            ((PrimitiveComparator<Binary>)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-             ((PrimitiveComparator<Double>)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-             ((PrimitiveComparator<Float>)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-            ((PrimitiveComparator<Long>)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {
           max = element;
+        }
       }
       if (min == null) {
         min = element;
-      } else if (min != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator<Integer>)comparator).compare((Integer)min, (Integer)element) > 0) ||
-          (element instanceof Binary &&
-            ((PrimitiveComparator<Binary>)comparator).compare((Binary)min, (Binary)element) > 0) ||
-          (element instanceof Double &&
-            ((PrimitiveComparator<Double>)comparator).compare((Double)min, (Double)element) > 0) ||
-          (element instanceof Float &&
-             ((PrimitiveComparator<Float>)comparator).compare((Float)min, (Float)element) > 0) ||
-          (element instanceof Boolean &&
-            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)min, (Boolean)element) > 0) ||
-          (element instanceof Long &&
-            ((PrimitiveComparator<Long>)comparator).compare((Long)min, (Long)element) > 0))
+      } else if (element != null) {
+        if (comparator.compare(min, element) > 0) {

Review comment:
       See above




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r716474012



##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This class calculates the max and min values of a Set.
+ */
+public final class MinMax<T> {
+  private PrimitiveComparator comparator;
+  private Iterator<T> iterator;
+  private T min = null;
+  private T max = null;
+
+  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {

Review comment:
       I would expect an `Iterable` instead of an `Iterator`. This way the Set can be passed directly.

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This class calculates the max and min values of a Set.
+ */
+public final class MinMax<T> {
+  private PrimitiveComparator comparator;
+  private Iterator<T> iterator;
+  private T min = null;
+  private T max = null;
+
+  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {
+    this.comparator = comparator;
+    this.iterator = iterator;
+    getMinAndMax();
+  }
+
+  public T getMin() {
+    return min;
+  }
+
+  public T getMax() {
+    return max;
+  }
+
+  private void getMinAndMax() {
+    while(iterator.hasNext())  {
+      T element = iterator.next();
+      if (max == null) {
+        max = element;
+      } else if (max != null && element != null) {

Review comment:
       You are already in the else path so do not need to check for `max != null`. 

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This class calculates the max and min values of a Set.
+ */
+public final class MinMax<T> {
+  private PrimitiveComparator comparator;
+  private Iterator<T> iterator;
+  private T min = null;
+  private T max = null;
+
+  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {
+    this.comparator = comparator;
+    this.iterator = iterator;

Review comment:
       You do not need to store the iterator and comparator instances since they are only used at initialization. I would simply pass them to `getMinAndMax`.

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This class calculates the max and min values of a Set.

Review comment:
       If you are more general in implementation by using an Iterator (or Iterable) you should use the same in the comments.

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This class calculates the max and min values of a Set.
+ */
+public final class MinMax<T> {
+  private PrimitiveComparator comparator;
+  private Iterator<T> iterator;
+  private T min = null;
+  private T max = null;
+
+  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {

Review comment:
       ... and you may use foreach instead of a while loop on the iterator.

##########
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This class calculates the max and min values of a Set.
+ */
+public final class MinMax<T> {
+  private PrimitiveComparator comparator;
+  private Iterator<T> iterator;
+  private T min = null;
+  private T max = null;
+
+  public MinMax(PrimitiveComparator comparator, Iterator<T> iterator) {
+    this.comparator = comparator;
+    this.iterator = iterator;
+    getMinAndMax();
+  }
+
+  public T getMin() {
+    return min;
+  }
+
+  public T getMax() {
+    return max;
+  }
+
+  private void getMinAndMax() {
+    while(iterator.hasNext())  {
+      T element = iterator.next();
+      if (max == null) {
+        max = element;
+      } else if (max != null && element != null) {
+        if ((element instanceof Integer &&
+          ((PrimitiveComparator<Integer>)comparator).compare((Integer)max, (Integer)element) < 0) ||
+          (element instanceof Binary &&
+            ((PrimitiveComparator<Binary>)comparator).compare((Binary)max, (Binary)element) < 0) ||
+          (element instanceof Double &&
+             ((PrimitiveComparator<Double>)comparator).compare((Double)max, (Double)element) < 0) ||
+          (element instanceof Float &&
+             ((PrimitiveComparator<Float>)comparator).compare((Float)max, (Float)element) < 0) ||
+          (element instanceof Boolean &&
+            ((PrimitiveComparator<Boolean>)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
+          (element instanceof Long &&
+            ((PrimitiveComparator<Long>)comparator).compare((Long) max, (Long)element) < 0))

Review comment:
       You do not need to check the instance and cast. You may expect a `PrimitiveComparator<T>` and use it directly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky merged pull request #923: [PARQUET-1968] FilterApi support In predicate

Posted by GitBox <gi...@apache.org>.
gszadovszky merged pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org