You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/08/25 16:44:20 UTC
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #923: [PARQUET-1968] FilterApi support In predicate

gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r695929108



##########
File path: pom.xml
##########
@@ -478,6 +478,7 @@
                 change to fix a integer overflow issue.
                 TODO: remove this after Parquet 1.13 release -->
               <exclude>org.apache.parquet.column.values.dictionary.DictionaryValuesWriter#dictionaryByteSize</exclude>
+              <exclude>org.apache.parquet.filter2.predicate.FilterPredicate</exclude>

Review comment:
       However, it is not what filter2 API is designed for, technically it is possible to the user to implement this interface and by adding new methods to it we really break our API.
   What do you think about adding default implementations by throwing `UnsupportedOperationException`? This way we do not need to add this class here.

##########
File path: parquet-column/src/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java
##########
@@ -123,6 +124,46 @@ public boolean accept(Visitor visitor) {
     }
   }
 
+  abstract class SetInspector implements IncrementallyUpdatedFilterPredicate {

Review comment:
       I know it is quite a mass to add new stuff into `IncrementallyUpdatedFilterPredicateBuilder` (via `IncrementallyUpdatedFilterPredicateGenerator`) but the current way you are checking the values one-by-one while you have a hashset. It could be faster if the related visit methods would be generated in `IncrementallyUpdatedFilterPredicateBuilder` just like for the other predicates and hash search algorithm would be used. What do you think?

##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
           pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
     }
 
+    @Override

Review comment:
       I am not sure how it effects performance in real life (e.g. how many values are in the set of the in/notIn predicate and how many pages do we have) but it can be done in smarter way. Column indexes are min/max values for the pages. If we sort the values in the set we can do two logarithmic searches (one for min then for max) to decide if a page might contain that value. If the column indexes themselves have "sorted" min/max values then we can be even faster. (See `BoundaryOrder` for more details.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org