Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/11 11:53:32 UTC

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3780: Implement parquet page-level skipping with column index, using min/ma…

Ted-Jiang commented on code in PR #3780:
URL: https://github.com/apache/arrow-datafusion/pull/3780#discussion_r992218989


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -785,6 +902,57 @@ impl<'a> PruningStatistics for RowGroupPruningStatistics<'a> {
     }
 }
 
+impl<'a> PruningStatistics for PagesPruningStatistics<'a> {
+    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
+        get_min_max_values_form_page_index!(self, column, min)
+    }
+
+    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
+        get_min_max_values_form_page_index!(self, column, max)
+    }
+
+    fn num_containers(&self) -> usize {
+        self.offset_indexes.get(self.col_id).unwrap().len()

Review Comment:
   @alamb PTAL. 🤔 For now, `num_containers` returns only one value.
   I think we should change it to
   ```
   fn num_containers(&self, column: &Column) -> usize {
   ```
   because each column chunk in a row group can have a different number of pages.
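
   To illustrate the proposal, here is a minimal sketch of a per-column container count. It is not the DataFusion API: `PageCounts`, `PerColumnContainers`, and the use of column names as keys are hypothetical stand-ins, and the `Option` return type is an assumption made so that an unknown column does not panic.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the page metadata of one row group:
/// maps a column name to the number of data pages in its column chunk.
pub struct PageCounts {
    pages_per_column: HashMap<String, usize>,
}

/// Hypothetical trait that mirrors only the method under discussion.
pub trait PerColumnContainers {
    /// Number of containers (pages) for `column`, or `None` if the column is unknown.
    fn num_containers(&self, column: &str) -> Option<usize>;
}

impl PageCounts {
    /// Build the lookup table from `(column name, page count)` pairs.
    pub fn new(counts: &[(&str, usize)]) -> Self {
        let pages_per_column = counts
            .iter()
            .map(|(name, count)| (name.to_string(), *count))
            .collect();
        Self { pages_per_column }
    }
}

impl PerColumnContainers for PageCounts {
    fn num_containers(&self, column: &str) -> Option<usize> {
        // Each column chunk reports its own page count, so the answer depends on the column.
        self.pages_per_column.get(column).copied()
    }
}
```

   With two columns registered as `("a", 3)` and `("b", 7)`, asking for `"a"` and `"b"` yields different counts, which is the case a column-independent `num_containers(&self)` cannot express.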



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -460,6 +498,20 @@ impl FileOpener for ParquetOpener {
     }
 }
 
+// Check PruningPredicates just work on one column.
+fn check_page_index_push_down_valid(predicate: &Option<PruningPredicate>) -> bool {
+    if let Some(predicate) = predicate {
+        // for now we only support pushDown on one col, because each col may have different page numbers, its hard to get
+        // `num_containers` from `PruningStatistics`.
+        let cols = predicate.need_input_columns_ids();
+        //Todo more specific rules

Review Comment:
   For now, because of `num_containers`, we only support predicates on a single column. In the future, we could add more specific rules.
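
   The single-column restriction could be sketched roughly as follows. This is only an illustration: the function name `page_index_push_down_valid` and its slice-of-indices parameter are hypothetical, the real check operates on a `PruningPredicate`, and returning `false` when there is no predicate is an assumption, since the diff above is cut off before that branch.

```rust
use std::collections::HashSet;

/// Returns true only when a predicate is present and references exactly one
/// distinct column index. `column_ids` is a hypothetical stand-in for the
/// column ids a pruning predicate needs as input.
pub fn page_index_push_down_valid(column_ids: Option<&[usize]>) -> bool {
    match column_ids {
        Some(ids) => {
            // Deduplicate, because a predicate such as `a > 1 AND a < 5`
            // may list the same column more than once.
            let distinct: HashSet<usize> = ids.iter().copied().collect();
            distinct.len() == 1
        }
        // Assumption: without a predicate there is nothing to push down.
        None => false,
    }
}
```

   A predicate that touches column 2 twice passes the check, while one that touches columns 0 and 1 does not, because the two column chunks may have different page counts.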


