You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/05/10 10:52:15 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6317: [parquet] Avoid read parquet index when there is no filter pushdown.

alamb commented on code in PR #6317:
URL: https://github.com/apache/arrow-datafusion/pull/6317#discussion_r1189728277


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -572,6 +575,15 @@ impl FileOpener for ParquetOpener {
     }
 }
 
+fn should_enable_page_index(
+    enable_page_index: bool,
+    page_pruning_predicate: &Option<Arc<PagePruningPredicate>>,
+) -> bool {
+    enable_page_index
+        && page_pruning_predicate.is_some()
+        && page_pruning_predicate.as_ref().unwrap().filter_number() > 0

Review Comment:
   I think we should avoid the `unwrap()` here 
   
   Perhaps something like
   
   ```suggestion
           && page_pruning_predicate.as_ref().map(|p| p.filter_number() > 0).unwrap_or(false)
   ```



##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -222,6 +222,11 @@ impl PagePruningPredicate {
         file_metrics.page_index_rows_filtered.add(total_skip);
         Ok(Some(final_selection))
     }
+
+    /// Returns the number of filters in the [`PagePruningPredicate`]
+    pub fn filter_number(&self) -> usize {

Review Comment:
   Given the use of this code, I think using `is_empty()` rather than `len()` would be clearer



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -714,3 +715,41 @@ async fn prune_decimal_in_list() {
     )
     .await;
 }
+
+#[tokio::test]
+async fn without_pushdown_filter() {
+    let mut context = ContextWithParquet::new(Scenario::Timestamps, Page).await;
+
+    let output1 = context.query("SELECT * FROM t").await;
+
+    let mut context = ContextWithParquet::new(Scenario::Timestamps, Page).await;
+
+    let output2 = context
+        .query("SELECT * FROM t where nanos < to_timestamp('2023-01-02 01:01:11Z')")
+        .await;
+
+    let bytes_scanned_without_filter = cast_count_metric(
+        output1
+            .parquet_metrics
+            .sum_by_name("bytes_scanned")
+            .unwrap(),
+    )
+    .unwrap();
+    let bytes_scanned_with_filter = cast_count_metric(
+        output2
+            .parquet_metrics
+            .sum_by_name("bytes_scanned")
+            .unwrap(),
+    )
+    .unwrap();
+
+    // Without filter will not read pageIndex.
+    assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);

Review Comment:
   nice test



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org