You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/21 18:33:52 UTC
[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2282: Enable filter pushdown when using In_list on parquet

alamb commented on code in PR #2282:
URL: https://github.com/apache/arrow-datafusion/pull/2282#discussion_r855486216


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -688,6 +688,29 @@ fn build_predicate_expression(
                 return Ok(unhandled);
             }
         }
+        Expr::InList {
+            expr,
+            list,
+            negated,
+        } => {
+            if !list.is_empty() {

Review Comment:
   I also recommend putting a limit on the size of the INLIST when the rewrite occurs otherwise we'll likely explode in evaluation  -- for example don't allow `IN (1,2,.......1000000000)`  -- maybe a limit if 10 or 20 would be good?



##########
datafusion/core/tests/parquet_pruning.rs:
##########
@@ -403,6 +403,21 @@ async fn prune_f64_complex_expr_subtract() {
     assert_eq!(output.result_rows, 9, "{}", output.description());
 }
 
+#[tokio::test]
+async fn prune_int32_eq_in_list() {
+    // resulrt of sql "SELECT * FROM t where in (1)"

Review Comment:
   ```suggestion
       // result of sql "SELECT * FROM t where in (1)"
   ```



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -1340,6 +1363,26 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn row_group_predicate_in_list() -> Result<()> {
+        let schema = Schema::new(vec![
+            Field::new("c1", DataType::Int32, false),
+            Field::new("c2", DataType::Int32, false),
+        ]);
+        // test c1 in(1, 2, 3)
+        let expr = Expr::InList {
+            expr: Box::new(col("c1")),
+            list: vec![lit(1), lit(2), lit(3)],
+            negated: false,
+        };
+        let expected_expr = "#c1_min <= Int32(1) AND Int32(1) <= #c1_max OR #c1_min <= Int32(2) AND Int32(2) <= #c1_max OR #c1_min <= Int32(3) AND Int32(3) <= #c1_max";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+

Review Comment:
   I recommend some other test cases:
   
   1. `c1 IN ()` as there is code checking this case
   2. `c1 IN (1)`
   3. `c1 NOT IN (1,2,3)`



##########
datafusion/core/tests/parquet_pruning.rs:
##########
@@ -403,6 +403,21 @@ async fn prune_f64_complex_expr_subtract() {
     assert_eq!(output.result_rows, 9, "{}", output.description());
 }
 
+#[tokio::test]
+async fn prune_int32_eq_in_list() {
+    // resulrt of sql "SELECT * FROM t where in (1)"
+    let output = ContextWithParquet::new(Scenario::Int32)
+        .await
+        .query("SELECT * FROM t where i in (1)")
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_groups_pruned(), Some(3));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}

Review Comment:
   maybe it is worth a negative test too -- like `SELECT * from t where i in (1000)` and expect nothing is pruned



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org