You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/17 22:01:07 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3859: Factorize common AND factors out of OR predicates to support filterPu…

alamb commented on code in PR #3859:
URL: https://github.com/apache/arrow-datafusion/pull/3859#discussion_r997541658


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -119,6 +119,121 @@ fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
     }
 }
 
+///  Converts an expression to conjunctive normal form (CNF).

Review Comment:
   💯  for this comment



##########
datafusion/optimizer/src/filter_push_down.rs:
##########
@@ -952,6 +953,30 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn filter_keep_partial_agg() -> Result<()> {
+        let table_scan = test_table_scan()?;
+        let f1 = col("c").eq(lit(1i64)).and(col("b").gt(lit(2i64)));
+        let f2 = col("c").eq(lit(1i64)).and(col("b").gt(lit(3i64)));
+        let filter = f1.or(f2);
+        let plan = LogicalPlanBuilder::from(table_scan)
+            .aggregate(vec![col("a")], vec![sum(col("b")).alias("b")])?
+            .filter(filter)?
+            .build()?;
+        // filter of aggregate is after aggregation since they are non-commutative
+        // (c =1 AND b > 2) OR (c = 1 AND b > 3)
+        // rewrite to CNF
+        // (c = 1 OR c = 1) [can pushDown] AND (c = 1 OR b > 3) AND (b > 2 OR C = 1) AND (b > 2 OR b > 3)
+
+        let expected = "\
+            Filter: test.c = Int64(1) OR b > Int64(3) AND b > Int64(2) OR test.c = Int64(1) AND b > Int64(2) OR b > Int64(3)\
+            \n  Aggregate: groupBy=[[test.a]], aggr=[[SUM(test.b) AS b]]\
+            \n    Filter: test.c = Int64(1) OR test.c = Int64(1)\

Review Comment:
   I think this should be a rule in https://github.com/apache/arrow-datafusion/blob/37fe938261636bb7710b26cf26e2bbd010e1dbf0/datafusion/optimizer/src/simplify_expressions.rs#L264
   
   That pass gets run several times
   
   I recommend we file a follow on ticket for that particular rewrite (the same thing applies to `a AND a` in addition to `a OR a`



##########
datafusion/optimizer/src/filter_push_down.rs:
##########
@@ -529,7 +529,8 @@ fn optimize(plan: &LogicalPlan, mut state: State) -> Result<LogicalPlan> {
         }
         LogicalPlan::Analyze { .. } => push_down(&state, plan),
         LogicalPlan::Filter(filter) => {
-            let predicates = utils::split_conjunction(filter.predicate());
+            let filter_cnf = utils::CnfHelper::new().rewrite_to_cnf_impl(filter.predicate());
+            let predicates = utils::split_conjunction(&filter_cnf);

Review Comment:
   I recommend putting it in SimplifyExpressions -- specifically call it from https://github.com/apache/arrow-datafusion/blob/37fe938261636bb7710b26cf26e2bbd010e1dbf0/datafusion/optimizer/src/simplify_expressions.rs#L264 somehow
   
   You could also reuse the testing framework in that file as well



##########
datafusion/optimizer/src/utils.rs:
##########
@@ -119,6 +119,121 @@ fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
     }
 }
 
+///  Converts an expression to conjunctive normal form (CNF).
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_optimizer::utils::CnfHelper;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// // use split_conjunction_owned to split them
+/// assert_eq!(CnfHelper::new().rewrite_to_cnf_impl(&expr), expect);
+/// ```
+///
+pub struct CnfHelper {
+    max_count: usize,
+    current_count: usize,
+    exprs: Vec<Expr>,
+    original_expr: Option<Expr>,
+}
+
+impl CnfHelper {
+    pub fn new() -> Self {
+        CnfHelper {
+            max_count: 100,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    pub fn new_with_max_count(max_count: usize) -> Self {
+        CnfHelper {
+            max_count,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    fn increment_and_check_overload(&mut self) -> bool {
+        self.current_count += 1;
+        self.current_count >= self.max_count
+    }
+
+    pub fn rewrite_to_cnf_impl(&mut self, expr: &Expr) -> Expr {

Review Comment:
   I think if you used the `ExprRewriter` framework here you would avoid many of the `clone`s



##########
datafusion/optimizer/src/filter_push_down.rs:
##########
@@ -952,6 +953,30 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn filter_keep_partial_agg() -> Result<()> {
+        let table_scan = test_table_scan()?;
+        let f1 = col("c").eq(lit(1i64)).and(col("b").gt(lit(2i64)));
+        let f2 = col("c").eq(lit(1i64)).and(col("b").gt(lit(3i64)));
+        let filter = f1.or(f2);
+        let plan = LogicalPlanBuilder::from(table_scan)
+            .aggregate(vec![col("a")], vec![sum(col("b")).alias("b")])?
+            .filter(filter)?
+            .build()?;
+        // filter of aggregate is after aggregation since they are non-commutative
+        // (c =1 AND b > 2) OR (c = 1 AND b > 3)
+        // rewrite to CNF
+        // (c = 1 OR c = 1) [can pushDown] AND (c = 1 OR b > 3) AND (b > 2 OR C = 1) AND (b > 2 OR b > 3)
+
+        let expected = "\

Review Comment:
   I think it is a good improvement for sure -- thank you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org