You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/20 07:55:23 UTC

[GitHub] [arrow-datafusion] Ted-Jiang opened a new pull request, #3903: Factorize common AND factors out of OR predicates to support filterPu…

Ted-Jiang opened a new pull request, #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903

   …shDown as possible
   
   Signed-off-by: yangjiang <ya...@ebay.com>
   
   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3858 .
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1008043476


##########
benchmarks/expected-plans/q7.txt:
##########
@@ -3,7 +3,7 @@ Sort: shipping.supp_nation ASC NULLS LAST, shipping.cust_nation ASC NULLS LAST,
     Aggregate: groupBy=[[shipping.supp_nation, shipping.cust_nation, shipping.l_year]], aggr=[[SUM(shipping.volume)]]
       Projection: shipping.supp_nation, shipping.cust_nation, shipping.l_year, shipping.volume, alias=shipping
         Projection: n1.n_name AS supp_nation, n2.n_name AS cust_nation, datepart(Utf8("YEAR"), lineitem.l_shipdate) AS l_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) AS volume, alias=shipping
-          Filter: n1.n_name = Utf8("FRANCE") AND n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY") AND n2.n_name = Utf8("FRANCE")
+          Filter: (n1.n_name = Utf8("FRANCE") OR n2.n_name = Utf8("FRANCE")) AND (n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY"))

Review Comment:
   > think there are only 25 rows in the NATION table in TPCH,
   😂 



##########
benchmarks/expected-plans/q7.txt:
##########
@@ -3,7 +3,7 @@ Sort: shipping.supp_nation ASC NULLS LAST, shipping.cust_nation ASC NULLS LAST,
     Aggregate: groupBy=[[shipping.supp_nation, shipping.cust_nation, shipping.l_year]], aggr=[[SUM(shipping.volume)]]
       Projection: shipping.supp_nation, shipping.cust_nation, shipping.l_year, shipping.volume, alias=shipping
         Projection: n1.n_name AS supp_nation, n2.n_name AS cust_nation, datepart(Utf8("YEAR"), lineitem.l_shipdate) AS l_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) AS volume, alias=shipping
-          Filter: n1.n_name = Utf8("FRANCE") AND n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY") AND n2.n_name = Utf8("FRANCE")
+          Filter: (n1.n_name = Utf8("FRANCE") OR n2.n_name = Utf8("FRANCE")) AND (n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY"))

Review Comment:
   > think there are only 25 rows in the NATION table in TPCH,
   
   😂 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1007258449


##########
benchmarks/expected-plans/q7.txt:
##########
@@ -3,7 +3,7 @@ Sort: shipping.supp_nation ASC NULLS LAST, shipping.cust_nation ASC NULLS LAST,
     Aggregate: groupBy=[[shipping.supp_nation, shipping.cust_nation, shipping.l_year]], aggr=[[SUM(shipping.volume)]]
       Projection: shipping.supp_nation, shipping.cust_nation, shipping.l_year, shipping.volume, alias=shipping
         Projection: n1.n_name AS supp_nation, n2.n_name AS cust_nation, datepart(Utf8("YEAR"), lineitem.l_shipdate) AS l_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) AS volume, alias=shipping
-          Filter: n1.n_name = Utf8("FRANCE") AND n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY") AND n2.n_name = Utf8("FRANCE")
+          Filter: (n1.n_name = Utf8("FRANCE") OR n2.n_name = Utf8("FRANCE")) AND (n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY"))

Review Comment:
   I've also wanted to check TPC-H (it shouldn't affect, but just to see if there is an unexpected regression). It seems like there aren't any regression (`873.27 ms` vs  `878.54 ms`, only noise) 🚀 



##########
datafusion/optimizer/src/utils.rs:
##########
@@ -655,4 +828,135 @@ mod tests {
             "mismatch rewriting expr_from: {expr_from} to {rewrite_to}"
         )
     }
+
+    #[test]
+    fn test_permutations() {
+        assert_eq!(make_permutations(vec![]), vec![] as Vec<Vec<Expr>>)
+    }
+
+    #[test]
+    fn test_permutations_one() {
+        // [[a]] --> [[a]]
+        assert_eq!(
+            make_permutations(vec![vec![col("a")]]),
+            vec![vec![col("a")]]
+        )
+    }
+
+    #[test]
+    fn test_permutations_two() {
+        // [[a, b]] --> [[a], [b]]
+        assert_eq!(
+            make_permutations(vec![vec![col("a"), col("b")]]),
+            vec![vec![col("a")], vec![col("b")]]
+        )
+    }
+
+    #[test]
+    fn test_permutations_two_and_one() {
+        // [[a, b], [c]] --> [[a, c], [b, c]]
+        assert_eq!(
+            make_permutations(vec![vec![col("a"), col("b")], vec![col("c")]]),
+            vec![vec![col("a"), col("c")], vec![col("b"), col("c")]]
+        )
+    }
+
+    #[test]
+    fn test_permutations_two_and_one_and_two() {
+        // [[a, b], [c], [d, e]] --> [[a, c, d], [a, c, e], [b, c, d], [b, c, e]]

Review Comment:
   These are super useful, thanks 💯 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1293869326

   I'll take a look at it tonight, thanks for the ping @alamb!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1006373803


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+/// Splits an binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// See [`split_binary_owned`] for more details and an example.
+pub fn split_binary(expr: &Expr, op: Operator) -> Vec<&Expr> {
+    split_binary_impl(expr, op, vec![])
+}
+
+fn split_binary_impl<'a>(
+    expr: &'a Expr,
+    operator: Operator,
+    mut exprs: Vec<&'a Expr>,
+) -> Vec<&'a Expr> {
+    match expr {
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if *op == operator => {
+            let exprs = split_binary_impl(left, operator, exprs);
+            split_binary_impl(right, operator, exprs)
+        }
+        Expr::Alias(expr, _) => split_binary_impl(expr, operator, exprs),
+        other => {
+            exprs.push(other);
+            exprs
+        }
+    }
+}
+
+/// Given a list of lists of [`Expr`]s, returns a list of lists of
+/// [`Expr`]s of expressions where there is one expression from each
+/// from each of the input expressions
+///
+/// For example, given the input `[[a, b], [c], [d, e]]` returns
+/// `[a, c, d], [a, c, e], [b, c, d], [b, c, e]]`.
+fn permutations(mut exprs: VecDeque<Vec<&Expr>>) -> Vec<Vec<&Expr>> {
+    let first = if let Some(first) = exprs.pop_front() {
+        first
+    } else {
+        return vec![];
+    };
+
+    // base case:
+    if exprs.is_empty() {
+        first.into_iter().map(|e| vec![e]).collect()
+    } else {
+        first
+            .into_iter()
+            .flat_map(|expr| {
+                permutations(exprs.clone())
+                    .into_iter()
+                    .map(|expr_list| {
+                        // Create [expr, ...] for each permutation
+                        std::iter::once(expr)
+                            .chain(expr_list.into_iter())
+                            .collect::<Vec<&Expr>>()
+                    })
+                    .collect::<Vec<Vec<&Expr>>>()
+            })
+            .collect()
+    }
+}
+
+const MAX_CNF_REWRITE_CONJUNCTS: usize = 10;
+
+/// Tries to convert an expression to conjunctive normal form (CNF).
+///
+/// Does not convert the expression if the total number of conjuncts
+/// (exprs ANDed together) would exceed [`MAX_CNF_REWRITE_CONJUNCTS`].
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+///
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+///
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_optimizer::utils::cnf_rewrite;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// assert_eq!(expect, cnf_rewrite(expr));
+/// ```
+pub fn cnf_rewrite(expr: Expr) -> Expr {
+    // Find all exprs joined by OR
+    let disjuncts = split_binary(&expr, Operator::Or);
+
+    // For each expr, split now on AND
+    // A OR B OR C --> split each A, B and C
+    let disjunct_conjuncts: VecDeque<Vec<&Expr>> = disjuncts
+        .into_iter()
+        .map(|e| split_binary(e, Operator::And))
+        .collect::<VecDeque<_>>();
+
+    // Decide if we want to distribute the clauses. Heuristic is
+    // chosen to avoid creating huge predicates
+    let num_conjuncts = disjunct_conjuncts
+        .iter()
+        .fold(1usize, |sz, exprs| sz.saturating_mul(exprs.len()));
+
+    if disjunct_conjuncts.iter().any(|exprs| exprs.len() > 1)
+        && num_conjuncts < MAX_CNF_REWRITE_CONJUNCTS
+    {
+        let or_clauses = permutations(disjunct_conjuncts)
+            .into_iter()
+            // form the OR clauses( A OR B OR C ..)
+            .map(|exprs| disjunction(exprs.into_iter().cloned()).unwrap());

Review Comment:
   seems we only clone the exprs once here🤔



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1006365981


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+/// Splits an binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// See [`split_binary_owned`] for more details and an example.
+pub fn split_binary(expr: &Expr, op: Operator) -> Vec<&Expr> {
+    split_binary_impl(expr, op, vec![])
+}
+
+fn split_binary_impl<'a>(
+    expr: &'a Expr,
+    operator: Operator,
+    mut exprs: Vec<&'a Expr>,
+) -> Vec<&'a Expr> {
+    match expr {
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if *op == operator => {
+            let exprs = split_binary_impl(left, operator, exprs);
+            split_binary_impl(right, operator, exprs)
+        }
+        Expr::Alias(expr, _) => split_binary_impl(expr, operator, exprs),
+        other => {
+            exprs.push(other);
+            exprs
+        }
+    }
+}
+
+/// Given a list of lists of [`Expr`]s, returns a list of lists of
+/// [`Expr`]s of expressions where there is one expression from each
+/// from each of the input expressions
+///
+/// For example, given the input `[[a, b], [c], [d, e]]` returns
+/// `[a, c, d], [a, c, e], [b, c, d], [b, c, e]]`.
+fn permutations(mut exprs: VecDeque<Vec<&Expr>>) -> Vec<Vec<&Expr>> {
+    let first = if let Some(first) = exprs.pop_front() {
+        first
+    } else {
+        return vec![];
+    };
+
+    // base case:
+    if exprs.is_empty() {
+        first.into_iter().map(|e| vec![e]).collect()
+    } else {
+        first
+            .into_iter()
+            .flat_map(|expr| {
+                permutations(exprs.clone())
+                    .into_iter()
+                    .map(|expr_list| {
+                        // Create [expr, ...] for each permutation
+                        std::iter::once(expr)
+                            .chain(expr_list.into_iter())
+                            .collect::<Vec<&Expr>>()
+                    })
+                    .collect::<Vec<Vec<&Expr>>>()
+            })
+            .collect()
+    }
+}
+
+const MAX_CNF_REWRITE_CONJUNCTS: usize = 10;
+
+/// Tries to convert an expression to conjunctive normal form (CNF).
+///
+/// Does not convert the expression if the total number of conjuncts
+/// (exprs ANDed together) would exceed [`MAX_CNF_REWRITE_CONJUNCTS`].
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+///
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+///
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_optimizer::utils::cnf_rewrite;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// assert_eq!(expect, cnf_rewrite(expr));
+/// ```
+pub fn cnf_rewrite(expr: Expr) -> Expr {
+    // Find all exprs joined by OR
+    let disjuncts = split_binary(&expr, Operator::Or);
+
+    // For each expr, split now on AND
+    // A OR B OR C --> split each A, B and C
+    let disjunct_conjuncts: VecDeque<Vec<&Expr>> = disjuncts
+        .into_iter()
+        .map(|e| split_binary(e, Operator::And))
+        .collect::<VecDeque<_>>();
+
+    // Decide if we want to distribute the clauses. Heuristic is
+    // chosen to avoid creating huge predicates
+    let num_conjuncts = disjunct_conjuncts

Review Comment:
   Nice check!



##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+/// Splits an binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// See [`split_binary_owned`] for more details and an example.
+pub fn split_binary(expr: &Expr, op: Operator) -> Vec<&Expr> {
+    split_binary_impl(expr, op, vec![])
+}
+
+fn split_binary_impl<'a>(
+    expr: &'a Expr,
+    operator: Operator,
+    mut exprs: Vec<&'a Expr>,
+) -> Vec<&'a Expr> {
+    match expr {
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if *op == operator => {
+            let exprs = split_binary_impl(left, operator, exprs);
+            split_binary_impl(right, operator, exprs)
+        }
+        Expr::Alias(expr, _) => split_binary_impl(expr, operator, exprs),
+        other => {
+            exprs.push(other);
+            exprs
+        }
+    }
+}
+
+/// Given a list of lists of [`Expr`]s, returns a list of lists of
+/// [`Expr`]s of expressions where there is one expression from each
+/// from each of the input expressions
+///
+/// For example, given the input `[[a, b], [c], [d, e]]` returns
+/// `[a, c, d], [a, c, e], [b, c, d], [b, c, e]]`.
+fn permutations(mut exprs: VecDeque<Vec<&Expr>>) -> Vec<Vec<&Expr>> {
+    let first = if let Some(first) = exprs.pop_front() {
+        first
+    } else {
+        return vec![];
+    };
+
+    // base case:
+    if exprs.is_empty() {
+        first.into_iter().map(|e| vec![e]).collect()
+    } else {
+        first
+            .iter()
+            .flat_map(|expr| {
+                permutations(exprs.clone())
+                    .into_iter()
+                    .map(|expr_list| {
+                        // Create [expr, ...] for each permutation
+                        std::iter::once(expr.clone())
+                            .chain(expr_list.into_iter())
+                            .collect::<Vec<&Expr>>()
+                    })
+                    .collect::<Vec<Vec<&Expr>>>()
+            })
+            .collect()
+    }
+}
+
+const MAX_CNF_REWRITE_CONJUNCTS: usize = 10;
+
+/// Tries to convert an expression to conjunctive normal form (CNF).
+///
+/// Does not convert the expression if the total number of conjuncts
+/// (exprs ANDed together) would exceed [`MAX_CNF_REWRITE_CONJUNCTS`].
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+///
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+///
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_optimizer::utils::cnf_rewrite;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// assert_eq!(expect, cnf_rewrite(expr));
+/// ```
+pub fn cnf_rewrite(expr: Expr) -> Expr {
+    // Find all exprs joined by OR
+    let disjuncts = split_binary(&expr, Operator::Or);
+
+    // For each expr, split now on AND
+    // A OR B OR C --> split each A, B and C
+    let disjunct_conjuncts: VecDeque<Vec<&Expr>> = disjuncts
+        .into_iter()
+        .map(|e| split_binary(e, Operator::And))
+        .collect::<VecDeque<_>>();
+
+    // Decide if we want to distribute the clauses. Heuristic is
+    // chosen to avoid creating huge predicates
+    let num_conjuncts = disjunct_conjuncts
+        .iter()
+        .fold(1usize, |sz, exprs| sz.saturating_mul(exprs.len()));
+
+    if disjunct_conjuncts.iter().any(|exprs| exprs.len() > 1)
+        && num_conjuncts < MAX_CNF_REWRITE_CONJUNCTS
+    {
+        let or_clauses = permutations(disjunct_conjuncts)
+            .into_iter()
+            // form the OR clauses( A OR B OR C ..)
+            .map(|exprs| disjunction(exprs.into_iter().cloned()).unwrap());
+        conjunction(or_clauses).unwrap()
+    }
+    // otherwise return the original expression
+    else {
+        expr
+    }
+}

Review Comment:
   Yes, without using the rewrite trait seems more clearly 👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1293591733

   > @alamb Thanks for your help. This change looks better than mine 😄
   
   
   
   Thanks for all your work on Page Index ❤️ !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1291080694

   > @alamb you can try this 😂, maybe i will learn something from you, i am not very familiar with this rewrite framework
   
   > I can continue to work on page index
   
   Sounds good ! I will try and find time tomorrow 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1006368686


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),

Review Comment:
   my bad 😂



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1006933068


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),

Review Comment:
   I think that was actually my bad when I refactored it back to `conjunction`



##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),

Review Comment:
   I think that was actually my bad when I refactored it back to `conjunction` 🤦 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1295815780

   Benchmark runs are scheduled for baseline = 6fae0bd3e2d0bc7f1406546b82a9ad0a2803a733 and contender = 345234550712173477e7807ba2cf67dd2ffb9ed5. 345234550712173477e7807ba2cf67dd2ffb9ed5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/318f67aaf86449cb9e070cfc8fd99e70...39af42b8fb5c4c47a8b1c3af51d29d4c/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5733339a7a22413ab82d6cd2b94e8ba0...a1d84573bf5a4b2c8fdce8a646e22eff/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7cd352eea9184fcf91436be03ae8f44a...bc07167193cb4eb1bd5021934b40c6cf/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a84bd5fe624a47789a030d958cffd1e7...9cd872f6ce3843d3bfd193deb8f9d157/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1007571029


##########
benchmarks/expected-plans/q7.txt:
##########
@@ -3,7 +3,7 @@ Sort: shipping.supp_nation ASC NULLS LAST, shipping.cust_nation ASC NULLS LAST,
     Aggregate: groupBy=[[shipping.supp_nation, shipping.cust_nation, shipping.l_year]], aggr=[[SUM(shipping.volume)]]
       Projection: shipping.supp_nation, shipping.cust_nation, shipping.l_year, shipping.volume, alias=shipping
         Projection: n1.n_name AS supp_nation, n2.n_name AS cust_nation, datepart(Utf8("YEAR"), lineitem.l_shipdate) AS l_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) AS volume, alias=shipping
-          Filter: n1.n_name = Utf8("FRANCE") AND n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY") AND n2.n_name = Utf8("FRANCE")
+          Filter: (n1.n_name = Utf8("FRANCE") OR n2.n_name = Utf8("FRANCE")) AND (n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY"))

Review Comment:
   @isidentical  thanks for testing👍! I think this rewrite will make `row_filter` work, I think it will boost up query when   `row_filter` is stable! 🤔



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb merged PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1005869701


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+/// Splits an binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// See [`split_binary_owned`] for more details and an example.
+pub fn split_binary(expr: &Expr, op: Operator) -> Vec<&Expr> {
+    split_binary_impl(expr, op, vec![])
+}
+
+fn split_binary_impl<'a>(
+    expr: &'a Expr,
+    operator: Operator,
+    mut exprs: Vec<&'a Expr>,
+) -> Vec<&'a Expr> {
+    match expr {
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if *op == operator => {
+            let exprs = split_binary_impl(left, operator, exprs);
+            split_binary_impl(right, operator, exprs)
+        }
+        Expr::Alias(expr, _) => split_binary_impl(expr, operator, exprs),
+        other => {
+            exprs.push(other);
+            exprs
+        }
+    }
+}
+
+/// Given a list of lists of [`Expr`]s, returns a list of lists of
+/// [`Expr`]s of expressions where there is one expression from each
+/// from each of the input expressions
+///
+/// For example, given the input `[[a, b], [c], [d, e]]` returns
+/// `[a, c, d], [a, c, e], [b, c, d], [b, c, e]]`.
+fn permutations(mut exprs: VecDeque<Vec<&Expr>>) -> Vec<Vec<&Expr>> {
+    let first = if let Some(first) = exprs.pop_front() {
+        first
+    } else {
+        return vec![];
+    };
+
+    // base case:
+    if exprs.is_empty() {
+        first.into_iter().map(|e| vec![e]).collect()
+    } else {
+        first
+            .iter()
+            .flat_map(|expr| {
+                permutations(exprs.clone())
+                    .into_iter()
+                    .map(|expr_list| {
+                        // Create [expr, ...] for each permutation
+                        std::iter::once(expr.clone())
+                            .chain(expr_list.into_iter())
+                            .collect::<Vec<&Expr>>()
+                    })
+                    .collect::<Vec<Vec<&Expr>>>()
+            })
+            .collect()
+    }
+}
+
+const MAX_CNF_REWRITE_CONJUNCTS: usize = 10;
+
+/// Tries to convert an expression to conjunctive normal form (CNF).
+///
+/// Does not convert the expression if the total number of conjuncts
+/// (exprs ANDed together) would exceed [`MAX_CNF_REWRITE_CONJUNCTS`].
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+///
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+///
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_optimizer::utils::cnf_rewrite;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// assert_eq!(expect, cnf_rewrite(expr));
+/// ```
+pub fn cnf_rewrite(expr: Expr) -> Expr {
+    // Find all exprs joined by OR
+    let disjuncts = split_binary(&expr, Operator::Or);
+
+    // For each expr, split now on AND
+    // A OR B OR C --> split each A, B and C
+    let disjunct_conjuncts: VecDeque<Vec<&Expr>> = disjuncts
+        .into_iter()
+        .map(|e| split_binary(e, Operator::And))
+        .collect::<VecDeque<_>>();
+
+    // Decide if we want to distribute the clauses. Heuristic is
+    // chosen to avoid creating huge predicates
+    let num_conjuncts = disjunct_conjuncts
+        .iter()
+        .fold(1usize, |sz, exprs| sz.saturating_mul(exprs.len()));
+
+    if disjunct_conjuncts.iter().any(|exprs| exprs.len() > 1)
+        && num_conjuncts < MAX_CNF_REWRITE_CONJUNCTS
+    {
+        let or_clauses = permutations(disjunct_conjuncts)
+            .into_iter()
+            // form the OR clauses( A OR B OR C ..)
+            .map(|exprs| disjunction(exprs.into_iter().cloned()).unwrap());
+        conjunction(or_clauses).unwrap()
+    }
+    // otherwise return the original expression
+    else {
+        expr
+    }
+}

Review Comment:
   @Ted-Jiang  I got nerdsniped working on this PR -- I hope you don't mind I rewrote your logic into something that avoided cloning unless it is actually doing the rewrite



##########
datafusion/optimizer/src/utils.rs:
##########
@@ -655,4 +822,135 @@ mod tests {
             "mismatch rewriting expr_from: {expr_from} to {rewrite_to}"
         )
     }
+
+    #[test]
+    fn test_permutations() {
+        assert_eq!(make_permutations(vec![]), vec![] as Vec<Vec<Expr>>)
+    }
+
+    #[test]
+    fn test_permutations_one() {
+        // [[a]] --> [[a]]
+        assert_eq!(
+            make_permutations(vec![vec![col("a")]]),
+            vec![vec![col("a")]]
+        )
+    }
+
+    #[test]
+    fn test_permutations_two() {
+        // [[a, b]] --> [[a], [b]]
+        assert_eq!(
+            make_permutations(vec![vec![col("a"), col("b")]]),
+            vec![vec![col("a")], vec![col("b")]]
+        )
+    }
+
+    #[test]
+    fn test_permutations_two_and_one() {
+        // [[a, b], [c]] --> [[a, c], [b, c]]
+        assert_eq!(
+            make_permutations(vec![vec![col("a"), col("b")], vec![col("c")]]),
+            vec![vec![col("a"), col("c")], vec![col("b"), col("c")]]
+        )
+    }
+
+    #[test]
+    fn test_permutations_two_and_one_and_two() {
+        // [[a, b], [c], [d, e]] --> [[a, c, d], [a, c, e], [b, c, d], [b, c, e]]
+        assert_eq!(
+            make_permutations(vec![
+                vec![col("a"), col("b")],
+                vec![col("c")],
+                vec![col("d"), col("e")]
+            ]),
+            vec![
+                vec![col("a"), col("c"), col("d")],
+                vec![col("a"), col("c"), col("e")],
+                vec![col("b"), col("c"), col("d")],
+                vec![col("b"), col("c"), col("e")],
+            ]
+        )
+    }
+
+    /// call permutations with owned `Expr`s for easier testing
+    fn make_permutations(exprs: impl IntoIterator<Item = Vec<Expr>>) -> Vec<Vec<Expr>> {
+        let exprs = exprs.into_iter().collect::<Vec<_>>();
+
+        let exprs: VecDeque<Vec<&Expr>> = exprs
+            .iter()
+            .map(|exprs| exprs.iter().collect::<Vec<&Expr>>())
+            .collect();
+
+        permutations(exprs)
+            .into_iter()
+            // copy &Expr --> Expr
+            .map(|exprs| exprs.into_iter().cloned().collect())
+            .collect()
+    }
+
+    #[test]
+    fn test_rewrite_cnf() {

Review Comment:
   These tests are all the same



##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+/// Splits an binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// See [`split_binary_owned`] for more details and an example.
+pub fn split_binary(expr: &Expr, op: Operator) -> Vec<&Expr> {
+    split_binary_impl(expr, op, vec![])
+}
+
+fn split_binary_impl<'a>(
+    expr: &'a Expr,
+    operator: Operator,
+    mut exprs: Vec<&'a Expr>,
+) -> Vec<&'a Expr> {
+    match expr {
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if *op == operator => {
+            let exprs = split_binary_impl(left, operator, exprs);
+            split_binary_impl(right, operator, exprs)
+        }
+        Expr::Alias(expr, _) => split_binary_impl(expr, operator, exprs),
+        other => {
+            exprs.push(other);
+            exprs
+        }
+    }
+}
+
+/// Given a list of lists of [`Expr`]s, returns a list of lists of
+/// [`Expr`]s of expressions where there is one expression from each
+/// from each of the input expressions
+///
+/// For example, given the input `[[a, b], [c], [d, e]]` returns
+/// `[a, c, d], [a, c, e], [b, c, d], [b, c, e]]`.
+fn permutations(mut exprs: VecDeque<Vec<&Expr>>) -> Vec<Vec<&Expr>> {
+    let first = if let Some(first) = exprs.pop_front() {
+        first
+    } else {
+        return vec![];
+    };
+
+    // base case:
+    if exprs.is_empty() {
+        first.into_iter().map(|e| vec![e]).collect()
+    } else {
+        first
+            .iter()
+            .flat_map(|expr| {
+                permutations(exprs.clone())
+                    .into_iter()
+                    .map(|expr_list| {
+                        // Create [expr, ...] for each permutation
+                        std::iter::once(expr.clone())
+                            .chain(expr_list.into_iter())
+                            .collect::<Vec<&Expr>>()
+                    })
+                    .collect::<Vec<Vec<&Expr>>>()
+            })
+            .collect()
+    }
+}
+
+const MAX_CNF_REWRITE_CONJUNCTS: usize = 10;

Review Comment:
   By setting limiting the number of exprs to be created, I also avoided having to explicitly disable the cnf rewrite



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1292929972

   @alamb Thanks for your help. This change looks better than mine 😄


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1003764553


##########
benchmarks/expected-plans/q7.txt:
##########
@@ -3,7 +3,7 @@ Sort: shipping.supp_nation ASC NULLS LAST, shipping.cust_nation ASC NULLS LAST,
     Aggregate: groupBy=[[shipping.supp_nation, shipping.cust_nation, shipping.l_year]], aggr=[[SUM(shipping.volume)]]
       Projection: shipping.supp_nation, shipping.cust_nation, shipping.l_year, shipping.volume, alias=shipping
         Projection: n1.n_name AS supp_nation, n2.n_name AS cust_nation, datepart(Utf8("YEAR"), lineitem.l_shipdate) AS l_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) AS volume, alias=shipping
-          Filter: n1.n_name = Utf8("FRANCE") AND n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY") AND n2.n_name = Utf8("FRANCE")
+          Filter: n1.n_name = Utf8("FRANCE") OR n2.n_name = Utf8("FRANCE") AND n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY")

Review Comment:
   this is so clever



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1000966996


##########
datafusion/optimizer/src/filter_push_down.rs:
##########
@@ -80,6 +83,11 @@ impl State {
             .zip(predicates.1)
             .for_each(|(expr, cols)| self.filters.push((expr.clone(), cols.clone())))
     }
+
+    fn with_cnf_rewrite(mut self) -> Self {

Review Comment:
   I think this should have some doccomments about what it is and when one it should/should not be set



##########
datafusion/optimizer/src/utils.rs:
##########
@@ -96,30 +96,173 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// ];
 ///
 /// // use split_conjunction_owned to split them
-/// assert_eq!(split_conjunction_owned(expr), split);
+/// assert_eq!(split_conjunction_owned(expr, Operator::And), split);
 /// ```
-pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+pub fn split_conjunction_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_conjunction_owned_impl(expr, op, vec![])
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+fn split_conjunction_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_conjunction_owned_impl(*left, Operator::And, exprs);
+            split_conjunction_owned_impl(*right, Operator::And, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, Operator::And, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+///  Converts an expression to conjunctive normal form (CNF).
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_expr::expr_rewriter::ExprRewritable;
+/// # use datafusion_optimizer::utils::CnfHelper;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// // use split_conjunction_owned to split them
+/// assert_eq!(expr.rewrite(& mut CnfHelper::new()).unwrap(), expect);
+/// ```
+///
+pub struct CnfHelper {
+    max_count: usize,
+    current_count: usize,
+    exprs: Vec<Expr>,
+    original_expr: Option<Expr>,
+}
+
+impl CnfHelper {
+    pub fn new() -> Self {
+        CnfHelper {
+            max_count: 50,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    pub fn new_with_max_count(max_count: usize) -> Self {
+        CnfHelper {
+            max_count,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    fn increment_and_check_overload(&mut self) -> bool {
+        self.current_count += 1;
+        self.current_count >= self.max_count
+    }
+}
+
+impl ExprRewriter for CnfHelper {

Review Comment:
   I feel like this code is a little confusing -- it isn't really doing a rewrite the same way other `ExprRewriters` do. Instead it is doing an analysis in one pass through (via `pre_visit`) and then replacing it in one go
   
   I also feel like number of `clone`s is significant -- it both copies the original expr (in case the rewrite doesn't work) but also copies the exprs in the BinaryExprs 
   
   I feel like there must be a way to avoid at least one of those copies 🤔  But I am not sure without trying it myself



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1007884061


##########
benchmarks/expected-plans/q7.txt:
##########
@@ -3,7 +3,7 @@ Sort: shipping.supp_nation ASC NULLS LAST, shipping.cust_nation ASC NULLS LAST,
     Aggregate: groupBy=[[shipping.supp_nation, shipping.cust_nation, shipping.l_year]], aggr=[[SUM(shipping.volume)]]
       Projection: shipping.supp_nation, shipping.cust_nation, shipping.l_year, shipping.volume, alias=shipping
         Projection: n1.n_name AS supp_nation, n2.n_name AS cust_nation, datepart(Utf8("YEAR"), lineitem.l_shipdate) AS l_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) AS volume, alias=shipping
-          Filter: n1.n_name = Utf8("FRANCE") AND n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY") AND n2.n_name = Utf8("FRANCE")
+          Filter: (n1.n_name = Utf8("FRANCE") OR n2.n_name = Utf8("FRANCE")) AND (n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY"))

Review Comment:
   Thanks for checking @isidentical  -- I think there are only 25 rows in the `NATION` table in TPCH, so the ordering of predicates doesn't really matter for performance in that case 😆 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1294796859

   I think for the Physical FilterExec, we should still keep the original form or a simplified form for runtime evaluation.
   And since we are working on the CBO stats and Filter selectivity estimation, it better to use the simplified form. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1295813626

   Thanks again @Ted-Jiang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1287798493

   @alamb  Sorry for the late respond.
   >I feel like this code is a little confusing -- it isn't really doing a rewrite the same way other ExprRewriters do. Instead it is doing an analysis in one pass through (via pre_visit) and then replacing it in one go
   
   Yes😂, i think it because not mutate in the same place, this rewrite build an expr tress in another shape. I try to add `split_binary` to avoid some clone calls.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1287799737

   > Thanks @Ted-Jiang -- I have one set of proposed changes here: [Ted-Jiang#27](https://github.com/Ted-Jiang/arrow-datafusion/pull/27)
   
   Thanks a lot !😄


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1294823104

   @alamb 
   I'm OK to merge the PR. But we might need a following PR to add a new rule to simply expressions before translate the logic Filter to physical FilterExec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1295813594

   I merged this with master locally and all the tests pass -- so I am merging this one in


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1006934965


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -99,27 +99,176 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// assert_eq!(split_conjunction_owned(expr), split);
 /// ```
 pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+    split_binary_owned(expr, Operator::And)
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+/// Splits an owned binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// This is often used to "split" expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit, Operator};
+/// # use datafusion_optimizer::utils::split_binary_owned;
+/// # use std::ops::Add;
+/// // a=1 + b=2
+/// let expr = col("a").eq(lit(1)).add(col("b").eq(lit(2)));
+///
+/// // [a=1, b=2]
+/// let split = vec![
+///   col("a").eq(lit(1)),
+///   col("b").eq(lit(2)),
+/// ];
+///
+/// // use split_binary_owned to split them
+/// assert_eq!(split_binary_owned(expr, Operator::Plus), split);
+/// ```
+pub fn split_binary_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_binary_owned_impl(expr, op, vec![])
+}
+
+fn split_binary_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_binary_owned_impl(*left, operator, exprs);
+            split_binary_owned_impl(*right, operator, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_binary_owned_impl(*expr, operator, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+/// Splits an binary operator tree [`Expr`] such as `A <OP> B <OP> C` => `[A, B, C]`
+///
+/// See [`split_binary_owned`] for more details and an example.
+pub fn split_binary(expr: &Expr, op: Operator) -> Vec<&Expr> {
+    split_binary_impl(expr, op, vec![])
+}
+
+fn split_binary_impl<'a>(
+    expr: &'a Expr,
+    operator: Operator,
+    mut exprs: Vec<&'a Expr>,
+) -> Vec<&'a Expr> {
+    match expr {
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if *op == operator => {
+            let exprs = split_binary_impl(left, operator, exprs);
+            split_binary_impl(right, operator, exprs)
+        }
+        Expr::Alias(expr, _) => split_binary_impl(expr, operator, exprs),
+        other => {
+            exprs.push(other);
+            exprs
+        }
+    }
+}
+
+/// Given a list of lists of [`Expr`]s, returns a list of lists of
+/// [`Expr`]s of expressions where there is one expression from each
+/// from each of the input expressions
+///
+/// For example, given the input `[[a, b], [c], [d, e]]` returns
+/// `[a, c, d], [a, c, e], [b, c, d], [b, c, e]]`.
+fn permutations(mut exprs: VecDeque<Vec<&Expr>>) -> Vec<Vec<&Expr>> {
+    let first = if let Some(first) = exprs.pop_front() {
+        first
+    } else {
+        return vec![];
+    };
+
+    // base case:
+    if exprs.is_empty() {
+        first.into_iter().map(|e| vec![e]).collect()
+    } else {
+        first
+            .into_iter()
+            .flat_map(|expr| {
+                permutations(exprs.clone())
+                    .into_iter()
+                    .map(|expr_list| {
+                        // Create [expr, ...] for each permutation
+                        std::iter::once(expr)
+                            .chain(expr_list.into_iter())
+                            .collect::<Vec<&Expr>>()
+                    })
+                    .collect::<Vec<Vec<&Expr>>>()
+            })
+            .collect()
+    }
+}
+
+const MAX_CNF_REWRITE_CONJUNCTS: usize = 10;
+
+/// Tries to convert an expression to conjunctive normal form (CNF).
+///
+/// Does not convert the expression if the total number of conjuncts
+/// (exprs ANDed together) would exceed [`MAX_CNF_REWRITE_CONJUNCTS`].
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+///
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+///
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_optimizer::utils::cnf_rewrite;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// assert_eq!(expect, cnf_rewrite(expr));
+/// ```
+pub fn cnf_rewrite(expr: Expr) -> Expr {
+    // Find all exprs joined by OR
+    let disjuncts = split_binary(&expr, Operator::Or);
+
+    // For each expr, split now on AND
+    // A OR B OR C --> split each A, B and C
+    let disjunct_conjuncts: VecDeque<Vec<&Expr>> = disjuncts
+        .into_iter()
+        .map(|e| split_binary(e, Operator::And))
+        .collect::<VecDeque<_>>();
+
+    // Decide if we want to distribute the clauses. Heuristic is
+    // chosen to avoid creating huge predicates
+    let num_conjuncts = disjunct_conjuncts
+        .iter()
+        .fold(1usize, |sz, exprs| sz.saturating_mul(exprs.len()));
+
+    if disjunct_conjuncts.iter().any(|exprs| exprs.len() > 1)
+        && num_conjuncts < MAX_CNF_REWRITE_CONJUNCTS
+    {
+        let or_clauses = permutations(disjunct_conjuncts)
+            .into_iter()
+            // form the OR clauses( A OR B OR C ..)
+            .map(|exprs| disjunction(exprs.into_iter().cloned()).unwrap());

Review Comment:
   Yes that is my analysis too -- this code will clone exprs only if it rewrites the expression -- if not, it will not clone anything



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1000275988


##########
datafusion/optimizer/src/filter_push_down.rs:
##########
@@ -797,7 +817,7 @@ impl OptimizerRule for FilterPushDown {
         plan: &LogicalPlan,
         _: &mut OptimizerConfig,
     ) -> Result<LogicalPlan> {
-        optimize(plan, State::default())
+        optimize(plan, State::default().with_cnf_rewrite())

Review Comment:
   @alamb   test fail cause of after https://github.com/apache/arrow-datafusion/pull/3880, it runs rules multiple times.
   When after running first time it will produce `filter` front (near `scan`) the `join`, but we don't need write the filter-expr after pushDown (expect page index, i will manually add it), so i add this option.🤔



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1003766881


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -96,30 +96,173 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// ];
 ///
 /// // use split_conjunction_owned to split them
-/// assert_eq!(split_conjunction_owned(expr), split);
+/// assert_eq!(split_conjunction_owned(expr, Operator::And), split);
 /// ```
-pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+pub fn split_conjunction_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_conjunction_owned_impl(expr, op, vec![])
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+fn split_conjunction_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_conjunction_owned_impl(*left, Operator::And, exprs);
+            split_conjunction_owned_impl(*right, Operator::And, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, Operator::And, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+///  Converts an expression to conjunctive normal form (CNF).
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_expr::expr_rewriter::ExprRewritable;
+/// # use datafusion_optimizer::utils::CnfHelper;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// // use split_conjunction_owned to split them
+/// assert_eq!(expr.rewrite(& mut CnfHelper::new()).unwrap(), expect);
+/// ```
+///
+pub struct CnfHelper {
+    max_count: usize,
+    current_count: usize,
+    exprs: Vec<Expr>,
+    original_expr: Option<Expr>,
+}
+
+impl CnfHelper {
+    pub fn new() -> Self {
+        CnfHelper {
+            max_count: 50,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    pub fn new_with_max_count(max_count: usize) -> Self {
+        CnfHelper {
+            max_count,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    fn increment_and_check_overload(&mut self) -> bool {
+        self.current_count += 1;
+        self.current_count >= self.max_count
+    }
+}
+
+impl ExprRewriter for CnfHelper {

Review Comment:
   @Ted-Jiang  did you try to simplify this further -- if not, would you like me to try and find time to give it a try?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1003883373


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -96,30 +96,173 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// ];
 ///
 /// // use split_conjunction_owned to split them
-/// assert_eq!(split_conjunction_owned(expr), split);
+/// assert_eq!(split_conjunction_owned(expr, Operator::And), split);
 /// ```
-pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+pub fn split_conjunction_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_conjunction_owned_impl(expr, op, vec![])
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+fn split_conjunction_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_conjunction_owned_impl(*left, Operator::And, exprs);
+            split_conjunction_owned_impl(*right, Operator::And, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, Operator::And, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+///  Converts an expression to conjunctive normal form (CNF).
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_expr::expr_rewriter::ExprRewritable;
+/// # use datafusion_optimizer::utils::CnfHelper;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// // use split_conjunction_owned to split them
+/// assert_eq!(expr.rewrite(& mut CnfHelper::new()).unwrap(), expect);
+/// ```
+///
+pub struct CnfHelper {
+    max_count: usize,
+    current_count: usize,
+    exprs: Vec<Expr>,
+    original_expr: Option<Expr>,
+}
+
+impl CnfHelper {
+    pub fn new() -> Self {
+        CnfHelper {
+            max_count: 50,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    pub fn new_with_max_count(max_count: usize) -> Self {
+        CnfHelper {
+            max_count,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    fn increment_and_check_overload(&mut self) -> bool {
+        self.current_count += 1;
+        self.current_count >= self.max_count
+    }
+}
+
+impl ExprRewriter for CnfHelper {

Review Comment:
   I can continue to work on page index



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1294986422

   > I'm OK to merge the PR. But we might need a following PR to add a new rule to simply expressions before translate the logic Filter to physical FilterExec.
   
   I double checked @mingmwang 
   
   This code is called as part of `FilterPushdown`
   
   https://github.com/apache/arrow-datafusion/blob/318b4ad56c826b317b0201f7031a47609ad215b5/datafusion/optimizer/src/optimizer.rs#L174
   
   And then simplification is then called 
   
   https://github.com/apache/arrow-datafusion/blob/318b4ad56c826b317b0201f7031a47609ad215b5/datafusion/optimizer/src/optimizer.rs#L178-L183
   
   so I think the simplifications should happen before physical planning


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1293594384

   Perhaps @isidentical  or @mingmwang  would have time to review this logic as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1007258449


##########
benchmarks/expected-plans/q7.txt:
##########
@@ -3,7 +3,7 @@ Sort: shipping.supp_nation ASC NULLS LAST, shipping.cust_nation ASC NULLS LAST,
     Aggregate: groupBy=[[shipping.supp_nation, shipping.cust_nation, shipping.l_year]], aggr=[[SUM(shipping.volume)]]
       Projection: shipping.supp_nation, shipping.cust_nation, shipping.l_year, shipping.volume, alias=shipping
         Projection: n1.n_name AS supp_nation, n2.n_name AS cust_nation, datepart(Utf8("YEAR"), lineitem.l_shipdate) AS l_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) AS volume, alias=shipping
-          Filter: n1.n_name = Utf8("FRANCE") AND n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY") AND n2.n_name = Utf8("FRANCE")
+          Filter: (n1.n_name = Utf8("FRANCE") OR n2.n_name = Utf8("FRANCE")) AND (n2.n_name = Utf8("GERMANY") OR n1.n_name = Utf8("GERMANY"))

Review Comment:
   I've also wanted to check TPC-H (it shouldn't affect, but just to see if there is an unexpected regression). It seems like there aren't any regressions (`873.27 ms` vs  `878.54 ms`, only noise) 🚀 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#issuecomment-1294816091

   > I think for the Physical FilterExec, we should still keep the original form or a simplified form for runtime evaluation.
   
   @mingmwang I didn't quite understand this comment -- do you mean you think this PR is a good idea, or that it needs more work (to keep the original form)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1000275988


##########
datafusion/optimizer/src/filter_push_down.rs:
##########
@@ -797,7 +817,7 @@ impl OptimizerRule for FilterPushDown {
         plan: &LogicalPlan,
         _: &mut OptimizerConfig,
     ) -> Result<LogicalPlan> {
-        optimize(plan, State::default())
+        optimize(plan, State::default().with_cnf_rewrite())

Review Comment:
   @alamb   test failed cause of after https://github.com/apache/arrow-datafusion/pull/3880, it runs rules multiple times.
   When after running first time it will produce `filter` front (near `scan`) the `join`, but we don't need write the filter-expr after pushDown (expect page index, i will manually add it), so i add this option.🤔



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1003882983


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -96,30 +96,173 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// ];
 ///
 /// // use split_conjunction_owned to split them
-/// assert_eq!(split_conjunction_owned(expr), split);
+/// assert_eq!(split_conjunction_owned(expr, Operator::And), split);
 /// ```
-pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+pub fn split_conjunction_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_conjunction_owned_impl(expr, op, vec![])
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+fn split_conjunction_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_conjunction_owned_impl(*left, Operator::And, exprs);
+            split_conjunction_owned_impl(*right, Operator::And, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, Operator::And, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+///  Converts an expression to conjunctive normal form (CNF).
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_expr::expr_rewriter::ExprRewritable;
+/// # use datafusion_optimizer::utils::CnfHelper;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// // use split_conjunction_owned to split them
+/// assert_eq!(expr.rewrite(& mut CnfHelper::new()).unwrap(), expect);
+/// ```
+///
+pub struct CnfHelper {
+    max_count: usize,
+    current_count: usize,
+    exprs: Vec<Expr>,
+    original_expr: Option<Expr>,
+}
+
+impl CnfHelper {
+    pub fn new() -> Self {
+        CnfHelper {
+            max_count: 50,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    pub fn new_with_max_count(max_count: usize) -> Self {
+        CnfHelper {
+            max_count,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    fn increment_and_check_overload(&mut self) -> bool {
+        self.current_count += 1;
+        self.current_count >= self.max_count
+    }
+}
+
+impl ExprRewriter for CnfHelper {

Review Comment:
   @alamb you can try this 😂, maybe i will learn something from you, i am not very familiar with this rewrite framework



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #3903: Factorize common AND factors out of OR predicates to support filterPu…

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #3903:
URL: https://github.com/apache/arrow-datafusion/pull/3903#discussion_r1003882983


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -96,30 +96,173 @@ fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&
 /// ];
 ///
 /// // use split_conjunction_owned to split them
-/// assert_eq!(split_conjunction_owned(expr), split);
+/// assert_eq!(split_conjunction_owned(expr, Operator::And), split);
 /// ```
-pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {
-    split_conjunction_owned_impl(expr, vec![])
+pub fn split_conjunction_owned(expr: Expr, op: Operator) -> Vec<Expr> {
+    split_conjunction_owned_impl(expr, op, vec![])
 }
 
-fn split_conjunction_owned_impl(expr: Expr, mut exprs: Vec<Expr>) -> Vec<Expr> {
+fn split_conjunction_owned_impl(
+    expr: Expr,
+    operator: Operator,
+    mut exprs: Vec<Expr>,
+) -> Vec<Expr> {
     match expr {
-        Expr::BinaryExpr(BinaryExpr {
-            right,
-            op: Operator::And,
-            left,
-        }) => {
-            let exprs = split_conjunction_owned_impl(*left, exprs);
-            split_conjunction_owned_impl(*right, exprs)
+        Expr::BinaryExpr(BinaryExpr { right, op, left }) if op == operator => {
+            let exprs = split_conjunction_owned_impl(*left, Operator::And, exprs);
+            split_conjunction_owned_impl(*right, Operator::And, exprs)
         }
-        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, exprs),
+        Expr::Alias(expr, _) => split_conjunction_owned_impl(*expr, Operator::And, exprs),
         other => {
             exprs.push(other);
             exprs
         }
     }
 }
 
+///  Converts an expression to conjunctive normal form (CNF).
+///
+/// The following expression is in CNF:
+///  `(a OR b) AND (c OR d)`
+/// The following is not in CNF:
+///  `(a AND b) OR c`.
+/// But could be rewrite to a CNF expression:
+///  `(a OR c) AND (b OR c)`.
+///
+/// # Example
+/// ```
+/// # use datafusion_expr::{col, lit};
+/// # use datafusion_expr::expr_rewriter::ExprRewritable;
+/// # use datafusion_optimizer::utils::CnfHelper;
+/// // (a=1 AND b=2)OR c = 3
+/// let expr1 = col("a").eq(lit(1)).and(col("b").eq(lit(2)));
+/// let expr2 = col("c").eq(lit(3));
+/// let expr = expr1.or(expr2);
+///
+///  //(a=1 or c=3)AND(b=2 or c=3)
+/// let expr1 = col("a").eq(lit(1)).or(col("c").eq(lit(3)));
+/// let expr2 = col("b").eq(lit(2)).or(col("c").eq(lit(3)));
+/// let expect = expr1.and(expr2);
+/// // use split_conjunction_owned to split them
+/// assert_eq!(expr.rewrite(& mut CnfHelper::new()).unwrap(), expect);
+/// ```
+///
+pub struct CnfHelper {
+    max_count: usize,
+    current_count: usize,
+    exprs: Vec<Expr>,
+    original_expr: Option<Expr>,
+}
+
+impl CnfHelper {
+    pub fn new() -> Self {
+        CnfHelper {
+            max_count: 50,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    pub fn new_with_max_count(max_count: usize) -> Self {
+        CnfHelper {
+            max_count,
+            current_count: 0,
+            exprs: vec![],
+            original_expr: None,
+        }
+    }
+
+    fn increment_and_check_overload(&mut self) -> bool {
+        self.current_count += 1;
+        self.current_count >= self.max_count
+    }
+}
+
+impl ExprRewriter for CnfHelper {

Review Comment:
   @alamb Yes, you can try this 😂, maybe i will learn something from you, i am not very familiar with this rewrite framework



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org