You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/12 14:36:17 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #3810: Improve the ergonomics of expression manipulation: `combine/split filter` `conjunction`, etc

alamb opened a new pull request, #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810

   DRAFT as it builds on https://github.com/apache/arrow-datafusion/pull/3809
   
   
   # Which issue does this PR close?
   
   
   Closes https://github.com/apache/arrow-datafusion/issues/3808
   
   
    # Rationale for this change
   
   The APIs for manipulating expressions were all over the map (sometimes return Vec, sometimes taking them as mut, owned, etc) as well as being inconsistently named and inconsistently tested. 
   
   In fact I couldn't find `split_filter` (which is similar, but not the same as `split_conjunction`)
   
   
   
   # What changes are included in this PR?
   Change the APIs to be consistently named and take reasonable arguments
   
   1. Change `split_conjunction` to return a `Vec` thus simplifying the API
   2. Rename split_filter to `split_conjunction_owned` to make it clear how it is related to `split_conjunction`
   3. Expand test coverage
   2. Rename `combine_filter` to `conjunction`
   2. Rename `combine_filter_disjunction` to `disjunction`
   4. Change APIs for `conjunction` and `disjunction` to avoid clones
   
   I will highlight the changes inline. 
   
   # Are there any user-facing changes?
   If anyone uses these APIs they will need to change them, but hopfully


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #3810: Make expression manipulation consistent and easier to use: `combine/split filter` `conjunction`, etc

Posted by GitBox <gi...@apache.org>.
alamb merged PR #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3810: Improve the ergonomics of expression manipulation: `combine/split filter` `conjunction`, etc

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810#discussion_r993803786


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -51,62 +51,72 @@ pub fn optimize_children(
     from_plan(plan, &new_exprs, &new_inputs)
 }
 
-/// converts "A AND B AND C" => [A, B, C]
-pub fn split_conjunction<'a>(predicate: &'a Expr, predicates: &mut Vec<&'a Expr>) {
-    match predicate {
+/// Splits a conjunctive [`Expr`] such as `A AND B AND C` => `[A, B, C]`
+///
+/// This is often used to "split" filter expressions such as `col1 = 5
+/// AND col2 = 10` into [`col1 = 5`, `col2 = 10`];
+pub fn split_conjunction(expr: &Expr) -> Vec<&Expr> {
+    split_conjunction_impl(expr, vec![])
+}
+
+fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&'a Expr> {
+    match expr {
         Expr::BinaryExpr {
             right,
             op: Operator::And,
             left,
         } => {
-            split_conjunction(left, predicates);
-            split_conjunction(right, predicates);
+            let exprs = split_conjunction_impl(left, exprs);
+            split_conjunction_impl(right, exprs)
         }
-        Expr::Alias(expr, _) => {
-            split_conjunction(expr, predicates);
+        Expr::Alias(expr, _) => split_conjunction_impl(expr, exprs),
+        other => {
+            exprs.push(other);
+            exprs
         }
-        other => predicates.push(other),
     }
 }
 
-/// Combines an array of filter expressions into a single filter expression
-/// consisting of the input filter expressions joined with logical AND.
-/// Returns None if the filters array is empty.
-pub fn combine_filters(filters: &[Expr]) -> Option<Expr> {
-    if filters.is_empty() {
-        return None;
-    }
-    let combined_filter = filters
-        .iter()
-        .skip(1)
-        .fold(filters[0].clone(), |acc, filter| and(acc, filter.clone()));
-    Some(combined_filter)
+/// Splits an owned conjunctive [`Expr`] such as `A AND B AND C` => `[A, B, C]`
+///
+/// See [`split_conjunction`] for more details.
+pub fn split_conjunction_owned(expr: Expr) -> Vec<Expr> {

Review Comment:
   this function used to be called `uncombine_filters` which was not at all consistent with the `split_conjunction` that took a reference 🤯 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #3810: Make expression manipulation consistent and easier to use: `combine/split filter` `conjunction`, etc

Posted by GitBox <gi...@apache.org>.
jackwener commented on code in PR #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810#discussion_r994609709


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -424,31 +468,74 @@ mod tests {
     use super::*;
     use arrow::datatypes::DataType;
     use datafusion_common::Column;
-    use datafusion_expr::{binary_expr, col, lit, utils::expr_to_columns};
+    use datafusion_expr::{col, lit, utils::expr_to_columns};
     use std::collections::HashSet;
     use std::ops::Add;
 
     #[test]
-    fn combine_zero_filters() {
-        let result = combine_filters(&[]);
-        assert_eq!(result, None);
+    fn test_split_conjunction() {

Review Comment:
   IMO, I think we can add test for `conjunction()`. At the same time, we can check the tree structure of this expression by using `match`.
   
   ```rs
   expr is (A B) C;
   
   /// using `match` to check.
   match expr {
      And(
         And(
            B,
            C
         ),
         C
      )
   }
   ```
   
   ```
   [A, B, C ,D , E] -> (((A B) C) (D E)) is different from ((((A B) C) D) E).
   ```
   
   we can see the result of `conjunction()` in the UT, ensure the result of tree structure of `conjunction()` 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #3810: Make expression manipulation consistent and easier to use: `combine/split filter` `conjunction`, etc

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810#issuecomment-1279731019

   Benchmark runs are scheduled for baseline = 0b90a8a5c2635e08e80995954271fd06a256ac96 and contender = fc5081d48ef59e39c1b353dd45fcd13af6186676. fc5081d48ef59e39c1b353dd45fcd13af6186676 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/02b1805cbd8a4ba99a7b25f9a40c17af...944a75a5937f4c0da7e43419268f4942/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/26adca3eb26d426fa4123cf84228a3dc...7ffdebdac5d04ac8b8c06090f2a6a029/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/80514308bf8f4b9cbd8019cabcf7096e...2ba0aba120224524a4e79c85b91040a6/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a26348e698ca4067b937dc77cff4e7d4...959df79a15f542e2b24f2fff082c390e/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3810: Improve the ergonomics of expression manipulation: `combine/split filter` `conjunction`, etc

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810#discussion_r993559271


##########
benchmarks/src/bin/tpch.rs:
##########
@@ -766,7 +766,8 @@ mod tests {
             if !actual.is_empty() {
                 actual += "\n";
             }
-            actual += &format!("{}", plan.display_indent());
+            use std::fmt::Write as _;

Review Comment:
   clippy told me to do this



##########
datafusion/optimizer/src/filter_push_down.rs:
##########
@@ -341,8 +341,7 @@ fn optimize(plan: &LogicalPlan, mut state: State) -> Result<LogicalPlan> {
         }
         LogicalPlan::Analyze { .. } => push_down(&state, plan),
         LogicalPlan::Filter(Filter { input, predicate }) => {
-            let mut predicates = vec![];
-            utils::split_conjunction(predicate, &mut predicates);
+            let predicates = utils::split_conjunction(predicate);

Review Comment:
   here is an example where the `split_conjuntion` API is easier to use now



##########
datafusion/optimizer/src/decorrelate_where_in.rs:
##########
@@ -175,7 +173,7 @@ fn optimize_where_in(
     // build subquery side of join - the thing the subquery was querying
     let subqry_alias = format!("__sq_{}", optimizer_config.next_id());
     let mut subqry_plan = LogicalPlanBuilder::from((*subqry_input).clone());
-    if let Some(expr) = combine_filters(&other_subqry_exprs) {
+    if let Some(expr) = conjunction(other_subqry_exprs) {

Review Comment:
   here is an example of less copying (can use `other_subqry_exprs`directly)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #3810: Make expression manipulation consistent and easier to use: `combine/split filter` `conjunction`, etc

Posted by GitBox <gi...@apache.org>.
jackwener commented on code in PR #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810#discussion_r994609709


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -424,31 +468,74 @@ mod tests {
     use super::*;
     use arrow::datatypes::DataType;
     use datafusion_common::Column;
-    use datafusion_expr::{binary_expr, col, lit, utils::expr_to_columns};
+    use datafusion_expr::{col, lit, utils::expr_to_columns};
     use std::collections::HashSet;
     use std::ops::Add;
 
     #[test]
-    fn combine_zero_filters() {
-        let result = combine_filters(&[]);
-        assert_eq!(result, None);
+    fn test_split_conjunction() {

Review Comment:
   IMO, I think we can add test for `conjunction()`. At the same time, we can check the tree structure of this expression by using `match`.
   
   ```rs
   expr is (A B) C;
   
   /// using `match` to check.
   match expr {
      And(
         And(
            B,
            C
         ),
         C
      )
   }
   ```
   
   [A, B, C ,D , E] -> (((A B) C) (D E)) is different from ((((A B) C) D) E).
   
   we can see the result of `conjunction()` in the UT
   
   



##########
datafusion/optimizer/src/utils.rs:
##########
@@ -51,62 +51,106 @@ pub fn optimize_children(
     from_plan(plan, &new_exprs, &new_inputs)
 }
 
-/// converts "A AND B AND C" => [A, B, C]
-pub fn split_conjunction<'a>(predicate: &'a Expr, predicates: &mut Vec<&'a Expr>) {
-    match predicate {
+/// Splits a conjunctive [`Expr`] such as `A AND B AND C` => `[A, B, C]`
+///
+/// See [`split_conjunction_owned`] for more details and an example.
+pub fn split_conjunction(expr: &Expr) -> Vec<&Expr> {
+    split_conjunction_impl(expr, vec![])
+}
+
+fn split_conjunction_impl<'a>(expr: &'a Expr, mut exprs: Vec<&'a Expr>) -> Vec<&'a Expr> {
+    match expr {
         Expr::BinaryExpr {
             right,
             op: Operator::And,
             left,
         } => {
-            split_conjunction(left, predicates);
-            split_conjunction(right, predicates);
+            let exprs = split_conjunction_impl(left, exprs);
+            split_conjunction_impl(right, exprs)
         }
-        Expr::Alias(expr, _) => {
-            split_conjunction(expr, predicates);
+        Expr::Alias(expr, _) => split_conjunction_impl(expr, exprs),

Review Comment:
   👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #3810: Make expression manipulation consistent and easier to use: `combine/split filter` `conjunction`, etc

Posted by GitBox <gi...@apache.org>.
jackwener commented on code in PR #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810#discussion_r994609709


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -424,31 +468,74 @@ mod tests {
     use super::*;
     use arrow::datatypes::DataType;
     use datafusion_common::Column;
-    use datafusion_expr::{binary_expr, col, lit, utils::expr_to_columns};
+    use datafusion_expr::{col, lit, utils::expr_to_columns};
     use std::collections::HashSet;
     use std::ops::Add;
 
     #[test]
-    fn combine_zero_filters() {
-        let result = combine_filters(&[]);
-        assert_eq!(result, None);
+    fn test_split_conjunction() {

Review Comment:
   IMO, I think we can add test for `conjunction()`. At the same time, we can check the tree structure of this expression by using `match`.
   
   ```rs
   expr is (A B) C;
   
   /// using `match` to check.
   match expr {
      And(
         And(
            B,
            C
         ),
         C
      )
   }
   ```
   
   ```
   [A, B, C ,D , E] -> (((A B) C) (D E)) is different from ((((A B) C) D) E).
   ```
   
   we can see the result of `conjunction()` in the UT
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3810: Make expression manipulation consistent and easier to use: `combine/split filter` `conjunction`, etc

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3810:
URL: https://github.com/apache/arrow-datafusion/pull/3810#discussion_r996289143


##########
datafusion/optimizer/src/utils.rs:
##########
@@ -424,31 +468,74 @@ mod tests {
     use super::*;
     use arrow::datatypes::DataType;
     use datafusion_common::Column;
-    use datafusion_expr::{binary_expr, col, lit, utils::expr_to_columns};
+    use datafusion_expr::{col, lit, utils::expr_to_columns};
     use std::collections::HashSet;
     use std::ops::Add;
 
     #[test]
-    fn combine_zero_filters() {
-        let result = combine_filters(&[]);
-        assert_eq!(result, None);
+    fn test_split_conjunction() {

Review Comment:
   This was a great idea @jackwener  -- thank you. I added this test  09f3cf4e4
   
   As part of writing tests, I also found that the API for `disjunction` was slightly different than `conjunction` so I made them the same as well. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org