You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/20 20:04:46 UTC

[GitHub] [arrow-datafusion] isidentical opened a new pull request, #3912: [wip]: Expression boundary analysis framework

isidentical opened a new pull request, #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912

   This is just a draft to show what is currently being planned and how it can be integrated for #3898.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1309062976

   Benchmark runs are scheduled for baseline = 0f18e76f19306072432da18b707cb68e457ffbfe and contender = 01941aae26ca14f765249b00c0ad83c28285167c. 01941aae26ca14f765249b00c0ad83c28285167c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/bb73a1e5906847e09fa4c2f9875c60d5...28b13551e9cc477583d4884c20e5d68a/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/88d236c3cc44448c99084ed260086e17...ad3010c7afcd48e583ed1235ef415ece/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8ceb31a93dde43c58a0e950abc65ea4a...32edb5a7d853462e975ebbbcba346141/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/132b8c8a976d4bcb98ef00b22c4cb1dd...ec1d10bcfdbc424cb63061c88a6d017b/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
alamb merged PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1002122290


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -771,32 +745,47 @@ fn compare_left_boundaries(
             // it is actually smaller than what we have right now) and it is a valid
             // value (e.g. [0, 100] < -100 would update the boundaries to [0, -100] if
             // there weren't the selectivity check).
-            if rhs_scalar_value < variadic_max && selectivity > 0.0 {
-                (variadic_min, rhs_scalar_value)
+            if right < left_max && selectivity > 0.0 {
+                (left_min, right)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         Operator::Gt | Operator::GtEq => {
             // Same as above, but this time we want to limit the lower bound.
-            if rhs_scalar_value > variadic_min && selectivity > 0.0 {
-                (rhs_scalar_value, variadic_max)
+            if right > left_min && selectivity > 0.0 {
+                (right, left_max)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         // For equality, we don't have the range problem so even if the selectivity
         // is 0.0, we can still update the boundaries.
-        Operator::Eq => (rhs_scalar_value.clone(), rhs_scalar_value),
+        Operator::Eq => (right.clone(), right),
         _ => unreachable!(),
     };
 
-    Some(ExprBoundaries {
-        min_value: new_min,
-        max_value: new_max,
-        distinct_count,
-        selectivity: Some(selectivity),
-    })
+    // The context represents all the knowledge we have gathered during the
+    // analysis process, which we can now add more since the expression's upper
+    // and lower boundaries might have changed.
+    let left_bounds = ExprBoundaries::new(left_min, left_max, left_bounds.distinct_count);
+    left.apply(context, &left_bounds);

Review Comment:
   🤔  don't the rules about combining information about the left and right side depend on the operators?
   
   Like if you have
   
   ```
   (a < 5) OR (a > 10)
   ```
   
   You would apply the boundaries differently than if it were a conjunction?
   
   ```
   (a < 5) AND (a > 10)
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1300829840

   Also CC: @Dandandan @andygrove


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1002562258


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -771,32 +745,47 @@ fn compare_left_boundaries(
             // it is actually smaller than what we have right now) and it is a valid
             // value (e.g. [0, 100] < -100 would update the boundaries to [0, -100] if
             // there weren't the selectivity check).
-            if rhs_scalar_value < variadic_max && selectivity > 0.0 {
-                (variadic_min, rhs_scalar_value)
+            if right < left_max && selectivity > 0.0 {
+                (left_min, right)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         Operator::Gt | Operator::GtEq => {
             // Same as above, but this time we want to limit the lower bound.
-            if rhs_scalar_value > variadic_min && selectivity > 0.0 {
-                (rhs_scalar_value, variadic_max)
+            if right > left_min && selectivity > 0.0 {
+                (right, left_max)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         // For equality, we don't have the range problem so even if the selectivity
         // is 0.0, we can still update the boundaries.
-        Operator::Eq => (rhs_scalar_value.clone(), rhs_scalar_value),
+        Operator::Eq => (right.clone(), right),
         _ => unreachable!(),
     };
 
-    Some(ExprBoundaries {
-        min_value: new_min,
-        max_value: new_max,
-        distinct_count,
-        selectivity: Some(selectivity),
-    })
+    // The context represents all the knowledge we have gathered during the
+    // analysis process, which we can now add more since the expression's upper
+    // and lower boundaries might have changed.
+    let left_bounds = ExprBoundaries::new(left_min, left_max, left_bounds.distinct_count);
+    left.apply(context, &left_bounds);

Review Comment:
   I've give it a try for `AND`, it was surprisingly easy to implement it 😄 d95ce05e7e13d119eca506695bbdf97ba57d1615



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1303147397

   @isidentical One quick question here, a complex binary expressions mixed of AND/OR expressions might generate multiple discrete intervals, make it looks like histograms actually. How can our API models this ? 
   
   Column a boundaries [0, 100]
   
   Expressions:
   a = 1 or a = 9  or (a > 30 and a  <= 50) or (a > 70 and a  <= 80)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1303118112

   Can we merge this PR first ? Then we can continue and do some experimentations with those APIs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
metesynnada commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1291681857

   We are also considering a range optimization on hash join, on queries like
   
   ```sql
   SELECT ...
   FROM left, right
   WHERE left.x > right.y AND
   		left.x + left.y < right.x + 10 AND
   		left.z = left.x
   ```
   
   and it paves the way for a streaming execution of the `HashJoinExec`. We’d like to collaborate on this PR. I guess a google doc would be the perfect medium to share ideas @isidentical.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1302622371

   > How about analyze()?
   
   `analyze()` is the endpoint we have that takes a context and then produces the boundaries. `apply()` in that sense is taking the context and updating the existing boundaries for that expression. E.g. where `a + 1 > 20 AND a < 70`, we need a way of changing the context to mark `a`'s lower boundaries to be `19` (`20 - 1`). Normally if `a` was just a column (like `a < 70`), we could downcast it to a `ColumnExpr` during the comparison; learn its position in the current schema and update it directly on the context. But since `a + 1` is the expression that has the lower boundary of `20`; I was thinking we can provide an additional method where each physical expression can decide how they want to propagate the information to their respective sub-expressions until they reach a column. 
   
   Like the example in [here](https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1286012013).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1301142778

   > Is apply() a good name? I thought about it a lot, but I am still not sure. Maybe something like update_context.
   
   How about `analyze()`?
   
   > Should I still emphasize this is an experimental API (I think we should?)
   
   I agree


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1002132576


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -771,32 +745,47 @@ fn compare_left_boundaries(
             // it is actually smaller than what we have right now) and it is a valid
             // value (e.g. [0, 100] < -100 would update the boundaries to [0, -100] if
             // there weren't the selectivity check).
-            if rhs_scalar_value < variadic_max && selectivity > 0.0 {
-                (variadic_min, rhs_scalar_value)
+            if right < left_max && selectivity > 0.0 {
+                (left_min, right)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         Operator::Gt | Operator::GtEq => {
             // Same as above, but this time we want to limit the lower bound.
-            if rhs_scalar_value > variadic_min && selectivity > 0.0 {
-                (rhs_scalar_value, variadic_max)
+            if right > left_min && selectivity > 0.0 {
+                (right, left_max)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         // For equality, we don't have the range problem so even if the selectivity
         // is 0.0, we can still update the boundaries.
-        Operator::Eq => (rhs_scalar_value.clone(), rhs_scalar_value),
+        Operator::Eq => (right.clone(), right),
         _ => unreachable!(),
     };
 
-    Some(ExprBoundaries {
-        min_value: new_min,
-        max_value: new_max,
-        distinct_count,
-        selectivity: Some(selectivity),
-    })
+    // The context represents all the knowledge we have gathered during the
+    // analysis process, which we can now add more since the expression's upper
+    // and lower boundaries might have changed.
+    let left_bounds = ExprBoundaries::new(left_min, left_max, left_bounds.distinct_count);
+    left.apply(context, &left_bounds);

Review Comment:
   > 🤔 don't the rules about combining information about the left and right side depend on the operators?
   
   Exactly! And that is certainly possible with the current implementation (all the control is delegated to expression's own `analyze()` implementation). This part is about a single comparison, but the same system (albeit a bit differently) can also be used with the conjunctions (like the following example, originally from #3898),
   
   > Each expression can either decide to share all the context with its children (e.g. in the case of AND, they'll all share a single context so all the following expressions would know that a for example can be only greater than 6 at this point etc.) or fork the context and then merge it together by using its logic (OR is the most basic example where you have to take the union of the ranges since it could be either of them).
   
   If we were to be implement `AND` and `OR`, then `AND` would pretty much pass the same context to throughout the conjunction. But for `OR`, when diverging to left and right, we would pass separate contexts (clone of the current version) and when processing for both of them is completed we can merge the boundaries of the columns in those contexts and update the existing context (basically a union of left and right, `(a < 5) OR (a > 10)` would say `bounds(a) = [min(a), 5] U [10, max(a)]`). This allows us to robustly handle chains of ANDs/ORs without worrying about it from the call sites (all the responsibility is delegated to the parent expression, which knows how to handle its children better than anyone 🙂).
   
   If it sounds interesting, I can try to give it a shot at implementing AND/OR s in this PR (to have an actual use case). WDYT? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1002134012


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -771,32 +745,47 @@ fn compare_left_boundaries(
             // it is actually smaller than what we have right now) and it is a valid
             // value (e.g. [0, 100] < -100 would update the boundaries to [0, -100] if
             // there weren't the selectivity check).
-            if rhs_scalar_value < variadic_max && selectivity > 0.0 {
-                (variadic_min, rhs_scalar_value)
+            if right < left_max && selectivity > 0.0 {
+                (left_min, right)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         Operator::Gt | Operator::GtEq => {
             // Same as above, but this time we want to limit the lower bound.
-            if rhs_scalar_value > variadic_min && selectivity > 0.0 {
-                (rhs_scalar_value, variadic_max)
+            if right > left_min && selectivity > 0.0 {
+                (right, left_max)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         // For equality, we don't have the range problem so even if the selectivity
         // is 0.0, we can still update the boundaries.
-        Operator::Eq => (rhs_scalar_value.clone(), rhs_scalar_value),
+        Operator::Eq => (right.clone(), right),
         _ => unreachable!(),
     };
 
-    Some(ExprBoundaries {
-        min_value: new_min,
-        max_value: new_max,
-        distinct_count,
-        selectivity: Some(selectivity),
-    })
+    // The context represents all the knowledge we have gathered during the
+    // analysis process, which we can now add more since the expression's upper
+    // and lower boundaries might have changed.
+    let left_bounds = ExprBoundaries::new(left_min, left_max, left_bounds.distinct_count);
+    left.apply(context, &left_bounds);

Review Comment:
   Slight note (after reading your comment and mine again): `apply()` is not a function with expression level side effects. It changes stuff in the given context. So if the `OR` passes two different contexts to its two sides, then they will each make changes in their own isolated context (which after the analysis is over, we can actually see and try to merge inside `OR`'s `analyze` implementation).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1013747993


##########
datafusion/physical-expr/src/expressions/column.rs:
##########
@@ -107,6 +102,18 @@ impl PhysicalExpr for Column {
     ) -> Result<Arc<dyn PhysicalExpr>> {
         Ok(self)
     }
+
+    /// Return the boundaries of this column, if known.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {

Review Comment:
   Why not `boundaries` or `expression_boundaries`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1001063513


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -771,32 +745,47 @@ fn compare_left_boundaries(
             // it is actually smaller than what we have right now) and it is a valid
             // value (e.g. [0, 100] < -100 would update the boundaries to [0, -100] if
             // there weren't the selectivity check).
-            if rhs_scalar_value < variadic_max && selectivity > 0.0 {
-                (variadic_min, rhs_scalar_value)
+            if right < left_max && selectivity > 0.0 {
+                (left_min, right)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         Operator::Gt | Operator::GtEq => {
             // Same as above, but this time we want to limit the lower bound.
-            if rhs_scalar_value > variadic_min && selectivity > 0.0 {
-                (rhs_scalar_value, variadic_max)
+            if right > left_min && selectivity > 0.0 {
+                (right, left_max)
             } else {
-                (variadic_min, variadic_max)
+                (left_min, left_max)
             }
         }
         // For equality, we don't have the range problem so even if the selectivity
         // is 0.0, we can still update the boundaries.
-        Operator::Eq => (rhs_scalar_value.clone(), rhs_scalar_value),
+        Operator::Eq => (right.clone(), right),
         _ => unreachable!(),
     };
 
-    Some(ExprBoundaries {
-        min_value: new_min,
-        max_value: new_max,
-        distinct_count,
-        selectivity: Some(selectivity),
-    })
+    // The context represents all the knowledge we have gathered during the
+    // analysis process, which we can now add more since the expression's upper
+    // and lower boundaries might have changed.
+    let left_bounds = ExprBoundaries::new(left_min, left_max, left_bounds.distinct_count);
+    left.apply(context, &left_bounds);

Review Comment:
   @alamb this is where we use `apply` (to update the current context with the `left`'s new lower/upper bounds)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1303915844

   Marking the PR draft until we reach a consensus on the API design. @alamb @Dandandan how do we feel about:
   - `fn boundaries(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries>;`
   - `fn apply_boundaries(&self, context: &mut AnalysisContext, boundaries: &ExprBoundaries);`
   
   I know there are still some reservations on the context side, but I am out of ideas on how to make it simpler while preserving the same set of features (allowing dynamic column boundary updates from sub-expressions; being able to fork it at certain points (like ORs) and still able to delegate all this information back to the call side [so not only to sub-expressions but also the expression itself who is calling us]). I'd be happy to try any suggestions though.
   
   @mingmwang on the point of discrete intervals, yes it is a matter of histograms. Currently we would have a non-uniform distribution but would represent it as a uniform range (`a in (1, 2, 24, 25)` would be represented as `{min=1, max=25, distinct=4}` which would imply a distribution of `1, 9, 17, 25` so not as precise). In the following phases, I hope to add a ValueDistribution construct to the `ExprBoundaries` so we would have a smaller-scoped histogram with all these ranges but we are not there yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1303047208

   > 
   
   
   
   > We are also considering a range optimization on hash join, on queries like
   > 
   > ```sql
   > SELECT ...
   > FROM left, right
   > WHERE left.x > right.y AND
   > 		left.x + left.y < right.x + 10 AND
   > 		left.z = left.x
   > ```
   > 
   > and it paves the way for a streaming execution of the `HashJoinExec`. We’d like to collaborate on this PR. I guess a google doc would be the perfect medium to share ideas @isidentical.
   
   Why this SQL can run `HashJoinExec` ? Could you please explain a bit? Since there is no equal-join conditions, we can not derive any equal join conditions from the original expression either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1306746408

   Based my understanding, I think the selectivity and row count estimation is more important, the column boundaries estimation and derivation is error prone, especially after couple of complex expressions, or some advanced SQL operators, we might have to give up the column boundaries and just propagate the estimation of the row counts.
   
   ````
   SELECT * FROM LINEITEM WHERE L_EXTENDEDPRICE*(1-L DISCOUNT) > 90000
   ````


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1013706030


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -665,87 +656,81 @@ impl PhysicalExpr for BinaryExpr {
             children[1].clone(),
         )))
     }
-}
-
-impl PartialEq<dyn Any> for BinaryExpr {
-    fn eq(&self, other: &dyn Any) -> bool {
-        down_cast_any_ref(other)
-            .downcast_ref::<Self>()
-            .map(|x| self.left.eq(&x.left) && self.op == x.op && self.right.eq(&x.right))
-            .unwrap_or(false)
-    }
-}
-
-struct BinaryExprStats {
-    op: Operator,
-    left: Arc<dyn PhysicalExpr>,
-    right: Arc<dyn PhysicalExpr>,
-}
 
-impl PhysicalExprStats for BinaryExprStats {
-    fn boundaries(&self, columns: &[ColumnStatistics]) -> Option<ExprBoundaries> {
+    /// Return the boundaries of this binary expression's result. If the expression itself
+    /// is a comparison which changes the boundaries of one of its inputs (a = 20 would pin
+    /// a to 20), then it might update the input's boundaries directly on the context.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {

Review Comment:
   > Since AnalysisContext is in terms of the "current schema" it will likely almost always end up being modified if the execution plan modifies the schema (which most do).
   > 
   > I find that the mix of "input" and "output" of a `mut &AnalysisContext` to be confusing to explain.
   > 
   > Since an `AnalysisContext` has `ExprBoundaries` in it anyways, I recommend the signature be
   > 
   > ```rust
   > fn analyze(&self) -> AnalysisContext {
   > ...
   > }
   > ```
   > 
   > And then pass each input can produce the context / boundary analysis for itself and the parent operators can combine them
   > 
   > Given that the inputs will produced an owned AnalysisContext it will be clear what is input and what is output and I think it can be made quite fast in Rust
   
   I would prefer this change also. It makes the signature more clear. It is similar to the `EquivalenceProperties` in the ExecutionPlan, mixing "input" and "output" is very confusing. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1013706030


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -665,87 +656,81 @@ impl PhysicalExpr for BinaryExpr {
             children[1].clone(),
         )))
     }
-}
-
-impl PartialEq<dyn Any> for BinaryExpr {
-    fn eq(&self, other: &dyn Any) -> bool {
-        down_cast_any_ref(other)
-            .downcast_ref::<Self>()
-            .map(|x| self.left.eq(&x.left) && self.op == x.op && self.right.eq(&x.right))
-            .unwrap_or(false)
-    }
-}
-
-struct BinaryExprStats {
-    op: Operator,
-    left: Arc<dyn PhysicalExpr>,
-    right: Arc<dyn PhysicalExpr>,
-}
 
-impl PhysicalExprStats for BinaryExprStats {
-    fn boundaries(&self, columns: &[ColumnStatistics]) -> Option<ExprBoundaries> {
+    /// Return the boundaries of this binary expression's result. If the expression itself
+    /// is a comparison which changes the boundaries of one of its inputs (a = 20 would pin
+    /// a to 20), then it might update the input's boundaries directly on the context.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {

Review Comment:
   > Since AnalysisContext is in terms of the "current schema" it will likely almost always end up being modified if the execution plan modifies the schema (which most do).
   > 
   > I find that the mix of "input" and "output" of a `mut &AnalysisContext` to be confusing to explain.
   > 
   > Since an `AnalysisContext` has `ExprBoundaries` in it anyways, I recommend the signature be
   > 
   > ```rust
   > fn analyze(&self) -> AnalysisContext {
   > ...
   > }
   > ```
   > 
   > And then pass each input can produce the context / boundary analysis for itself and the parent operators can combine them
   > 
   > Given that the inputs will produced an owned AnalysisContext it will be clear what is input and what is output and I think it can be made quite fast in Rust
   
   I would prefer this change also. It makes the signature more clear. It is similar to the `EquivalenceProperties` in the ExecutionPlan, mixing "input" and "output" is very confusing. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1306340705

   @alamb I think that sounds like a good plan! The primary place where this would be essential (or at least, I'd interpret it as essential; but maybe we can find a simpler solution) is `AND`s which we can work on in the subsequent PRs.
   
   For this PR, I've dropped support for 'intermediate column updates': so `apply()` API is gone, `context` is now shared as a regular reference (`&AnalysisContext`) and all the phases for intermediate analyses are removed (e.g. new upper bound computation from comparison analyzer).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1306728884

   Can we get this PR merged this week?
   I would like to start working on the Filter Selectivity estimation PR base on those new APIs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1013331395


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -665,87 +656,81 @@ impl PhysicalExpr for BinaryExpr {
             children[1].clone(),
         )))
     }
-}
-
-impl PartialEq<dyn Any> for BinaryExpr {
-    fn eq(&self, other: &dyn Any) -> bool {
-        down_cast_any_ref(other)
-            .downcast_ref::<Self>()
-            .map(|x| self.left.eq(&x.left) && self.op == x.op && self.right.eq(&x.right))
-            .unwrap_or(false)
-    }
-}
-
-struct BinaryExprStats {
-    op: Operator,
-    left: Arc<dyn PhysicalExpr>,
-    right: Arc<dyn PhysicalExpr>,
-}
 
-impl PhysicalExprStats for BinaryExprStats {
-    fn boundaries(&self, columns: &[ColumnStatistics]) -> Option<ExprBoundaries> {
+    /// Return the boundaries of this binary expression's result. If the expression itself
+    /// is a comparison which changes the boundaries of one of its inputs (a = 20 would pin
+    /// a to 20), then it might update the input's boundaries directly on the context.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {

Review Comment:
   > Since AnalysisContext is in terms of the "current schema" it will likely almost always end up being modified if the execution plan modifies the schema (which most do).
   
   You can think about as if it were a Statistics for that execution plan. It will be generated by the same place and will be constructed from the existing statistics (at least in `FilterExec`).
   
   > Since an AnalysisContext has ExprBoundaries in it anyways, I recommend the signature be
   > fn analyze(&self) -> AnalysisContext {
   
   If I understood it correctly, the problem with this approach is that it won't be able to see the existing boundaries of any columns. The place that wants to do the analysis needs to pass column information (otherwise it is not going to be a very useful analysis; unless the expression itself is self-sufficient `1 > 2`). All the children also need to receive a context, but depending on the parent (`OR` for example) it might be a different one than the one we received initially.
   
   > As I have mentioned before, I don't fully understand specially how column_boundaries can be used, but I am interested in seeing an example
   
   The only 'unknown' during the analysis process is the column references. Anything else can be inferred by the behaviour of the particular expression itself. So we need a way of conveying that information to each sub-expression when doing the analysis. I initially started using a vector of column statistics but then once we had to change it on certain expressions, like for example `a > 20 AND a < 70` where we have different boundaries for `a` when evaluating the second part of the conjunction (`a < 70`), I went with a mutable/shared context.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1013748292


##########
datafusion/physical-expr/src/expressions/column.rs:
##########
@@ -107,6 +102,18 @@ impl PhysicalExpr for Column {
     ) -> Result<Arc<dyn PhysicalExpr>> {
         Ok(self)
     }
+
+    /// Return the boundaries of this column, if known.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {
+        assert!(self.index < context.column_boundaries.len());
+        context.column_boundaries[self.index].clone()
+    }
+
+    /// Update the boundaries of this column on the given context.
+    fn apply(&self, context: &mut AnalysisContext, boundaries: &ExprBoundaries) {

Review Comment:
   `apply_boundaries`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1292226439

   @metesynnada I've started a draft of the whole CBOs and statistics (including the expression analysis framework) on a [google doc](https://github.com/apache/arrow-datafusion/issues/3929#issuecomment-1292224231), would love to hear more on what sort of stuff range optimization would need and how we could reshape this API to be able to easily used for that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1303048422

   `
   SELECT ...
   FROM left, right
   WHERE left.x > right.y AND
   		left.x + left.y < right.x + 10 AND
   		left.z = left.x
   `
   
   Same question. For this non-equal join, how can we apply hash join?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1290844659

   I would personally recommend starting with a google doc if you can do that (as I have found it is easiest to quickly iterate with them) and then we convert that to a document in https://arrow.apache.org/datafusion/contributor-guide/specification/index.html 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1306183236

   > I know there are still some reservations on the context side, but I am out of ideas on how to make it simpler while preserving the same set of features (allowing dynamic column boundary updates from sub-expressions; being able to fork it at certain points (like ORs) and still able to delegate all this information back to the call side [so not only to sub-expressions but also the expression itself who is calling us]). I'd be happy to try any suggestions though.
   
   I feel like we are somewhat stuck because the usecase (at least to me) of intermediate column analysis isn't 100% clear. To truly understand I think I need some example code showing how this would be used. Until then I can't seem to quite follow the proposals (maybe because I am too stuck in previous patterns I have seen)
   
   Here is what I propose:
   
   1. We get some sort of specific usecase for the intermediate column range expressions
   2. If we can't get that maybe we abandon the multi-column flexibility initially and work on getting something in with focus on specific columns and see where multi column would have helped


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1306760018

   And for a simple expressions like a + 100 < 20, can we safely derive the boundary of a [min,  -80) ?  Probably not, considering a + 100 might overflow the type's boundary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1290753951

   Another interesting point that this PR might need is probably a specification / design document, @alamb what would work best for you (I can try to add a couple more example analysis, like `a + b`; or maybe a specification like the ones in here https://arrow.apache.org/datafusion/contributor-guide/specification/index.html) to push this forward?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3912: [wip]: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1300770019

   A couple of questions:
   - Is `apply()` a good name? I thought about it a lot, but I am still not sure. Maybe something like `update_context`.
   - Should I still emphasize this is an experimental API (I think we should?)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1012217058


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -665,87 +656,81 @@ impl PhysicalExpr for BinaryExpr {
             children[1].clone(),
         )))
     }
-}
-
-impl PartialEq<dyn Any> for BinaryExpr {
-    fn eq(&self, other: &dyn Any) -> bool {
-        down_cast_any_ref(other)
-            .downcast_ref::<Self>()
-            .map(|x| self.left.eq(&x.left) && self.op == x.op && self.right.eq(&x.right))
-            .unwrap_or(false)
-    }
-}
-
-struct BinaryExprStats {
-    op: Operator,
-    left: Arc<dyn PhysicalExpr>,
-    right: Arc<dyn PhysicalExpr>,
-}
 
-impl PhysicalExprStats for BinaryExprStats {
-    fn boundaries(&self, columns: &[ColumnStatistics]) -> Option<ExprBoundaries> {
+    /// Return the boundaries of this binary expression's result. If the expression itself
+    /// is a comparison which changes the boundaries of one of its inputs (a = 20 would pin
+    /// a to 20), then it might update the input's boundaries directly on the context.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {

Review Comment:
   Since AnalysisContext is in terms of the "current schema" it will likely almost always end up being modified if the execution plan modifies the schema (which most do).
   
   I find that the mix of "input" and "output" of a `mut &AnalysisContext` to be confusing to explain.
   
   
   Since an `AnalysisContext` has `ExprBoundaries` in it anyways, I recommend the signature be
   
   
   ```rust
   fn analyze(&self) -> AnalysisContext {
   ...
   }
   ```
   
   And then pass each input can produce the context / boundary analysis for itself and the parent operators can combine them
   
   Given that the inputs will produced an owned AnalysisContext it will be clear what is input and what is output and I think it can be made quite fast in Rust



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -665,87 +656,81 @@ impl PhysicalExpr for BinaryExpr {
             children[1].clone(),
         )))
     }
-}
-
-impl PartialEq<dyn Any> for BinaryExpr {
-    fn eq(&self, other: &dyn Any) -> bool {
-        down_cast_any_ref(other)
-            .downcast_ref::<Self>()
-            .map(|x| self.left.eq(&x.left) && self.op == x.op && self.right.eq(&x.right))
-            .unwrap_or(false)
-    }
-}
-
-struct BinaryExprStats {
-    op: Operator,
-    left: Arc<dyn PhysicalExpr>,
-    right: Arc<dyn PhysicalExpr>,
-}
 
-impl PhysicalExprStats for BinaryExprStats {
-    fn boundaries(&self, columns: &[ColumnStatistics]) -> Option<ExprBoundaries> {
+    /// Return the boundaries of this binary expression's result. If the expression itself
+    /// is a comparison which changes the boundaries of one of its inputs (a = 20 would pin
+    /// a to 20), then it might update the input's boundaries directly on the context.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {

Review Comment:
   Maybe AnalysisContext could be expanded to be:
   
   ```rust
   /// A context for collecting and aggregating known boundaries of an expression
   /// tree.
   #[derive(Clone, Debug, PartialEq)]
   pub struct AnalysisContext {
       /// the boundaries of this particular PhysicalExpr's output
       pub boundaries: ExprBoundaries,
   
       /// A list of known column boundaries, ordered by column index
       /// in the current schema.
       pub column_boundaries: Vec<Option<ExprBoundaries>>,
   }
   ```
   
   As I have mentioned before, I don't fully understand specially how `column_boundaries` can be used, but I am interested in seeing an example



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1303148451

   Another example is :
   a in (1, 3, 5,  7, .....)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1014234660


##########
datafusion/physical-expr/src/expressions/column.rs:
##########
@@ -107,6 +102,18 @@ impl PhysicalExpr for Column {
     ) -> Result<Arc<dyn PhysicalExpr>> {
         Ok(self)
     }
+
+    /// Return the boundaries of this column, if known.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {

Review Comment:
   I think having `boundaries()` and `apply_boundaries` makes it consistent. Will change, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3912: Expression boundary analysis framework

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#issuecomment-1309069523

   Thanks everyone for working with me on this while we iterate it on a few times! Hopefully the upcoming PRs would justify the APIs even more and opens up more use cases. If anyone is interested, I plan to add a few tickets for the parts that I am planning to work on (esp on the composite expressions) and if there is a special topic that interests you please let me know on #3898 so we can sync up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org