Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/03 19:45:55 UTC

[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: Expression boundary analysis framework

isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1013331395


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -665,87 +656,81 @@ impl PhysicalExpr for BinaryExpr {
             children[1].clone(),
         )))
     }
-}
-
-impl PartialEq<dyn Any> for BinaryExpr {
-    fn eq(&self, other: &dyn Any) -> bool {
-        down_cast_any_ref(other)
-            .downcast_ref::<Self>()
-            .map(|x| self.left.eq(&x.left) && self.op == x.op && self.right.eq(&x.right))
-            .unwrap_or(false)
-    }
-}
-
-struct BinaryExprStats {
-    op: Operator,
-    left: Arc<dyn PhysicalExpr>,
-    right: Arc<dyn PhysicalExpr>,
-}
 
-impl PhysicalExprStats for BinaryExprStats {
-    fn boundaries(&self, columns: &[ColumnStatistics]) -> Option<ExprBoundaries> {
+    /// Return the boundaries of this binary expression's result. If the expression itself
+    /// is a comparison which changes the boundaries of one of its inputs (a = 20 would pin
+    /// a to 20), then it might update the input's boundaries directly on the context.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {

Review Comment:
   > Since AnalysisContext is in terms of the "current schema" it will likely almost always end up being modified if the execution plan modifies the schema (which most do).
   
   You can think about it as if it were the `Statistics` for that execution plan. It will be generated in the same place and constructed from the existing statistics (at least in `FilterExec`).
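
   As a self-contained illustration, here is roughly how that seeding could look. The types below are simplified stand-ins for this PR's `AnalysisContext`, `ExprBoundaries`, and `ColumnStatistics` (integer-only for brevity), and `from_statistics` is a hypothetical constructor, not the exact API:

   ```rust
   // Simplified stand-ins for the PR's types; not the actual DataFusion
   // definitions, just enough to show the flow.

   #[derive(Debug, Clone)]
   struct ColumnStatistics {
       min_value: Option<i64>,
       max_value: Option<i64>,
   }

   #[derive(Debug, Clone)]
   struct ExprBoundaries {
       min_value: i64,
       max_value: i64,
   }

   #[derive(Debug)]
   struct AnalysisContext {
       // One entry per column of the current schema; `None` means unknown.
       column_boundaries: Vec<Option<ExprBoundaries>>,
   }

   impl AnalysisContext {
       // Hypothetical constructor: seed the context from the input's
       // statistics, the way `FilterExec` would before analyzing its
       // predicate.
       fn from_statistics(columns: &[ColumnStatistics]) -> Self {
           let column_boundaries = columns
               .iter()
               .map(|c| match (c.min_value, c.max_value) {
                   (Some(min), Some(max)) => Some(ExprBoundaries {
                       min_value: min,
                       max_value: max,
                   }),
                   _ => None,
               })
               .collect();
           Self { column_boundaries }
       }
   }

   fn main() {
       // Column `a` with known statistics [1, 100].
       let stats = vec![ColumnStatistics {
           min_value: Some(1),
           max_value: Some(100),
       }];
       let ctx = AnalysisContext::from_statistics(&stats);
       println!("{:?}", ctx.column_boundaries);
       // [Some(ExprBoundaries { min_value: 1, max_value: 100 })]
   }
   ```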
   
   > Since an AnalysisContext has ExprBoundaries in it anyways, I recommend the signature be
   > fn analyze(&self) -> AnalysisContext {
   
   If I understood it correctly, the problem with this approach is that the expression won't be able to see the existing boundaries of any columns. The place that wants to do the analysis needs to pass the column information in; otherwise the analysis is not going to be very useful (unless the expression is self-sufficient, like `1 > 2`). All the children also need to receive a context, but depending on the parent (an `OR`, for example) it might be a different one than the one we received initially.
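
   A rough sketch of what I mean, again with simplified stand-in types (`Column::analyze` here is illustrative, not the PR's exact definition): a bare column reference has nothing to report unless the caller has already put that column's boundaries into the context:

   ```rust
   #[derive(Debug, Clone)]
   struct ExprBoundaries {
       min_value: i64,
       max_value: i64,
   }

   struct AnalysisContext {
       column_boundaries: Vec<Option<ExprBoundaries>>,
   }

   struct Column {
       index: usize,
   }

   impl Column {
       // A bare column reference cannot know its own range; it can only
       // report whatever the caller already put into the context.
       fn analyze(&self, ctx: &mut AnalysisContext) -> Option<ExprBoundaries> {
           ctx.column_boundaries.get(self.index)?.clone()
       }
   }

   fn main() {
       let mut ctx = AnalysisContext {
           column_boundaries: vec![Some(ExprBoundaries {
               min_value: 1,
               max_value: 100,
           })],
       };
       let a = Column { index: 0 };
       println!("{:?}", a.analyze(&mut ctx)); // Some(ExprBoundaries { .. })

       // Without any column information, there is nothing to report.
       let mut empty = AnalysisContext { column_boundaries: vec![None] };
       println!("{:?}", a.analyze(&mut empty)); // None
   }
   ```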
   
   > As I have mentioned before, I don't fully understand specially how column_boundaries can be used, but I am interested in seeing an example
   
   The only 'unknowns' during the analysis process are the column references; everything else can be inferred from the behaviour of the particular expression itself. So we need a way of conveying that information to each sub-expression when doing the analysis. I initially started with a vector of column statistics, but certain expressions need to change it mid-analysis: in `a > 20 AND a < 70`, for example, `a` has different boundaries by the time we evaluate the second part of the conjunction (`a < 70`). That is why I went with a mutable/shared context.
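
   To make that concrete, here is a self-contained sketch of the conjunction case. The `analyze_gt`/`analyze_lt` helpers are invented for illustration (the real logic lives in `BinaryExpr::analyze`), and the boundaries are integer-only for brevity:

   ```rust
   #[derive(Debug, Clone)]
   struct ExprBoundaries {
       min_value: i64,
       max_value: i64,
   }

   struct AnalysisContext {
       column_boundaries: Vec<Option<ExprBoundaries>>,
   }

   // `col > value` tightens the column's lower bound in the shared
   // context, so later conjuncts see the updated range.
   fn analyze_gt(ctx: &mut AnalysisContext, col: usize, value: i64) {
       if let Some(Some(b)) = ctx.column_boundaries.get_mut(col) {
           b.min_value = b.min_value.max(value + 1);
       }
   }

   // `col < value` tightens the upper bound the same way.
   fn analyze_lt(ctx: &mut AnalysisContext, col: usize, value: i64) {
       if let Some(Some(b)) = ctx.column_boundaries.get_mut(col) {
           b.max_value = b.max_value.min(value - 1);
       }
   }

   fn main() {
       // Column `a` starts with statistics-derived boundaries [1, 100].
       let mut ctx = AnalysisContext {
           column_boundaries: vec![Some(ExprBoundaries {
               min_value: 1,
               max_value: 100,
           })],
       };
       analyze_gt(&mut ctx, 0, 20); // a > 20  ->  a in [21, 100]
       analyze_lt(&mut ctx, 0, 70); // a < 70  ->  a in [21, 69]
       println!("{:?}", ctx.column_boundaries[0]);
       // Some(ExprBoundaries { min_value: 21, max_value: 69 })
   }
   ```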
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org