You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/20 00:32:05 UTC

[GitHub] [arrow-datafusion] isidentical opened a new issue, #3898: A framework for expression boundary analysis (and statistics)

isidentical opened a new issue, #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   TBD
   
   **Describe the solution you'd like**
   TBD
   
   **Describe alternatives you've considered**
   TBD
   
   **Additional context**
   Originating from #3868 (which implements a more rudimentary version of the proposed framework), with discussion points from @alamb on the possible unification/consolidation of pruning logic (as well as a gateway to more optimizations).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1286034135

   One benefit of keeping it inside `PhysicalExpr` is that then anyone who has an extension for `PhysicalExpr` can also provide their own expression analysis
   
   As long as they have reasonable default implementations (that return unknown ExprBounds) I think it would be fine


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] isidentical commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

isidentical commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1297566804

   If I didn't miss anything, that should be all and we should be ready to at least make put the foundational work in. @alamb does it make sense to revive my existing PR #3912 (fix conflicts, add a bit more documentation in the code, etc.) and continue working on this iteratively for the next steps? (like other expression types).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] isidentical commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

isidentical commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1297564476

A general overview of what happened during the discussions (in regards to the expression analysis part) and the meetup:

- Better value distributions rather than just knowing the `[min, max]`.

Although there isn't any concrete work, this was also part of the proposed 'future' section and we might be able to turn it into a reality once we have the general system ready (this is what Spark did IIRC, they initially implemented with basic ranges and then moved over with histograms when available).

- Support for comparing two expressions (`a > b`) with different ranges.

This can be supported with the existing framework, but actually needs an implementation 😇 Currently for binary expressions, we have a code path that is taken when we know either side has a scalar boundary (e.g. `a=[20, 20]; b=[10, 30]`) but we can introduce a second one very easily depending on how it would work 👀 Requires further research, but I don't think it needs to block this ticket (it can actually be a part of it).

- Not passing a mutable handle for the context but rather giving it up completely.

I'd be fine with this though just from a purely aesthetic point of view it looks a bit hard to parse 😄 I'm happy to be convinced though. Let's discuss it in the code review again!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] isidentical commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

isidentical commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1309437374

   As we have the foundations now, we should be able to close this. New tickets regarding filter selectivity analysis and the overall expression analysis will be tracked in the meta issue of #3929.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1297619059

   > . @alamb does it make sense to revive my existing PR https://github.com/apache/arrow-datafusion/pull/3912 (fix conflicts, add a bit more documentation in the code, etc.) 
   
   I think this makes sense. If possible, I would recommend trying to get the basic API sketched out / commented well and a few examples -- keeping the PR small is the key I think to getting as many eyes on it as possible. 
   
   Thank you for driving this forward 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] isidentical commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

isidentical commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1286012013

   > I love this idea @isidentical -- thank you for filing it -- 
   
   Thanks @alamb (and for all your feedback during the initial design) ❤️ 
   
   > I am still not quite sure about how well keeping column_boundaries would work in practice, but I think I would just have to see how it works in practice to find out.
   
   It also bothers me a bit, so maybe we can iterate on it to see if there is a simpler solution that can help us to solve the following problem without having `apply()` and `column_boundaries`. 
   
   ```rs
   let expr = parse("a >= 20");
   let mut context = Context { column_boundaries: [Boundary {min: 1, max: 100}] }; // this can be constructed from statistics as well
   let boundaries = expr.analyze(&mut context);
   assert!(context.column_boundaries[0].min == 20);
   ```
   
   If we want the condition above to succeed (considering we now know that `a`'s minimum value is `20`, due to `a >= 20`), we need a way of translating that information to column level (so that, if the next expression [in the same context] is `a < 30` we can localize it even further, etc.).
   
   The most simple solution that I can think of is actually checking whether `left` side is a column, and if so, changing the boundaries for it (`context.boundaries[left.index] = Boundary {min: 20, max: left.max}`) directly inside the filter selectivity analysis. But what can we do if the expression looks something like this: `a + 1 > 20`?
   
   This is where `apply()` comes into play. When the analysis for `a + 1 >= 20` is complete, we go through the following cycles:
   - (`a + 1 >= 20`'s `analyze`) `left.apply(Boundary {min: 20, max: left.max})`
     - (`a + 1`'s `apply`) `left.apply(Boundary {min: boundaries.min - 1, max: boundaries.max - 1})`
       - (`a`'s `apply`) `context.columns[self.index] =  boundaries`
    
    If left were a simple column, it will look the same:
    - (`a >= 20`'s `analyze`) `left.apply(Boundary {min: 20, max: left.max})`
      - (`a`'s `apply`) `context.columns[self.index] =  boundaries`
   
   And if any of the expressions in the next conjunction (or actually anything that shares the same context) references `a`, then it will just use the updated boundaries without doing anything. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1310167658

   Thank you for driving this forward @isidentical 🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1285974788

   I love this idea @isidentical  -- thank you for filing it -- I am still not quite sure about how well keeping `column_boundaries` would work in practice, but I think I would just have to see how it works in practice to find out. 
   
   I also wonder what "apply"ing boundaries to a `PhysicalExpr` means -- does it mean the output of the PhysicalExpr might change when run against some data? Does it mean the `PhysicalExpr` might change? Or is it solely to update the `AnalysisContext`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] isidentical commented on issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

isidentical commented on issue #3898:
URL: https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1286015059

   One thing I am not super sure of is whether we want to keep it inside the `PhysicalExpr` (with two new methods) or want to keep it in a separated protocol (so we would have a new trait, `PhysicalExprAnalyzer` and `PhysicalExpr` can then return the analyzer itself which would have apply/update methods). I am thinking that if it is only two methods, they might as well live inside the `PhysicalExpr` (which would then remove a level indirection when we are actually doing the analysis).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] isidentical closed issue #3898: A framework for expression boundary analysis (and statistics)

Posted by GitBox <gi...@apache.org>.

isidentical closed issue #3898: A framework for expression boundary analysis (and statistics)
URL: https://github.com/apache/arrow-datafusion/issues/3898


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org