You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/09 18:47:52 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #5535: Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis)

alamb opened a new issue, #5535:
URL: https://github.com/apache/arrow-datafusion/issues/5535

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   At the time of writing , DataFusion has **three** partially complete range/interval analysis implementations:
   
   ## Range analysis used in cardinality estimation 
   https://docs.rs/datafusion-physical-expr/18.0.0/datafusion_physical_expr/trait.PhysicalExpr.html#method.analyze
   https://docs.rs/datafusion-physical-expr/18.0.0/datafusion_physical_expr/struct.AnalysisContext.html
   (cc @isidentical)
   
   ## Pruning Predicate
   https://docs.rs/datafusion/18.0.0/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html 
   (creates an expr and then evaluates it against statistics)
   
   # Intervals
   https://github.com/apache/arrow-datafusion/blob/6a8c4a60114c7132de6d2fcd6d44111cf40b3df9/datafusion/physical-expr/src/intervals/mod.rs#L24
   Introduced by @metesynnada  and @ozankabak  in https://github.com/apache/arrow-datafusion/pull/5322
   
   
   The modules are all structured a differently / have different implementations but what they are computing (interval / range analysis) is basically the same and they all operate on `PhysicalExpr`s. 
   
   Since there are three implementations, as we add new PhysicalExprs either the work must be triplicated or else (more likely) the implementations will keep diverging in functionality. 
   
   In addition it is hard to increase the scope of coverage when the logic is in different places
   
   **Describe the solution you'd like**
   I would like a single implementation of range analysis used for all three uses in DataFusion. The one introduced by @metesynnada  and @ozankabak in https://github.com/apache/arrow-datafusion/pull/5322 seems to have the best documentation and theoretical foundation, but the other two are more functionally complete.
   
   I don't have a preference for which approach is taken, other than that we end up with a single implementation that is easier to maintain
   
   
   **Describe alternatives you've considered**
   <!--
   A clear and concise description of any alternative solutions or features you've considered.
   -->
   
   **Additional context**
   
   I believe @ozankabak  said they may have plans in this area (https://github.com/apache/arrow-datafusion/pull/5322#issuecomment-1441990911) but I am not sure 
   
   cc @mingmwang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5535: Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5535:
URL: https://github.com/apache/arrow-datafusion/issues/5535#issuecomment-1497623497

   > @alamb I started to work on this, do you have any suggestions on PruningPredicate?
   
   I would personally suggest breaking this project into a few parts:
   1.  consolidating the first two (not the pruning predicate).
   2. consolidating pruning predicate
   
   Doing PruningPredicate needs some thought -- perhaps an incremental approach like figure out how it will work for simple predicates (a < 5) or whatever, and get the basics working. 
   
   Then we can do a series of PRs to port over the rest of the expressions (and tests)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on issue #5535: Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis)

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on issue #5535:
URL: https://github.com/apache/arrow-datafusion/issues/5535#issuecomment-1497598144

   @alamb I started to work on this, do you have any suggestions on `PruningPredicate`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #5535: Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5535: Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis)
URL: https://github.com/apache/arrow-datafusion/issues/5535


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on issue #5535: Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis)

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #5535:
URL: https://github.com/apache/arrow-datafusion/issues/5535#issuecomment-1462611417

   Thank you for opening an issue to track this. We are actively working on extending interval support and will unify the existing code along the way.
   
   FYI, our first step will be supporting more data types (e.g. timestamps, floating point). Then, we will add support for more operations (multiplication and division). There are other remaining steps after these, but at this point we can probably unify the code before adding the other features.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org