You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/10 09:32:07 UTC

[GitHub] [arrow-datafusion] mingmwang opened a new issue #1972: DataFusion Optimizer framework discussion

mingmwang opened a new issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Current DataFusion's optimizer rules are complex and not easy to implement. I think one of the reason is each optimization rule has to deal with the plan tree traversing, pattern matching and rewriting. It is more a visitor model (like the old Presto optimizer rules).
And if we add new logical plan nodes, especially non-leaf plan nodes, it is very possible that all the rules need to be modified to add the related visit logic.

An alternative approach is we can have the Optimizer itself drive the plan tree traversing and apply the rules down/up the tree, it will make the rule logic much clear and can focus on pattern matching and rewriting.

I see there is a new experimental optimizer framework which leverages the egg lib.
[https://github.com/apache/arrow-datafusion/issues/440](url)
I'm not familiar with egg and not sure how mature the egg lib is.

Please share your thoughts.

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1070462090


   > > I believe ithat we really should continue experimenting with egg as a optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.
   > 
   > I agree, egg is potential. 
   > Seems egg will generate DAGs than tree?  @Dandandan 
   
   The current version extended by @pjmore lives here (tokomak directory): https://github.com/pjmore/arrow-datafusion/tree/tokomak-optimizer
   
   I feel like we should move it to datafusion-contrib for better exposure and continue the work there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1064152946


   I was imagining an actual mutator like this:
   
   ```rust
   trait LogicalPlanRewriter {
      pub fn rewrite(plan: LogicalPlan) -> Result<LogicalPlan>
   }
   ```
   
   that would handle the traversal into children so that each optimizer would look something like
   
   ```rust
   impl LogicalPlanRewriter for MyOptimizerPass {
     fn rewrite(plan: LogicalPlan) -> Result<LogicalPlan> {
       match plan {
         // special handling for the types of plans the optimizer cares about
         LogicalPlan::Filter(..) => {...}
         ...
         // default is no rewrite
         _ => Ok(plan)
      }
   }
   ```
   
   I vaguely remember trying to write something like this up once, but as I recall I got stuck on the fact that `LogicalPlans` can have shared inputs (there are `Arcs` ?) but I can't remember exactly and it was a while ago. 
   
   
   
   A slight variant on the above that looks more like a typical vistitor would be something that handled each type of LogicalPlan and defaults to itself and could save the switch statement:
   
   
   ```rust
   trait LogicalPlanRewriter {
      // rewrite a LogicalPlan filter node; Default implementation returns the same plan
      pub fn rewrite_filter(filter: Filter) -> Result<LogicalPlan> {
       Ok(LogicalPlan::Filter(filter)
      } 
   
      // rewrite a LogicalPlan table_scan node; Default implementation returns the same plan
      pub fn rewrite_filter(table_scan: TableScan) -> Result<LogicalPlan> {
       Ok(LogicalPlan::TableScan(table_scan)
      } 
   
      ...
   }
   ```
   
   What do you think?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1078116201


   > But this seems like such an obvious solution that I'm positive that there are probably a number of fundamental issues with it beyond the of the selectivity being made into a N-dimensional distribution where N is the number of predicates.
   
   The typical problem here is skewed data (e.g. one value having 50% of the rows) -- and do you pick equi width or equi height histograms, which both have tradeoffs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069557904


   I believe ithat we really should continue experimenting with egg as a optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.
   
   Some nice things that make `egg` (besides written in Rust) a nice candidate for DataFusion:
   
   * Optimized algorithm hard to match with manual written optimization passes
   * Easier and less verbose to add simple rules
   * Plugin framework to add slightly more complex optimizations
   * Does not depend on rule order and combined with being able to apply multiple rules in one pass and until convergence, can optimize further than currently is possible (or what would be possible using a optimization strategy like Apache Spark)
   * Cost-based optimization easy to add (this is a native feature)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1070462090


   > > I believe ithat we really should continue experimenting with egg as a optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.
   > 
   > I agree, egg is potential. 
   > Seems egg will generate DAGs than tree?  @Dandandan 
   
   The current version extended by @pjmore lives here (tokomak directory). 
   
   I feel like we should move it to datafusion-contrib for better exposure and continue the work there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1077912042


   One thing I think is missing in this discussion is that the cost models for plans are notoriously unreliable (especially one a plan has like 4+ joins in it) and suffer from well known, but unsolved problems like predicate correlation, data skew, and others. 
   
   So in other words, even if we had some super advanced optimization framework that could find the "lowest cost plan" it may turn out to be a very poor plan in practice (I spent many years working on an industrial query optimizer -- link below for some flavor). Real optimizers tend to have heuristics to constrain the search space and try (very) hard to avoid disaster plans / join orders. 
   
   Applying more optimizations at the physical level makes sense to me
   
   https://ieeexplore.ieee.org/document/6816727/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] xudong963 commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

xudong963 commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069850427


   > > > I'm planning to add the `volcano/cascades optimizer` framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.
   > > 
   > > 
   > > Cascades is an optimization framework that uses a cost-based approach to explore possible executable of the search space. So I have some my thoughts:
   > > 
   > > 1. It requires a solid and as accurate as possible cost-based optimization.
   > > 2. It is likely to lead to misguidance, which produces local convergence of the search and fails to produce an optimal solution.
   > > 3. It will have a relatively large impact on the current codebase. (Of course, if it can be verified that there is a clear benefit in most scenarios, I support its introduction)
   > > 4. I prefer to continue experimenting with egg, @pjmore has done many works [Add experimental Tokomak optimizer #1485](https://github.com/apache/arrow-datafusion/pull/1485)
   > 
   > I just mean that add a like top to down optimizer instead speacify optimizer
   
   I know -- volcano/cascades optimizer framework both are top-down optimizers, my thoughts covered both.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] matthewmturner commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1070930527


   I will add that repo to datafusion contrib announcement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1064143265


   Thank you for bringing this up @mingmwang. My view is that the current optimizer framework could definitely use improving 👍 
   
   > t is more a visitor model (like the old Presto optimizer rules).
   
   I think I would say the current optimizer rules "aspire" to be a visitor model; Right now the interface for an optimizer looks like this:
   
   ```rust
   pub trait OptimizerRule {
       /// Rewrite `plan` to an optimized form
       fn optimize(
           &self,
           plan: &LogicalPlan,
           execution_props: &ExecutionProps,
       ) -> Result<LogicalPlan>;
   
       /// A human readable name for this optimizer rule
       fn name(&self) -> &str;
   }
   ```
   which as you say requires each optimizer rule to handle the recursion itself and intermixes the traversal from whatever rewrites are happening. There are helpers in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/utils.rs such as `optimize_children` and `from_plan` but I would say they are not particularly ergonomic
   
   So what I had hoped to do at some point was to make an actual visitor / mutator pattern for rewriting `LogicalPlan`. We have done this for `Expr` rewriting  [expr_rewriter.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr_rewriter.rs) and I think it works very well to separate out the structure traversal from the actual changes (see, for example, the code for simplifying expressions here: [simplify_expressions.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/simplify_expressions.rs)
   
   
   cc @Dandandan and @realno   who may also have idea on this
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jackwener edited a comment on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

jackwener edited a comment on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069339073


   > which as you say requires each optimizer rule to handle the recursion itself and intermixes the traversal from whatever rewrites are happening. There are helpers in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/utils.rs such as optimize_children and from_plan but I would say they are not particularly ergonomic
   >
   > So what I had hoped to do at some point was to make an actual visitor / mutator pattern for rewriting LogicalPlan. We have done this for Expr rewriting [expr_rewriter.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr_rewriter.rs) and I think it works very well to separate out the structure traversal from the actual changes (see, for example, the code for simplifying expressions here: [simplify_expressions.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/simplify_expressions.rs)
   
   I also found these problems when I enhance the optimizer.
   
   It's a significant proposal, I'm focus this part currently. 
   
   I'm planning to add the `volcano/cascades optimizer` framework  for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1070483207


   Love this discussion ❤️ 
   
   I really like the idea of being able to use the DataFusion optimizer logic in a general way. It would be awesome to use DataFusion with various different optimization approaches, depending on usecase.
   
   I think if a solution can handle the existing optimizer passes it is likely to be general enough for all future optimization passes. 
   
   A required step for any of these solutions, I think, is to more carefully separate the plan traversal from the existing plan logic, so that seems like an obvious first step to take in the DataFusion code base. trying to move egg based optimization to datafusion-contrib as suggested by @Dandandan  sounds like a great idea, and we can use that experience to iterate on the best interface in DataFusion 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jackwener commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

jackwener commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069339073


   > which as you say requires each optimizer rule to handle the recursion itself and intermixes the traversal from whatever rewrites are happening. There are helpers in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/utils.rs such as optimize_children and from_plan but I would say they are not particularly ergonomic
   >
   > So what I had hoped to do at some point was to make an actual visitor / mutator pattern for rewriting LogicalPlan. We have done this for Expr rewriting [expr_rewriter.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr_rewriter.rs) and I think it works very well to separate out the structure traversal from the actual changes (see, for example, the code for simplifying expressions here: [simplify_expressions.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/simplify_expressions.rs)
   
   I also found these problems when I enhance the optimizer.
   
   It's a significant proposal, I'm focus this part currently. 
   
   I'm planning to add the `volcano/cascades optimizer` framework  for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] xudong963 edited a comment on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

xudong963 edited a comment on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069816879


   > I'm planning to add the `volcano/cascades optimizer` framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.
   
   Cascades is an optimization framework that uses a cost-based approach to explore possible executable
   of the search space. So I have some my thoughts:
   1. It requires a solid and as accurate as possible cost-based optimization.
   2. It is likely to lead to misguidance, which produces local convergence of the search and fails to produce an optimal solution.
   3. It will have a relatively large impact on the current codebase. (Of course, if it can be verified that there is a clear benefit in most scenarios, I support its introduction)
   4. I prefer to continue experimenting with egg, @pjmore has done many works https://github.com/apache/arrow-datafusion/pull/1485, fyi https://github.com/apache/arrow-datafusion/issues/440


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1077912042


   One thing I think is missing in this discussion is that the cost models for plans are notoriously unreliable (especially once a plan has like 4+ joins in it) and suffer from well known, but unsolved problems like predicate correlation, data skew, and others. 
   
   So in other words, even if we had some super advanced optimization framework that could find the "lowest cost plan" it may turn out to be a very poor plan in practice (I spent many years working on an industrial query optimizer -- link below for some flavor). Real optimizers tend to have heuristics to constrain the search space and try (very) hard to avoid disaster plans / join orders. 
   
   Applying more optimizations at the physical level makes sense to me
   
   https://ieeexplore.ieee.org/document/6816727/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] xudong963 commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

xudong963 commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069816879


   > I'm planning to add the `volcano/cascades optimizer` framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.
   
   Cascades is an optimization framework that uses a cost-based approach to explore possible executable
   of the search space. So I have some my thoughts:
   1. It requires a solid and as accurate as possible cost-based optimization.
   2. It is likely to lead to misguidance, which produces local convergence of the search and fails to produce an optimal solution.
   3. It will have a relatively large impact on the current codebase. (Of course, if it can be verified that there is a clear benefit in most scenarios, I support its introduction)
   4. I prefer to continue experimenting with egg, @pjmore has done many works https://github.com/apache/arrow-datafusion/pull/1485


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] xudong963 commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

xudong963 commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069805397


   > I believe ithat we really should continue experimenting with egg as a optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.
   
   I agree, egg is potential. 
   Seems egg will generate DAGs than tree?  @Dandandan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jackwener edited a comment on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

jackwener edited a comment on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069822792


   I,m sorry that I don,t know more about agg, I also try it to do something for datafusion
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1078112726


   I would love to try and avoid  such issues by moving as many choices as possible to runtime. Specifically for Join implementation choice I would love to try to implement something like this
   
   https://www.semanticscholar.org/paper/A-Generalized-Join-Algorithm-Graefe/2df1e7f01621a21061c5ef44e9cca7f0cfdbb332
   
   I don't have a great story for join ordering (though I am working on a potential paper on the topic of more dynamic join orderings)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jackwener commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

jackwener commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069823670


   @Dandandan hi, about agg, there is some design doc or some discussion about it？ I want to know more about it in order to contirbute for enhancing our optimizer


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jackwener commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

jackwener commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069822792


   I,m sorry that I don,t know more about agg, I also have try it to do something for datafusion
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] matthewmturner commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1070930527


   I will add that repo to datafusion contrib announcement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

realno commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069744802


   I definitely agree the optimizer framework needs some attention. Specifically, each rule needs to handle tree traversal which is quite tedious and also less efficient due to the repeated full tree traversal. Also it is error prone, if not careful it may introduce dependency on the order of the rules in code. 
   
   I am leaning towards deciding which framework to go with asap for the following reasons: 1. we have limited rules for now so it is small effort to change it now. 2. simplifying rule implementation can encourage more people contributing in this area. 3. this may also help benchmarking and optimization work.
   
   I saw a few options were brought up, we can make a list for ones worth investigating.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] pjmore commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

pjmore commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1077041378

TLDR: egg optimizer struggles with some things which have many representations and suffers due to poor rewrite scheduling. Has some nice properties, likely able to implement cascades based framework on top of egg. Egg might be better suited to planner/physical operating like cascades does, lowering the logical plan into physical operators.

I've been researching cascades a bit recently and here are some general thoughts on optimizer stuff. This is kind of off the top of my head with regards to the egg stuff so it might be disorganized since I haven't thought about it for a bit.

1. The current egg based optimizer works alright, but suffers heavily in certain situations due to rule scheduling issues/the number of potential queries exploding. Structured rule application like cascade does might help with that.
2. In a related issue egg does not handle large expressions/trees well due to the large number of potential representations. This is apparent when the filter predicate in TPCH 19 which is a fairly deep and wide expression. In release mode it can take ~5 only focused on optimizing the filter to properly transform the expression from (A OR B) OR (A OR C) OR (A OR D) -> A AND (B OR C OR D). Egg does not record which rules only need to be applied once which can cause a rule like A OR B -> B OR A to be run on already processed nodes doing nothing. This is mostly an issue when rules that are going to match a lot and have to match over a large number of equivalent representations. Extremely commonly a2. In a related issue egg does not handle large expressions/trees well due to the large number of potential representations. This is apparent when the filter predicate in TPCH 19 which is a fairly deep and wide expression. In release mode it can take ~5 only focused on optimizing the filt
er to properly transform the expression from (A OR B) OR (A OR C) OR (A OR D) -> A AND (B OR C OR D). Egg does not record which rules only need to be applied once which can cause a rule like A OR B -> B OR A to be run on already processed nodes doing nothing. This is mostly an issue when rules that are going to match a lot and have to match over a large number of equivalent representations. Extremely commonly applied rules that are almost always positive such the boolean simplification in TPCH 19 should probably exist as specialized optimization rules as they exist in their current form as they are likely to be faster. pplied rules that are almost always positive such the boolean simplification in TPCH 19 should probably exist as specialized optimization rules as they exist in their current form as they are likely to be faster.
3. Egg provides a lot of the properties that are desirable for a query optimizer, namely it already has a fast memo data structure and provides cost memoized cost function calculation. I'm unsure how well something like join order could would run within egg though.
4. Egg based optimizer might be better suited to physical optimization or as a planner where table statistics can be considered.
5. I think that the egg based optimizer and cascades actually a number of similarities, namely considering the memo datastructure that cascades uses.

Here is an image from [tikdb's article on their cascade optimizer](https://developpaper.com/unveiling-tidb-new-optimizer-analysis-of-cascade-planner-principle) that demonstrates general idea of a memo data structure.

![memo explaination](https://imgs.developpaper.com/imgs/2310010508-c90cfa59d4bf9972_articlex.jpg)

And from [egg's website](https://egraphs-good.github.io/), this image doesn't show up well on dark themed github so check it out on the website if you can't see it properly.

![Egg's demo](https://egraphs-good.github.io/egraphs.svg)

The dotted lines around expression are effectively the same thing as cascade's groups. I think egg might be slightly better for this as groups are automatically merged together if they share equivalent terms, I'm not sure if this is something that is typically done in cascades based optimizer.

Another [example from the CMU advanced database course]( https://15721.courses.cs.cmu.edu/spring2017/slides/15-optimizer2.pdf). The relevant slides are around slide 34.

You can see that an egraph, which egg uses, and the memo that cascade uses have a bunch of similar properties.
- Single node can be represented in many different ways.
- Parent <-> child relation ship is loosely coupled. Parent has group as child which can have multiple different representations.

I think that it is possible to build a cascade optimizer using egg as a fast memo structure. If we are going to build a more advanced optimization framework I think it makes sense to focus a bit more on the physical optimizations as nearly all of the logical optimizations could be done on the physical expressions instead, and optimizations like join order, which require table statistics/sampling, can cause order of magnitudes of difference in query performance. I think if this is a route that people are interested in pursuing this makes sense to develop a egg based planner which lowers the logical plan into a physical one, but I'd love to hear other people's thoughts on this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1064154803


   If I were going to do this, I would probably sketch out a PlanRewriter and then try to port one or two of the existing optimizers to use it (without changing the Optimizer trait at first) as a proof of concept. Then as follow on PRs I would port over the remaining optimizers


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1064143265


   Thank you for bringing this up @mingmwang. My view is that the current optimizer framework could definitely use improving 👍 
   
   > t is more a visitor model (like the old Presto optimizer rules).
   
   I think I would say the current optimizer rules "aspire" to be a visitor model; Right now the interface for an optimizer looks like this:
   
   ```rust
   pub trait OptimizerRule {
       /// Rewrite `plan` to an optimized form
       fn optimize(
           &self,
           plan: &LogicalPlan,
           execution_props: &ExecutionProps,
       ) -> Result<LogicalPlan>;
   
       /// A human readable name for this optimizer rule
       fn name(&self) -> &str;
   }
   ```
   which as you say requires each optimizer rule to handle the recursion itself. There are helpers in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/utils.rs such as `optimize_children` and `from_plan` but I would say they are not particularly ergonomic
   
   So what I had hoped to do at some point was to make an actual visitor / mutator pattern for rewriting `LogicalPlan`. We have done this for `Expr` rewriting  [expr_rewriter.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr_rewriter.rs) and I think it works very well to separate out the structure traversal from the actual changes (see, for example, the code for simplifying expressions here: [simplify_expressions.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/simplify_expressions.rs)
   
   
   cc @Dandandan and @realno   who may also have idea on this
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jackwener commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

jackwener commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069349269


   But, It's a big change, we can do it in the long term.
   
   Currently, I'm planning enhance the optimizer gradually,


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jackwener commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

jackwener commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1069824463


   > > I'm planning to add the `volcano/cascades optimizer` framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.
   > 
   > Cascades is an optimization framework that uses a cost-based approach to explore possible executable of the search space. So I have some my thoughts:
   > 
   > 1. It requires a solid and as accurate as possible cost-based optimization.
   > 2. It is likely to lead to misguidance, which produces local convergence of the search and fails to produce an optimal solution.
   > 3. It will have a relatively large impact on the current codebase. (Of course, if it can be verified that there is a clear benefit in most scenarios, I support its introduction)
   > 4. I prefer to continue experimenting with egg, @pjmore has done many works [Add experimental Tokomak optimizer #1485](https://github.com/apache/arrow-datafusion/pull/1485)
   
   I just mean that add a like top to down optimizer instead speacify optimizer


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mingmwang commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

mingmwang commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1063850579


   @alamb @andygrove Please share your thoughts and insights.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1070526699


   I pushed the latest version of @pjmore to
   https://github.com/datafusion-contrib/datafusion-tokomak
   
   note that this includes the full datafusion source tree, as some changes in there are required too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] crepererum commented on issue #1972: DataFusion Optimizer framework discussion

Posted by GitBox <gi...@apache.org>.

crepererum commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1078838974


   # Egg
   @pjmore great write up, thank you. I have a few notes on your observations regarding egg:
   
   1. It might be worth sharing them with the egg developers, they have a [quite active discussion forum](https://github.com/egraphs-good/egg/discussions) and seem to be quite eager to hear what people build using egg (or where they fail)
   2. I think the explosion of graph sizes can be handled somewhat by tuning the cost function, by match guards (like the ones that are also used to to prevent mis-optimization of division by zero) or by pruning (their math examples uses [this trick](https://github.com/egraphs-good/egg/blob/f1238e3a95c7e4e65ce01384f2f88198d26657f9/tests/math.rs#L107-L108))
   3. The `(A OR B) -> (B OR A)` rule you bring up is actually an interesting one. When experimenting with egg I found that these symmetry rules are really helpful to write a maintainable, short, and easy to understand rules set. Imagine the rule `(A OR B) AND (A OR C) -> A OR (B AND C)`. This rule would only work properly if you combine it with the symmetry rule or you end up enumerating all "permutations" of the simplification rule. Now we might conclude that the symmetry rule should only be applied once or not on processed nodes, however for nesting it's quite important to be able to re-apply it to new nodes as well.
   4. I think the egg `Analysis` trait is extremely powerful, even a bit too complex since it can do multiple things like analysis (which then can be used by rules to prevent optimizations, see [this example](https://github.com/egraphs-good/egg/blob/f1238e3a95c7e4e65ce01384f2f88198d26657f9/tests/math.rs#L144-L153)), complex optimization passes (e.g. constant folding) and pruning.
   
   # Multi-objective optimization
   Another note multi-objective optimizations:
   
   From experience as well as mid-level understanding of optimization in general, having multiple optimizations targets is always tricky (but depending on the use case not unsolvable).
   
   If you really wanna keep the goals separate, then comparing solutions to your issue can only be done by strict comparison (e.g. plan A is better than B if it has lower cast AND risk). More technically, you stick to the [Pareto Front](https://en.wikipedia.org/wiki/Pareto_front).
   
   However this often steals you quite some potential, since massive improvements in one metric might result in slight disadvantages in another metric (e.g. you might be able shave of 90% cost by increasing the risk by 5%). Now most people would intuitively decide to take the risk here. So how do you balance the two metrics? What's often done is a linear combination (like `total = A * risk + B * cost`, `A` and `B` being tuning parameters). This obviously has its own issue (like all metrics must "scale" in the same linear way, tuning parameters are tricky to get right).
   
   Also see [Multiple-criteria decision-making](https://en.wikipedia.org/wiki/Multiple-criteria_decision_analysis).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org