Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/27 11:31:23 UTC

[GitHub] [arrow-datafusion] liurenjie1024 opened a new issue, #2633: Introducing a new optimizer framework for datafusion.

liurenjie1024 opened a new issue, #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633

   There has been some discussion about DataFusion's optimizer framework in #440 and #1972. I tried to build a framework based on DataFusion's expression system, with the following features:
   
   1. A heuristic optimizer that applies rules to the plan iteratively, either top-down or bottom-up. It keeps applying rules until the plan reaches a fixpoint or the maximum number of iterations is hit.
   2. A Cascades-style cost-based optimizer framework.
   3. Pattern matching and binding for rules. 
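   
   As a rough illustration of item 1, here is a minimal sketch of a bottom-up rule driver that stops at a fixpoint or after a maximum number of iterations. The `Plan` enum, the `Rule` alias, and the `optimize` and `merge_limits` functions are made up for this example; they are assumptions, not the framework's actual API.
   
   ```rust
   // Toy logical plan used only for this sketch (hypothetical, not the framework's types).
   #[derive(Debug, Clone, PartialEq)]
   enum Plan {
       Scan(String),
       Projection(Vec<String>, Box<Plan>),
       Limit(usize, Box<Plan>),
   }
   
   // A rule returns Some(rewritten) when it applies to the node, None otherwise.
   type Rule = fn(&Plan) -> Option<Plan>;
   
   // Example rule: Limit(a, Limit(b, x)) becomes Limit(min(a, b), x).
   fn merge_limits(plan: &Plan) -> Option<Plan> {
       if let Plan::Limit(outer, child) = plan {
           if let Plan::Limit(inner, grandchild) = child.as_ref() {
               return Some(Plan::Limit((*outer).min(*inner), grandchild.clone()));
           }
       }
       None
   }
   
   // One bottom-up pass: rewrite the inputs first, then try each rule on the node.
   fn apply_bottom_up(plan: &Plan, rules: &[Rule]) -> Plan {
       let rebuilt = match plan {
           Plan::Scan(_) => plan.clone(),
           Plan::Projection(cols, child) => {
               Plan::Projection(cols.clone(), Box::new(apply_bottom_up(child, rules)))
           }
           Plan::Limit(n, child) => Plan::Limit(*n, Box::new(apply_bottom_up(child, rules))),
       };
       rules
           .iter()
           .fold(rebuilt, |current, rule| rule(&current).unwrap_or(current))
   }
   
   // Repeat passes until nothing changes (fixpoint) or max_iterations is reached.
   fn optimize(plan: Plan, rules: &[Rule], max_iterations: usize) -> Plan {
       let mut current = plan;
       for _ in 0..max_iterations {
           let next = apply_bottom_up(&current, rules);
           if next == current {
               break; // fixpoint: a full pass made no change
           }
           current = next;
       }
       current
   }
   ```
   
   Calling `optimize(plan, &[merge_limits], 10)` on three nested limits over a scan would leave a single limit carrying the smallest value. A top-down variant would try the rules on a node before descending into its inputs.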
   
   The following is an example of a limit-removal rule, including the pattern definition and the rule implementation:
   ```
   // Match a limit whose direct input is also a limit, as the rule body below expects.
   pattern(|op| matches!(op, Logical(LogicalLimit(_))))
       .leaf(|op| matches!(op, Logical(LogicalLimit(_))))
       .finish()
   ```
   ```
   if let (Logical(LogicalLimit(limit1)), Logical(LogicalLimit(limit2))) =
               (input.get_operator(&_ctx)?, input[0].get_operator(&_ctx)?)
   {
       let new_limit = min(limit1.limit(), limit2.limit());
   
       let ret = input[0].clone_with_inputs(Logical(LogicalLimit(Limit::new(new_limit))));
   
       result.add(ret);
       Ok(())
   } else {
       bail!("Pattern miss matched")
   }
   ```
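   
   To make the pattern-matching and binding idea more concrete, here is a small self-contained sketch. The `Op`, `Node`, and `Pattern` types and the `bind` method are invented for this example and only approximate the builder shown above; they are assumptions, not the framework's real types.
   
   ```rust
   // Toy operators and plan nodes (hypothetical stand-ins for the framework's types).
   #[derive(Debug, Clone, PartialEq)]
   enum Op {
       Scan,
       Projection,
       Limit(usize),
   }
   
   #[derive(Debug, Clone, PartialEq)]
   struct Node {
       op: Op,
       inputs: Vec<Node>,
   }
   
   // A pattern is a predicate on the operator plus one pattern per input.
   // An empty `children` list marks a leaf: match the operator and stop descending.
   struct Pattern {
       predicate: fn(&Op) -> bool,
       children: Vec<Pattern>,
   }
   
   impl Pattern {
       // Returns the matched operators in pre-order, or None if the pattern does not match.
       fn bind<'a>(&self, node: &'a Node) -> Option<Vec<&'a Op>> {
           if !(self.predicate)(&node.op) {
               return None;
           }
           let mut bound = vec![&node.op];
           if self.children.is_empty() {
               return Some(bound);
           }
           if self.children.len() != node.inputs.len() {
               return None;
           }
           for (child_pattern, input) in self.children.iter().zip(&node.inputs) {
               bound.extend(child_pattern.bind(input)?);
           }
           Some(bound)
       }
   }
   
   fn is_limit(op: &Op) -> bool {
       matches!(op, Op::Limit(_))
   }
   
   // Pattern for a limit sitting directly on top of another limit.
   fn limit_over_limit() -> Pattern {
       Pattern {
           predicate: is_limit,
           children: vec![Pattern { predicate: is_limit, children: vec![] }],
       }
   }
   ```
   
   A rule would receive the bound operators (here the outer and inner limit) and build its replacement from them, so the rule body does not need to walk the plan itself.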
   
   The source code can be found here: https://github.com/liurenjie1024/rust-opt-framework
   
   Discussion and feedback are welcome.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liurenjie1024 commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1196202930

   You are welcome to join the discussion and development in the new repo https://github.com/datafusion-contrib/datafusion-dolomite, so we can close this issue.




[GitHub] [arrow-datafusion] alamb commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1193997008

   Sounds like a great idea to me -- sorry I haven't had a chance to review this @liurenjie1024  -- what would you like to call the repo in datafusion-contrib? I can make one for you




[GitHub] [arrow-datafusion] liurenjie1024 commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1169784027

   Hi @alamb @andygrove, I've finished a simple PoC; you can find the code here: https://github.com/liurenjie1024/rust-opt-framework/tree/main/src/datafusion_poc
   
   Here are the general ideas:
   
   1. To adopt the new heuristic optimizer, we can wrap `HeuristicOptimizer` as an optimizer rule, which works as follows:
   ```
   Datafusion Logical Plan -> Our Logical Plan -> HeuristicOptimizer -> Our Logical Plan -> Datafusion Logical Plan
   ```
   You can find an implementation here:
   https://github.com/liurenjie1024/rust-opt-framework/blob/main/src/datafusion_poc/rule.rs
   
   2. To adopt the new Cascades-style cost-based optimizer, we can implement a new `QueryPlanner`, which works as follows:
   ```
   Datafusion logical plan -> Our logical plan -> Cost based optimizer -> Our physical plan -> Datafusion physical plan
   ```
   You can find an implementation here:
   https://github.com/liurenjie1024/rust-opt-framework/blob/main/src/datafusion_poc/planner.rs
   
   3. For robust CBO behavior without statistics, I prefer a trivial cost model that, for example, adds a penalty to operators like sort and nested loop join. I don't have an implementation of this yet, but I think the optimizer framework is flexible enough for us to add it later.
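   
   A trivial cost model of the kind described in item 3 might look like the sketch below. Everything in it is an assumption made for illustration: the `PhysicalOp` and `PhysicalPlan` types, the function names, and the penalty values are invented and do not come from the PoC.
   
   ```rust
   // Toy physical operators (hypothetical; not the PoC's operator set).
   #[derive(Debug, Clone, PartialEq)]
   enum PhysicalOp {
       TableScan,
       Filter,
       Sort,
       HashJoin,
       NestedLoopJoin,
   }
   
   #[derive(Debug, Clone, PartialEq)]
   struct PhysicalPlan {
       op: PhysicalOp,
       inputs: Vec<PhysicalPlan>,
   }
   
   // Fixed per-operator cost that needs no statistics.
   // The numbers are arbitrary; only their relative order matters.
   fn operator_cost(op: &PhysicalOp) -> u64 {
       match op {
           PhysicalOp::TableScan | PhysicalOp::Filter => 1,
           PhysicalOp::HashJoin => 10,
           PhysicalOp::Sort => 100,             // penalty: sorting is expensive
           PhysicalOp::NestedLoopJoin => 1_000, // penalty: avoid unless nothing else fits
       }
   }
   
   // Cost of a plan is the cost of its operator plus the cost of all its inputs.
   fn plan_cost(plan: &PhysicalPlan) -> u64 {
       operator_cost(&plan.op) + plan.inputs.iter().map(plan_cost).sum::<u64>()
   }
   
   // Pick the cheapest of several equivalent candidate plans.
   fn cheapest(candidates: &[PhysicalPlan]) -> Option<&PhysicalPlan> {
       candidates.iter().min_by_key(|plan| plan_cost(plan))
   }
   ```
   
   With these penalties, a hash join over two scans costs 12 and a nested loop join over the same scans costs 1002, so `cheapest` prefers the hash join even though no row counts are known.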
   




[GitHub] [arrow-datafusion] alamb commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1140226908

   I looked at https://github.com/liurenjie1024/rust-opt-framework for a bit today -- it looks very neat and a good example of a more general purpose optimization framework.
   
   I would personally be very interested in seeing a proof of concept of this framework connected into DataFusion (i.e. instead of the existing hard-coded ordering https://github.com/apache/arrow-datafusion/blob/894be6719373be85fa777028fe3ec534536660e3/datafusion/core/src/execution/context.rs#L1259-L1278)
   
   The features I think would be valuable for DataFusion are:
   1. Easy to understand default behavior that is robust (in the sense it doesn't degrade horribly without statistics, etc)
   2. Customizable: some way to allow users of DataFusion to mix/match / mashup the existing optimizer passes and their own to tailor DataFusion to their own use 
   
   There are probably more things I haven't thought of yet




[GitHub] [arrow-datafusion] alamb commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1194039185

   I created https://github.com/datafusion-contrib/datafusion-dolomite and invited you as a maintainer -- let me know if you would like / want admin access as well to that repo




[GitHub] [arrow-datafusion] liurenjie1024 closed issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 closed issue #2633: Introducing a new optimizer framework for datafusion.
URL: https://github.com/apache/arrow-datafusion/issues/2633




[GitHub] [arrow-datafusion] andygrove commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1142833027

   Thanks for raising this @liurenjie1024. I am very interested in this effort since I will likely be spending time contributing to the optimizer rules in the coming weeks.




[GitHub] [arrow-datafusion] andygrove commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1171330375

   Thanks @liurenjie1024 I will review this next week.




[GitHub] [arrow-datafusion] liurenjie1024 commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1193584924

   I would like to donate this optimizer to datafusion-contrib so that we can develop it with the community.




[GitHub] [arrow-datafusion] liurenjie1024 commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1182809782

   @andygrove  @alamb PTAL when you are available.




[GitHub] [arrow-datafusion] liurenjie1024 commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1140231370

   > I looked at https://github.com/liurenjie1024/rust-opt-framework for a bit today -- it looks very neat and a good example of a more general purpose optimization framework.
   >
   > I would personally be very interested in seeing a proof of concept of this framework connected into DataFusion (i.e. instead of the existing hard-coded ordering https://github.com/apache/arrow-datafusion/blob/894be6719373be85fa777028fe3ec534536660e3/datafusion/core/src/execution/context.rs#L1259-L1278)
   >
   > The features I think would be valuable for DataFusion are:
   >
   > 1. Easy to understand default behavior that is robust (in the sense it doesn't degrade horribly without statistics, etc)
   > 2. Customizable: some way to allow users of DataFusion to mix/match / mashup the existing optimizer passes and their own to tailor DataFusion to their own use
   >
   > There are probably more things I haven't thought of yet
   
   Good suggestion. I will write a PoC that integrates with DataFusion.




[GitHub] [arrow-datafusion] liurenjie1024 commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1194019536

   > Sounds like a great idea to me -- sorry I haven't had a chance to review this @liurenjie1024  -- what would you like to call the repo in datafusion-contrib? I can make one for you
   
   Thanks for the response. I noticed that there is already a repo called datafusion-optimizer; how about calling this one datafusion-optimizer2?




[GitHub] [arrow-datafusion] liurenjie1024 commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1194052493

   > I created https://github.com/datafusion-contrib/datafusion-dolomite and invited you as a maintainer -- let me know if you would like / want admin access as well to that repo
   
   Thanks, I'll have a try.




[GitHub] [arrow-datafusion] alamb commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1183623515

   Hi @liurenjie1024  -- I'll try and find some time this weekend to review this




[GitHub] [arrow-datafusion] alamb commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1140049977

   Thanks @liurenjie1024  -- I ran out of time today to review this but I will try and find time tomorrow




[GitHub] [arrow-datafusion] liurenjie1024 commented on issue #2633: Introducing a new optimizer framework for datafusion.

Posted by GitBox <gi...@apache.org>.
liurenjie1024 commented on issue #2633:
URL: https://github.com/apache/arrow-datafusion/issues/2633#issuecomment-1139534342

   cc @mingmwang @alamb @andygrove 

