You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/29 10:21:02 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue #440: Future of experimental optimizer datafusion-tokomak

Dandandan opened a new issue #440:
URL: https://github.com/apache/arrow-datafusion/issues/440


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   This issue is for discussing the future of [datafusion-tokomak](https://github.com/Dandandan/datafusion-tokomak), an experimental optimizer using the [egg](https://github.com/egraphs-good/egg) library.
   
   It currently allows to optimize `Expr`s and contains many optimizations currently not done in DataFusion.
   I envision it could be extended to support a logical plan or physical plan too.
   
   The optimizer using egg has the following nice properties, which are hard to achieve otherwise:
   
   * Development of new rules is really easy, most of them can be added in one line, or they could hook up to a trait.
   * Performs more aggressive optimizations than a handwritten optimizer, as it applies multiple rules at once, and can apply and remember rewrites that don't reduce the cost.
   * Supports custom cost functions. It now uses the size of the AST, but could be easily changed to use something else.
   * Is fast for big programs / trees.
   * Is written in Rust, so integrates really well
   
   Some material about it here https://egraphs-good.github.io/
   
   **Describe the solution you'd like**
   Some options:
   
   *Integrate it into DataFusion, as an optional feature*
   
   *Add to DataFusion as separate crate*
   
   *Keep it in separate repo as is, do some releases to crates.io in sync with DataFusion releases*
   
   *Add as experimental repo / branch under the Apache organization *
   
   **Describe alternatives you've considered**
   n/a
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on issue #440: Future of experimental optimizer datafusion-tokomak

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #440:
URL: https://github.com/apache/arrow-datafusion/issues/440#issuecomment-968183228


   Thanks @pjmore . The DSL looks pretty great, I was thinking about this while reviewing your initial PR the other day. Perhaps something you can upstream to egg as well?
   
   I also agree with @Dandandan that we can merge your code into the main repo early without having to wait for it to be feature complete. Can't wait for this cool feature to land in master :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #440: Future of experimental optimizer datafusion-tokomak

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #440:
URL: https://github.com/apache/arrow-datafusion/issues/440#issuecomment-854011755


   fwiw:
   
   * Thanks a lot for this!
   * I fully agree
   * Ship it as you see fit; let me know if you get blocked by any bureaucracy


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #440: Future of experimental optimizer datafusion-tokomak

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #440:
URL: https://github.com/apache/arrow-datafusion/issues/440#issuecomment-962871520


   Thanks for this @pjmore . Great milestone to have the new query working. Also wondering if some other cross joins are removed too :).
   I think at some point it makes sense to "just add" the code to the repo so it becomes easier to continue experimenting. Feel free when you believe it's time to do so :).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] pjmore commented on issue #440: Future of experimental optimizer datafusion-tokomak

Posted by GitBox <gi...@apache.org>.
pjmore commented on issue #440:
URL: https://github.com/apache/arrow-datafusion/issues/440#issuecomment-962503478


   Some good and bad news on this front. The Tokomak optimization pass combined with a predicate pushdown pass and a filter<-cross join to filter<-inner join pass is able to handle TPCH Q19 with each iteration taking ~66ms on my laptop. It performs the AstSize cost function transforms the expression into the form that @alamb mentioned in #217. Which gives the optimized logical plan:
   ```
   Projection: #SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)
     Aggregate: groupBy=[[]], aggr=[[SUM(#lineitem.l_extendedprice * Int64(1) - #lineitem.l_discount)]]
       Projection: #lineitem.l_extendedprice, #lineitem.l_discount
         Filter: Boolean(true)
           Filter: #part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"), Utf8("MED PKG"), Utf8("MED PACK")]) AND #part.p_size BETWEEN Int64(1) AND Int64(10) AND #part.p_brand = Utf8("Brand#23") AND #lineitem.l_quantity BETWEEN Int64(10) AND Int64(20) OR #part.p_container IN ([Utf8("LG CASE"), Utf8("LG BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND #part.p_brand = Utf8("Brand#34") AND #part.p_size BETWEEN Int64(1) AND Int64(15) AND #lineitem.l_quantity BETWEEN Int64(20) AND Int64(30) OR #lineitem.l_quantity BETWEEN Int64(1) AND Int64(11) AND #part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM PACK"), Utf8("SM PKG")]) AND #part.p_brand = Utf8("Brand#12") AND #part.p_size BETWEEN Int64(1) AND Int64(5)
             Join: #lineitem.l_partkey = #part.p_partkey
               Filter: #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AND #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")
                 TableScan: lineitem projection=Some([1, 4, 5, 6, 13, 14])
               TableScan: part projection=Some([0, 3, 5, 6])
   
   ```
   
   Bad news is that I had to use a dev branch of egg due to Send requirements so merging the Tokomak optimizer may have to wait until the next version of egg releases. On the bright side there is plenty of work  that needs to be done before I feel the optimizer is ready anyways:
   * Allow users to create their own rules that will run alongside the existing rules.
   * Allow users to create custom cost functions.
   * Allow users to run optimizer with a custom rewrite scheduler.
   * Extend the optimizer to allow it to operate on plans as well as expressions.
   * Add the Case expr, this particular expression does not mesh well with current rewrite syntax.
   * Create DSL which is better suited for rewriting expressions which involve lists and expressions that have to be of a particular type. The current rule syntax falls short when rewriting to or from lists, there is no real way to implement the rule
   ``` col1 = 'one' or col1 = 'two' or col2='three' => col1 InList('one','two','three')```
   An example of the need for typed expressions would be when rewriting a filter followed by a cross join into a inner join, as the optimization below is only valid when the :
   ```
   Filter: #lineitem.l_partkey = #part.p_partkey
       CrossJoin: 
           TableScan: part
           TableScan: lineitem
   -> 
   Join:  #lineitem.l_partkey = #part.p_partkey
       TableScan: part
       TableScan: lineitem
   ```
   as the in this case both the left and the right of the filter predicate must be columns for this to be a valid transform.
   Another capability that would be nice to have would be to bind to operators in the same way as expression so instead of the rules for rotating the expression tree:
   ```
   rw!("rotate-add"; "(+ ?x (+ ?y ?z))"=>"(+ (+ ?x ?y) ?z)"),
   rw!("rotate-mul"; "(* ?x (* ?y ?z))"=>"(* (* ?x ?y) ?z)"),            
   rw!("rotate-and"; "(and ?a (and ?b ?c))"=>"(and (and ?a ?b) ?c)"),
   rw!("rotate-or"; "(or ?a (or ?b ?c))"=>"(or (or ?a ?b) ?c)"),
   ```
   Could instead write something along the lines of:
   ```
   rw!("rotate-expr-homogenous"; "(?op:(+|*|and|or) ?x (?op ?y ?z))" =>"(?op (?op ?x ?y) ?z)")
   ```
   
   * Determine whether combining expression and plan optimization at the same time is viable. Would probably require custom scheduler. Would complicate writing cost function but it might allow better optimizations to be performed. 
   
   I'm going to keep working on this in a separate repo for now because it's a long way from being what I would consider being complete.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #440: Future of experimental optimizer datafusion-tokomak

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #440:
URL: https://github.com/apache/arrow-datafusion/issues/440#issuecomment-854011755


   fwiw:
   
   * Thanks a lot for this!
   * I fully agree
   * Ship it as you see fit; let me know if you get blocked by any bureaucracy


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #440: Future of experimental optimizer datafusion-tokomak

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #440:
URL: https://github.com/apache/arrow-datafusion/issues/440#issuecomment-962505988


   Thank you for the update @pjmore  -- sounds like some great progress


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org