Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 13:23:11 UTC

[GitHub] [arrow-datafusion] alamb commented on issue #116: Address limitations of logical expression rewrite logic

alamb commented on issue #116:
URL: https://github.com/apache/arrow-datafusion/issues/116#issuecomment-826831711


   Comment from Daniël Heres (Dandandan) @ 2020-11-24T18:56:34.496+0000:
   A summary of some points I found when coming up with a new framework for Polars ([https://github.com/ritchie46/polars]) that uses some ideas from the Spark Catalyst optimizer [1].
   
   Some points I liked about the design from Spark Catalyst:
    * Recursion on the tree should only be needed once, not once per optimization pass. So every rule can be written as simple pattern matching on a single node, and the recursion itself can be captured in some kind of framework.
    * A large number of optimization rules should run a number of times to reach a fixed point, i.e. they keep running until the logical plan doesn't change anymore. If doing this, it is important that all of your optimization rules only make the tree "smaller" in some sense: each rule should either reduce the number of nodes or make the plan "cheaper", otherwise the loop may never terminate. A minimal sketch of such a framework follows this list.
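
   As a rough illustration of that shape (a minimal sketch, not the actual Catalyst, DataFusion, or Polars code; the Expr variants, Rule trait, and FoldConstants names are made up for the example), each rule is a plain pattern match on one node, and the framework owns both the single recursion and the fixed-point loop:

    #[derive(Clone, Debug, PartialEq)]
    enum Expr {
        Literal(i64),
        Add(Box<Expr>, Box<Expr>),
        Multiply(Box<Expr>, Box<Expr>),
    }

    /// A rule only looks at a single node; it never recurses itself.
    trait Rule {
        /// Return Some(new_expr) if the rule applies, None otherwise.
        fn rewrite(&self, expr: &Expr) -> Option<Expr>;
    }

    /// Constant folding: `1 + 2` becomes `3`.
    struct FoldConstants;

    impl Rule for FoldConstants {
        fn rewrite(&self, expr: &Expr) -> Option<Expr> {
            match expr {
                Expr::Add(l, r) => match (l.as_ref(), r.as_ref()) {
                    (Expr::Literal(a), Expr::Literal(b)) => Some(Expr::Literal(a + b)),
                    _ => None,
                },
                Expr::Multiply(l, r) => match (l.as_ref(), r.as_ref()) {
                    (Expr::Literal(a), Expr::Literal(b)) => Some(Expr::Literal(a * b)),
                    _ => None,
                },
                _ => None,
            }
        }
    }

    /// One bottom-up pass: rewrite the children first, then let every rule
    /// have a go at the node itself.
    fn apply_once(expr: &Expr, rules: &[Box<dyn Rule>]) -> Expr {
        let rewritten = match expr {
            Expr::Add(l, r) => Expr::Add(
                Box::new(apply_once(l, rules)),
                Box::new(apply_once(r, rules)),
            ),
            Expr::Multiply(l, r) => Expr::Multiply(
                Box::new(apply_once(l, rules)),
                Box::new(apply_once(r, rules)),
            ),
            other => other.clone(),
        };
        rules
            .iter()
            .fold(rewritten, |e, rule| rule.rewrite(&e).unwrap_or(e))
    }

    /// Keep running all rules until the plan no longer changes (fixed point).
    fn optimize(mut expr: Expr, rules: &[Box<dyn Rule>]) -> Expr {
        loop {
            let next = apply_once(&expr, rules);
            if next == expr {
                return expr;
            }
            expr = next;
        }
    }

    fn main() {
        // (1 + 2) * 3  ==>  9
        let plan = Expr::Multiply(
            Box::new(Expr::Add(
                Box::new(Expr::Literal(1)),
                Box::new(Expr::Literal(2)),
            )),
            Box::new(Expr::Literal(3)),
        );
        let rules: Vec<Box<dyn Rule>> = vec![Box::new(FoldConstants)];
        assert_eq!(optimize(plan, &rules), Expr::Literal(9));
    }

   The first pass already folds the whole expression and the second pass only confirms the fixed point; the "tree only gets smaller" property of each rule is what makes the outer loop terminate.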
   
   The optimizer I made for Polars is in very early stages, but I did some design work and first iterations to come up with a first version of an optimization framework.
    * In Rust, if the elements in your tree are Boxed, you need to clone part of the tree when you want to mutate part of it, so simply recursing over the tree and mutating it, as you would in Scala, is not possible without changes. You could wrap everything in something like Arc/Rc<RefCell<...>>, but that has higher overhead. You could also generate a whole new tree every iteration, but that is quite a bit slower, especially if you do it per optimization rule, which adds up quickly! (A toy example of the RefCell approach follows below.)
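
   For illustration only (a toy sketch, not Polars or DataFusion code), the Rc/RefCell variant mentioned above looks roughly like this; it allows rewriting any part of the tree in place, at the cost of reference counting and runtime borrow checks (the thread-safe version would use Arc plus a lock instead):

    use std::cell::RefCell;
    use std::rc::Rc;

    // Shared, interiorly mutable tree nodes.
    #[derive(Debug)]
    enum Expr {
        Literal(i64),
        Add(Rc<RefCell<Expr>>, Rc<RefCell<Expr>>),
    }

    fn main() {
        let lhs = Rc::new(RefCell::new(Expr::Literal(1)));
        let rhs = Rc::new(RefCell::new(Expr::Literal(2)));
        let root = Expr::Add(Rc::clone(&lhs), Rc::clone(&rhs));

        // Rewrite a leaf in place without rebuilding or cloning the rest of the tree.
        *lhs.borrow_mut() = Expr::Literal(40);

        if let Expr::Add(l, r) = &root {
            if let (Expr::Literal(a), Expr::Literal(b)) = (&*l.borrow(), &*r.borrow()) {
                println!("{}", a + b); // prints 42
            }
        }
    }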
   
   Some points where this first iteration differs from the optimizer in Spark:
    * It uses a tree backed by an arena to efficiently allocate data for the tree and mutate it. This means that if you don't generate new nodes, you don't even allocate; you just point some indices at different nodes. A tree in an arena is also very nice for data locality. (A sketch follows this list.)
   
   The arena brings a bit more unsafety, as you have to make sure yourself that the node indices you hand around stay valid and point at the nodes you think they do.
    * It uses manual recursion (with a pre-allocated stack) instead of the call stack to recurse. This is a bit uglier, but if you only write it once it can be worth it for performance (also shown in the sketch after this list).
   
    * In Catalyst, only a single optimization rule runs until it reaches a fixed point, and only then does the optimizer move on to the next rule. In the Polars version, all rules run in the inner loop until the whole optimization reaches a fixed point. The benefit is that you don't have to worry as much about the order of the rules. It can also bring _more_ optimizations: e.g. a rule that evaluates some expressions can enable a rule that propagates nulls, which in turn can enable predicate pushdown, and so on.
   
    * In Catalyst you have to note whether your optimization needs to recurse top-down or bottom-up (bottom-up is, for example, more useful for constant folding, as otherwise you would need lots of iterations to fold a complex constant expression). In this optimizer, the framework does both itself by re-optimizing a node right after it changed. This means the optimizer generally needs fewer iterations, and you need to think less about recursion direction.
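
   To make the first two bullets of this list concrete, here is a rough sketch of an arena-backed expression tree (made up for illustration, not the actual Polars internals; the Arena, Node, and NodeId names are assumptions). A rewrite just overwrites a slot instead of allocating, and the traversal is driven by an explicit, pre-allocated stack instead of call-stack recursion:

    /// Index of a node inside the arena.
    #[derive(Copy, Clone, Debug, PartialEq)]
    struct NodeId(usize);

    /// Nodes are small and Copy; they refer to their children by index.
    #[derive(Copy, Clone, Debug)]
    enum Node {
        Literal(i64),
        Add(NodeId, NodeId),
    }

    /// The arena: one flat Vec that owns every node, which is friendly to
    /// data locality and lets a rewrite reuse an existing slot.
    #[derive(Default)]
    struct Arena {
        nodes: Vec<Node>,
    }

    impl Arena {
        fn push(&mut self, node: Node) -> NodeId {
            self.nodes.push(node);
            NodeId(self.nodes.len() - 1)
        }
        fn get(&self, id: NodeId) -> Node {
            self.nodes[id.0]
        }
        /// Rewriting a node does not allocate: just overwrite the slot.
        fn replace(&mut self, id: NodeId, node: Node) {
            self.nodes[id.0] = node;
        }
    }

    /// Fold `literal + literal` everywhere, driving the traversal with an
    /// explicit, reusable stack instead of recursive function calls.
    fn fold_constants(arena: &mut Arena, root: NodeId, stack: &mut Vec<NodeId>) {
        stack.clear();
        stack.push(root);
        while let Some(id) = stack.pop() {
            if let Node::Add(l, r) = arena.get(id) {
                match (arena.get(l), arena.get(r)) {
                    (Node::Literal(a), Node::Literal(b)) => {
                        // Children are already constants: rewrite this slot in place.
                        arena.replace(id, Node::Literal(a + b));
                    }
                    _ => {
                        // Revisit this node after its children have been folded.
                        stack.push(id);
                        stack.push(l);
                        stack.push(r);
                    }
                }
            }
        }
    }

    fn main() {
        let mut arena = Arena::default();
        let one = arena.push(Node::Literal(1));
        let two = arena.push(Node::Literal(2));
        let three = arena.push(Node::Literal(3));
        let sum = arena.push(Node::Add(one, two));
        let root = arena.push(Node::Add(sum, three));

        // Pre-allocated stack, reused across optimization passes.
        let mut stack = Vec::with_capacity(64);
        fold_constants(&mut arena, root, &mut stack);

        assert!(matches!(arena.get(root), Node::Literal(6)));
    }

   The NodeId indices are presumably where the "bit more unsafety" mentioned above comes from: nothing stops a stale index from pointing at the wrong slot, so the optimizer has to keep its indices consistent itself.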
   
   TODO for design in Polars:
    * Some optimization rules can be more expensive than others. It might make sense to keep track of each node individually to check whether it changed, so that expensive rules only re-run on the parts of the plan that actually changed.
    * Different optimization rules need different input, like the schema/type of a column, etc.
    * Some optimizations need to keep track of state; this is not yet handled in this optimizer.
   
    [1] http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org