You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/24 03:54:39 UTC

[GitHub] [arrow-datafusion] pjmore commented on issue #1972: DataFusion Optimizer framework discussion

pjmore commented on issue #1972:
URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1077041378

TLDR: egg optimizer struggles with some things which have many representations and suffers due to poor rewrite scheduling. Has some nice properties, likely able to implement cascades based framework on top of egg. Egg might be better suited to planner/physical operating like cascades does, lowering the logical plan into physical operators.

I've been researching cascades a bit recently and here are some general thoughts on optimizer stuff. This is kind of off the top of my head with regards to the egg stuff so it might be disorganized since I haven't thought about it for a bit.

1. The current egg based optimizer works alright, but suffers heavily in certain situations due to rule scheduling issues/the number of potential queries exploding. Structured rule application like cascade does might help with that.
2. In a related issue egg does not handle large expressions/trees well due to the large number of potential representations. This is apparent when the filter predicate in TPCH 19 which is a fairly deep and wide expression. In release mode it can take ~5 only focused on optimizing the filter to properly transform the expression from (A OR B) OR (A OR C) OR (A OR D) -> A AND (B OR C OR D). Egg does not record which rules only need to be applied once which can cause a rule like A OR B -> B OR A to be run on already processed nodes doing nothing. This is mostly an issue when rules that are going to match a lot and have to match over a large number of equivalent representations. Extremely commonly a2. In a related issue egg does not handle large expressions/trees well due to the large number of potential representations. This is apparent when the filter predicate in TPCH 19 which is a fairly deep and wide expression. In release mode it can take ~5 only focused on optimizing the filt
er to properly transform the expression from (A OR B) OR (A OR C) OR (A OR D) -> A AND (B OR C OR D). Egg does not record which rules only need to be applied once which can cause a rule like A OR B -> B OR A to be run on already processed nodes doing nothing. This is mostly an issue when rules that are going to match a lot and have to match over a large number of equivalent representations. Extremely commonly applied rules that are almost always positive such the boolean simplification in TPCH 19 should probably exist as specialized optimization rules as they exist in their current form as they are likely to be faster. pplied rules that are almost always positive such the boolean simplification in TPCH 19 should probably exist as specialized optimization rules as they exist in their current form as they are likely to be faster.
3. Egg provides a lot of the properties that are desirable for a query optimizer, namely it already has a fast memo data structure and provides cost memoized cost function calculation. I'm unsure how well something like join order could would run within egg though.
4. Egg based optimizer might be better suited to physical optimization or as a planner where table statistics can be considered.
5. I think that the egg based optimizer and cascades actually a number of similarities, namely considering the memo datastructure that cascades uses.

Here is an image from [tikdb's article on their cascade optimizer](https://developpaper.com/unveiling-tidb-new-optimizer-analysis-of-cascade-planner-principle) that demonstrates general idea of a memo data structure.

![memo explaination](https://imgs.developpaper.com/imgs/2310010508-c90cfa59d4bf9972_articlex.jpg)

And from [egg's website](https://egraphs-good.github.io/), this image doesn't show up well on dark themed github so check it out on the website if you can't see it properly.

![Egg's demo](https://egraphs-good.github.io/egraphs.svg)

The dotted lines around expression are effectively the same thing as cascade's groups. I think egg might be slightly better for this as groups are automatically merged together if they share equivalent terms, I'm not sure if this is something that is typically done in cascades based optimizer.

Another [example from the CMU advanced database course]( https://15721.courses.cs.cmu.edu/spring2017/slides/15-optimizer2.pdf). The relevant slides are around slide 34.

You can see that an egraph, which egg uses, and the memo that cascade uses have a bunch of similar properties.
- Single node can be represented in many different ways.
- Parent <-> child relation ship is loosely coupled. Parent has group as child which can have multiple different representations.

I think that it is possible to build a cascade optimizer using egg as a fast memo structure. If we are going to build a more advanced optimization framework I think it makes sense to focus a bit more on the physical optimizations as nearly all of the logical optimizations could be done on the physical expressions instead, and optimizations like join order, which require table statistics/sampling, can cause order of magnitudes of difference in query performance. I think if this is a route that people are interested in pursuing this makes sense to develop a egg based planner which lowers the logical plan into a physical one, but I'd love to hear other people's thoughts on this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org