You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Stamatis Zampetakis <za...@gmail.com> on 2023/05/01 21:24:46 UTC

Re: Rewrite rule to convert self-joins into scans

Hi Hanumath,

Thanks for putting this doc together; the examples are helpful and I
find the approach interesting.

CTEs and redundant joins have similarities but I see them as different
problems. CTEs are patterns that appear multiple times in a query but
are not necessarily redundant.

Apart from identifying CTEs I am actually pretty interested in how to
represent them. It may be helpful to add some concrete examples/plans
covering the representation part in the document you shared.

A few weeks ago, I created a doc [1] with some preliminary ideas about
CTEs in the context of Hive covering briefly:
* representation
* identification
* logical -> physical for Hive

It would be great if we can converge in a representation that is
useful for multiple projects and built on top of that. The
identification part is somewhat pluggable and it could be different
depending on the use-case.

Best,
Stamatis

[1] https://github.com/zabetak/docs/blob/main/2023-03-09-Hive-cbo-cte-optimization/index.md

On Sun, Apr 30, 2023 at 10:08 PM Hanumath Maduri
<ha...@gmail.com> wrote:
>
> Dear Calcite Devs,
>
> I would like to start prototyping the approach for the optimization
> mentioned in (CALCITE-5631
> <https://issues.apache.org/jira/browse/CALCITE-5631>). So I just want to
> follow up on the approach mentioned in the above document.
> Please let me know if this approach seems reasonable to all of you.
>
> @Stamatis and @Julian,  Please let me know if you got a chance to take a
> look at the document and are in-line with the approach.
> Any suggestions/clarifications on this approach would be very helpful.
>
> Thanks,
>
>
> On Mon, Apr 24, 2023 at 7:25 PM Hanumath Maduri <ha...@gmail.com>
> wrote:
>
> >
> > Thanks Julian and Stamatis for sharing your thoughts on this optimization.
> > I have jotted down a general approach which could optimize the common
> > subexpression with join elimination use cases in the following document.
> > Please go through this
> > <https://docs.google.com/document/d/1rF6sZOipfXe2UdBViSbIF33FMXyz5Ev_s-q4IMe8gEI/edit?usp=sharing>
> > document and let me know what you think of this approach.
> >
> > In gist the approach mentioned in this document is
> > 1. Finding the common subexpressions
> > 2. Using the subexpression information to eliminate the joins when
> > possible.
> >
> > I did think about implementing this optimization in the decorrelator as
> > well, but this approach was not explored further because it could miss
> > finding non correlated common sub expressions.
> >
> > Let me know if you have any questions or suggestions on this approach.
> >
> > Thanks,
> > Hanu
> >
> >
> >
> >
> > On Wed, Apr 19, 2023 at 11:15 AM Julian Hyde <jh...@gmail.com>
> > wrote:
> >
> >> Thank you, Stamatis. This is helpful. I have linked to this email thread
> >> in CALCITE-5631.
> >>
> >> > On Apr 16, 2023, at 2:44 AM, Stamatis Zampetakis <za...@gmail.com>
> >> wrote:
> >> >
> >> > Few quick thoughts about this.
> >> >
> >> > For to the problem of query minimization/redundant joins the simpler
> >> > scenarios that I can think of are the following:
> >> >
> >> > # Scenario A
> >> > select e1.id from emp e1 inner join emp e2 on e1.name = e2.name;
> >> >
> >> > If you know the name column is UNIQUE then you can drop the join on e2.
> >> >
> >> > # Scenario B
> >> > select e.name from emp e inner join dept d on e.dept = d.id;
> >> >
> >> > If you know that e.dept and d.id is a foreign key relationship then
> >> > you can drop the join on dept.
> >> >
> >> > There are probably other cases to define and handle but we should move
> >> > incrementally.
> >> >
> >> > As Julian pointed out, the issue logged  in CALITE-5631 could also be
> >> > addressed by employing common table expression related optimizations.
> >> > CTE optimizations and query minimization are both interesting and
> >> > powerful techniques to reduce the cost of the query (whatever that is
> >> > speed, money, resources, etc).
> >> >
> >> > I would suggest focusing on query minimization first since it is
> >> > pretty well defined and we could come up with solutions much faster
> >> > than CTEs. CTEs usually come up with decisions about materializing or
> >> > not the common expressions which are closer to lower level ("physical
> >> > plan") optimizations.
> >> >
> >> > Most minimization techniques focus on select project join (SPJ)
> >> > queries so I guess we would have to do some preprocessing to bring the
> >> > plan in this format (only Scan, Project, Filter, and Join operators)
> >> > before applying the rule. It would be a separate planning phase
> >> > combining a bunch of existing rules followed by some new which is
> >> > inline with what Julian was saying about bottom-up unification.
> >> >
> >> > The new rule could be something similar to LoptOptimizeJoinRule that
> >> > operates on a MultiJoin. I haven't checked if the MultiJoin operator
> >> > is sufficient to express an SPJ query but I think the general idea of
> >> > grouping joins together seems to be a promising direction for writing
> >> > new rules.
> >> >
> >> > Best,
> >> > Stamatis
> >> >
> >> > On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <jh...@apache.org> wrote:
> >> >>
> >> >> Ian Bertolacci recently logged
> >> >> https://issues.apache.org/jira/browse/CALCITE-5631, to convert
> >> >>
> >> >>  select
> >> >>     (select numarrayagg(C5633_203) from T893 where C5633_586 =
> >> T895.id),
> >> >>     (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id)
> >> >>  from T895
> >> >>
> >> >> into
> >> >>
> >> >>  select agg.agg1,
> >> >>      agg.agg2
> >> >>  from T895
> >> >>  left join (
> >> >>    select C5633_586,
> >> >>        numarrayagg(C5633_203) as agg1,
> >> >>        numarrayagg(C5633_170) as agg2
> >> >>    from T893
> >> >>    where C5633_586 is not null
> >> >>    group by C5633_586) as agg
> >> >>  on agg.C5633_586 = T895.id
> >> >>
> >> >> This seems to me an interesting and important problem. But it's also a
> >> >> hard problem, and it's not clear to me which approach is the best.
> >> >> Does anyone have any ideas for how to approach it?
> >> >>
> >> >> Also, we could use more example queries that illustrate the general
> >> >> pattern.  (Preferably in terms of simple databases such as EMP and
> >> >> DEPT.)
> >> >>
> >> >> In Calcite rewrite rules (RelRule) are usually the preferred approach.
> >> >> Because the common relational expressions scans can be an arbitrary
> >> >> distance apart in the RelNode tree, RelRule doesn't seem suitable.
> >> >>
> >> >> There seem to be some similarities to algorithms to use materialized
> >> >> views, which use bottom-up unification.
> >> >>
> >> >> Ian's original query actually has correlated scalar sub-queries rather
> >> >> than explicit joins. Would it be better to target common sub-queries
> >> >> rather than joins?
> >> >>
> >> >> Lastly, there are similarities with the WinMagic algorithm, which
> >> >> converts correlated sub-queries into window aggregates. Is that a
> >> >> useful direction? (My implementation of measures in CALCITE-4496
> >> >> naturally creates correlated scalar sub-queries that can be inlined in
> >> >> the enclosing query if simple, or converted to window aggregates if
> >> >> more complex.)
> >> >>
> >> >> Julian
> >>
> >>

Re: Rewrite rule to convert self-joins into scans

Posted by Hanumath Maduri <ha...@gmail.com>.

Hello Stamatis,

Thank you for the comments and also providing the relevant
information/links. The page and code which you have shared is very helpful.

I have updated the document with more information in the new subheadings *CTE
Table Representation* and* Detailed explanation* of how the query goes
through these phases.
I think I wasn't clear earlier in the document to represent CTEs. All along
I was under the assumption to use Spool Node (which is provided by
TableSpool operator in calcite). I mentioned this explicitly now in the
document.

In regards to the RelToSql conversion with TableSpool Node, I agree with
you to use WITH construct during the conversion. However, it is upto the
engine to decide if it uses the WITH construct to merge them during the
execution.

As regards to where these optimization phases can be triggered, I think the
new phases(mentioned in the document) are more beneficial to be triggered
at the end of the logical optimizations and before physical node
conversion. Please let me know if you have any suggestions on this.

Please help me understand if you are expecting more details other than
mentioned in the document (I think we can get over a call to discuss
further). I can also start working on the pseudo code so as to provide more
details.

@Zoltán Haindrich Thanks for reviewing the document. I have replied to the
questions posted in the document, please take a look when you get a chance.

Thanks,
Hanu

On Mon, May 1, 2023 at 2:25 PM Stamatis Zampetakis <za...@gmail.com>
wrote:

> Hi Hanumath,
>
> Thanks for putting this doc together; the examples are helpful and I
> find the approach interesting.
>
> CTEs and redundant joins have similarities but I see them as different
> problems. CTEs are patterns that appear multiple times in a query but
> are not necessarily redundant.
>
> Apart from identifying CTEs I am actually pretty interested in how to
> represent them. It may be helpful to add some concrete examples/plans
> covering the representation part in the document you shared.
>
> A few weeks ago, I created a doc [1] with some preliminary ideas about
> CTEs in the context of Hive covering briefly:
> * representation
> * identification
> * logical -> physical for Hive
>
> It would be great if we can converge in a representation that is
> useful for multiple projects and built on top of that. The
> identification part is somewhat pluggable and it could be different
> depending on the use-case.
>
> Best,
> Stamatis
>
> [1]
> https://github.com/zabetak/docs/blob/main/2023-03-09-Hive-cbo-cte-optimization/index.md
>
> On Sun, Apr 30, 2023 at 10:08 PM Hanumath Maduri
> <ha...@gmail.com> wrote:
> >
> > Dear Calcite Devs,
> >
> > I would like to start prototyping the approach for the optimization
> > mentioned in (CALCITE-5631
> > <https://issues.apache.org/jira/browse/CALCITE-5631>). So I just want to
> > follow up on the approach mentioned in the above document.
> > Please let me know if this approach seems reasonable to all of you.
> >
> > @Stamatis and @Julian,  Please let me know if you got a chance to take a
> > look at the document and are in-line with the approach.
> > Any suggestions/clarifications on this approach would be very helpful.
> >
> > Thanks,
> >
> >
> > On Mon, Apr 24, 2023 at 7:25 PM Hanumath Maduri <
> hanumathmaduri@gmail.com>
> > wrote:
> >
> > >
> > > Thanks Julian and Stamatis for sharing your thoughts on this
> optimization.
> > > I have jotted down a general approach which could optimize the common
> > > subexpression with join elimination use cases in the following
> document.
> > > Please go through this
> > > <
> https://docs.google.com/document/d/1rF6sZOipfXe2UdBViSbIF33FMXyz5Ev_s-q4IMe8gEI/edit?usp=sharing
> >
> > > document and let me know what you think of this approach.
> > >
> > > In gist the approach mentioned in this document is
> > > 1. Finding the common subexpressions
> > > 2. Using the subexpression information to eliminate the joins when
> > > possible.
> > >
> > > I did think about implementing this optimization in the decorrelator as
> > > well, but this approach was not explored further because it could miss
> > > finding non correlated common sub expressions.
> > >
> > > Let me know if you have any questions or suggestions on this approach.
> > >
> > > Thanks,
> > > Hanu
> > >
> > >
> > >
> > >
> > > On Wed, Apr 19, 2023 at 11:15 AM Julian Hyde <jh...@gmail.com>
> > > wrote:
> > >
> > >> Thank you, Stamatis. This is helpful. I have linked to this email
> thread
> > >> in CALCITE-5631.
> > >>
> > >> > On Apr 16, 2023, at 2:44 AM, Stamatis Zampetakis <zabetak@gmail.com
> >
> > >> wrote:
> > >> >
> > >> > Few quick thoughts about this.
> > >> >
> > >> > For to the problem of query minimization/redundant joins the simpler
> > >> > scenarios that I can think of are the following:
> > >> >
> > >> > # Scenario A
> > >> > select e1.id from emp e1 inner join emp e2 on e1.name = e2.name;
> > >> >
> > >> > If you know the name column is UNIQUE then you can drop the join on
> e2.
> > >> >
> > >> > # Scenario B
> > >> > select e.name from emp e inner join dept d on e.dept = d.id;
> > >> >
> > >> > If you know that e.dept and d.id is a foreign key relationship then
> > >> > you can drop the join on dept.
> > >> >
> > >> > There are probably other cases to define and handle but we should
> move
> > >> > incrementally.
> > >> >
> > >> > As Julian pointed out, the issue logged  in CALITE-5631 could also
> be
> > >> > addressed by employing common table expression related
> optimizations.
> > >> > CTE optimizations and query minimization are both interesting and
> > >> > powerful techniques to reduce the cost of the query (whatever that
> is
> > >> > speed, money, resources, etc).
> > >> >
> > >> > I would suggest focusing on query minimization first since it is
> > >> > pretty well defined and we could come up with solutions much faster
> > >> > than CTEs. CTEs usually come up with decisions about materializing
> or
> > >> > not the common expressions which are closer to lower level
> ("physical
> > >> > plan") optimizations.
> > >> >
> > >> > Most minimization techniques focus on select project join (SPJ)
> > >> > queries so I guess we would have to do some preprocessing to bring
> the
> > >> > plan in this format (only Scan, Project, Filter, and Join operators)
> > >> > before applying the rule. It would be a separate planning phase
> > >> > combining a bunch of existing rules followed by some new which is
> > >> > inline with what Julian was saying about bottom-up unification.
> > >> >
> > >> > The new rule could be something similar to LoptOptimizeJoinRule that
> > >> > operates on a MultiJoin. I haven't checked if the MultiJoin operator
> > >> > is sufficient to express an SPJ query but I think the general idea
> of
> > >> > grouping joins together seems to be a promising direction for
> writing
> > >> > new rules.
> > >> >
> > >> > Best,
> > >> > Stamatis
> > >> >
> > >> > On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <jh...@apache.org>
> wrote:
> > >> >>
> > >> >> Ian Bertolacci recently logged
> > >> >> https://issues.apache.org/jira/browse/CALCITE-5631, to convert
> > >> >>
> > >> >>  select
> > >> >>     (select numarrayagg(C5633_203) from T893 where C5633_586 =
> > >> T895.id),
> > >> >>     (select numarrayagg(C5633_170) from T893 where C5633_586 =
> T895.id)
> > >> >>  from T895
> > >> >>
> > >> >> into
> > >> >>
> > >> >>  select agg.agg1,
> > >> >>      agg.agg2
> > >> >>  from T895
> > >> >>  left join (
> > >> >>    select C5633_586,
> > >> >>        numarrayagg(C5633_203) as agg1,
> > >> >>        numarrayagg(C5633_170) as agg2
> > >> >>    from T893
> > >> >>    where C5633_586 is not null
> > >> >>    group by C5633_586) as agg
> > >> >>  on agg.C5633_586 = T895.id
> > >> >>
> > >> >> This seems to me an interesting and important problem. But it's
> also a
> > >> >> hard problem, and it's not clear to me which approach is the best.
> > >> >> Does anyone have any ideas for how to approach it?
> > >> >>
> > >> >> Also, we could use more example queries that illustrate the general
> > >> >> pattern.  (Preferably in terms of simple databases such as EMP and
> > >> >> DEPT.)
> > >> >>
> > >> >> In Calcite rewrite rules (RelRule) are usually the preferred
> approach.
> > >> >> Because the common relational expressions scans can be an arbitrary
> > >> >> distance apart in the RelNode tree, RelRule doesn't seem suitable.
> > >> >>
> > >> >> There seem to be some similarities to algorithms to use
> materialized
> > >> >> views, which use bottom-up unification.
> > >> >>
> > >> >> Ian's original query actually has correlated scalar sub-queries
> rather
> > >> >> than explicit joins. Would it be better to target common
> sub-queries
> > >> >> rather than joins?
> > >> >>
> > >> >> Lastly, there are similarities with the WinMagic algorithm, which
> > >> >> converts correlated sub-queries into window aggregates. Is that a
> > >> >> useful direction? (My implementation of measures in CALCITE-4496
> > >> >> naturally creates correlated scalar sub-queries that can be
> inlined in
> > >> >> the enclosing query if simple, or converted to window aggregates if
> > >> >> more complex.)
> > >> >>
> > >> >> Julian
> > >>
> > >>
>