You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Haisheng Yuan <h....@alibaba-inc.com> on 2019/06/04 00:26:57 UTC

Correlate Join SemiJoin transformation

Hi,

Since we have commited CALCITE-2969, and in the next release of 1.21, we will deprecate EquiJoin and make enumerable hash join support non-equi join condition, the optimization door is open to us.

Currently Calcite often generates complicated and inefficient plan for boolean context correlated subquery (putting value context subquery aside, which is more complex during unnesting). e.g.
https://issues.apache.org/jira/browse/CALCITE-2857
https://issues.apache.org/jira/browse/CALCITE-2874

There are many more cases if you want to try and find out. Some underlying reasons are:
1. SemJoin doesn't support non-equi condition
2. Correlate comes with Aggregate in many cases

Sometimes, SubqueryRemoveRule generates Correlate with Aggregate, sometimes generates Join with Aggregate. Then we use SemiJoinRule to match Join(Rel, Agg) pattern to see whether we can transform it to semijoin or not. This is counter intuitive, because we already know it will be a SemiJoin or AntiSemi when we generate Correlate for the subquery. In current way, we may miss the chance to get back SemiJoin for it, and we almost lose the opportunity to do IndexScan if there is index available.

In addition, I see 2 rules JoinAddRedundantSemiJoinRule and SemiJoinRemoveRule seem to provide the chance to do index scan. But through the test cases related with the 2 rules, I don't see any chance for a better plan with them and they don't seem to be able to generate plan with index scan. I doubt whether they are used at all. Can someone shed some light on this and give us some examples, if I get it wrong?

The transformation flow for boolean context sub-query in my head is:
sub-query -> Correate(Semi/AntiSemi) without Aggregate (SubqueryRemoveRule)
Correlate(Semi/AntiSemi) -> NestedLoopJoin (correlate implementation rule)
Correlate(Semi/AntiSemi) -> IndexedNestedLoopJoin (index apply rule)
Correlate(Semi/AntiSemi) -> SemiJoin/AntiSemiJoin (Semi Correlate to Semi Join rule)
SemiJoin -> InnerJoin(rel, Agg) (SemiJoin to Join rule)
InnerJoin(rel, Agg) ->  InnerJoin(Agg, rel) (Join reorder rule)
SemiJoin/InnerJoin -> HashJoin/MergeJoin/NLJ (Join implementation rule)

This is a long shot, and will involve tons of changes.
Any thoughts?

- Haisheng Yuan

Re: Correlate Join SemiJoin transformation

Posted by Yuzhao Chen <yu...@gmail.com>.
I think the complexity mostly comes from the value generator, which is the engineering foundation of current decorrelation.

Best,
Danny Chan
在 2019年6月5日 +0800 AM10:45,Chunwei Lei <ch...@gmail.com>,写道:
> Thanks for raising this, Haisheng.
>
> Do you mean that we should have better subquery unnesting?
>
>
>
>
> Best,
> Chunwei
>
>
> On Tue, Jun 4, 2019 at 8:27 AM Haisheng Yuan <h....@alibaba-inc.com> wrote:
>
> > Hi,
> >
> > Since we have commited CALCITE-2969, and in the next release of 1.21, we
> > will deprecate EquiJoin and make enumerable hash join support non-equi join
> > condition, the optimization door is open to us.
> >
> > Currently Calcite often generates complicated and inefficient plan for
> > boolean context correlated subquery (putting value context subquery aside,
> > which is more complex during unnesting). e.g.
> > https://issues.apache.org/jira/browse/CALCITE-2857
> > https://issues.apache.org/jira/browse/CALCITE-2874
> >
> > There are many more cases if you want to try and find out. Some underlying
> > reasons are:
> > 1. SemJoin doesn't support non-equi condition
> > 2. Correlate comes with Aggregate in many cases
> >
> > Sometimes, SubqueryRemoveRule generates Correlate with Aggregate,
> > sometimes generates Join with Aggregate. Then we use SemiJoinRule to match
> > Join(Rel, Agg) pattern to see whether we can transform it to semijoin or
> > not. This is counter intuitive, because we already know it will be a
> > SemiJoin or AntiSemi when we generate Correlate for the subquery. In
> > current way, we may miss the chance to get back SemiJoin for it, and we
> > almost lose the opportunity to do IndexScan if there is index available.
> >
> > In addition, I see 2 rules JoinAddRedundantSemiJoinRule and
> > SemiJoinRemoveRule seem to provide the chance to do index scan. But through
> > the test cases related with the 2 rules, I don't see any chance for a
> > better plan with them and they don't seem to be able to generate plan with
> > index scan. I doubt whether they are used at all. Can someone shed some
> > light on this and give us some examples, if I get it wrong?
> >
> > The transformation flow for boolean context sub-query in my head is:
> > sub-query -> Correate(Semi/AntiSemi) without Aggregate (SubqueryRemoveRule)
> > Correlate(Semi/AntiSemi) -> NestedLoopJoin (correlate implementation rule)
> > Correlate(Semi/AntiSemi) -> IndexedNestedLoopJoin (index apply rule)
> > Correlate(Semi/AntiSemi) -> SemiJoin/AntiSemiJoin (Semi Correlate to Semi
> > Join rule)
> > SemiJoin -> InnerJoin(rel, Agg) (SemiJoin to Join rule)
> > InnerJoin(rel, Agg) -> InnerJoin(Agg, rel) (Join reorder rule)
> > SemiJoin/InnerJoin -> HashJoin/MergeJoin/NLJ (Join implementation rule)
> >
> > This is a long shot, and will involve tons of changes.
> > Any thoughts?
> >
> > - Haisheng Yuan

Re: Correlate Join SemiJoin transformation

Posted by Chunwei Lei <ch...@gmail.com>.
Thanks for raising this, Haisheng.

Do you mean that we should have better subquery unnesting?




Best,
Chunwei


On Tue, Jun 4, 2019 at 8:27 AM Haisheng Yuan <h....@alibaba-inc.com> wrote:

> Hi,
>
> Since we have commited CALCITE-2969, and in the next release of 1.21, we
> will deprecate EquiJoin and make enumerable hash join support non-equi join
> condition, the optimization door is open to us.
>
> Currently Calcite often generates complicated and inefficient plan for
> boolean context correlated subquery (putting value context subquery aside,
> which is more complex during unnesting). e.g.
> https://issues.apache.org/jira/browse/CALCITE-2857
> https://issues.apache.org/jira/browse/CALCITE-2874
>
> There are many more cases if you want to try and find out. Some underlying
> reasons are:
> 1. SemJoin doesn't support non-equi condition
> 2. Correlate comes with Aggregate in many cases
>
> Sometimes, SubqueryRemoveRule generates Correlate with Aggregate,
> sometimes generates Join with Aggregate. Then we use SemiJoinRule to match
> Join(Rel, Agg) pattern to see whether we can transform it to semijoin or
> not. This is counter intuitive, because we already know it will be a
> SemiJoin or AntiSemi when we generate Correlate for the subquery. In
> current way, we may miss the chance to get back SemiJoin for it, and we
> almost lose the opportunity to do IndexScan if there is index available.
>
> In addition, I see 2 rules JoinAddRedundantSemiJoinRule and
> SemiJoinRemoveRule seem to provide the chance to do index scan. But through
> the test cases related with the 2 rules, I don't see any chance for a
> better plan with them and they don't seem to be able to generate plan with
> index scan. I doubt whether they are used at all. Can someone shed some
> light on this and give us some examples, if I get it wrong?
>
> The transformation flow for boolean context sub-query in my head is:
> sub-query -> Correate(Semi/AntiSemi) without Aggregate (SubqueryRemoveRule)
> Correlate(Semi/AntiSemi) -> NestedLoopJoin (correlate implementation rule)
> Correlate(Semi/AntiSemi) -> IndexedNestedLoopJoin (index apply rule)
> Correlate(Semi/AntiSemi) -> SemiJoin/AntiSemiJoin (Semi Correlate to Semi
> Join rule)
> SemiJoin -> InnerJoin(rel, Agg) (SemiJoin to Join rule)
> InnerJoin(rel, Agg) ->  InnerJoin(Agg, rel) (Join reorder rule)
> SemiJoin/InnerJoin -> HashJoin/MergeJoin/NLJ (Join implementation rule)
>
> This is a long shot, and will involve tons of changes.
> Any thoughts?
>
> - Haisheng Yuan