You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Priyendra Deshwal <pr...@gmail.com> on 2021/03/01 05:44:35 UTC

Nested loop joins

Hello friends,

I am playing around with TPC-DS schema and playing with the following
simplified variant of query41.

select  distinct(i_product_name)
 from item i1
 where i_manufact_id between 738 and 738+40
   and (select count(*) as item_cnt
        from item
        where (i_manufact = i1.i_manufact and i_category = 'Women') or
                   (i_manufact = i1.i_manufact and i_category = 'Men')) > 0
 order by i_product_name
 limit 100

This results in the following optimized plan. Note that the join condition
(i_manufact = i1.i_manufact) is not clearly expressed in this query since
it is repeated in both OR clauses of the inner query. This results in a
nested loop join and even the filter on i_category is not pushed all the
way down to the query.

EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
2.929266516286664E9, id = 408
  EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
cost = 2.929266416286664E9, id = 406
    EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative cost
= 2.874591416286664E9, id = 404
      EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
$t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
$condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
2.819916416286664E9, id = 410
        EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
          EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 386
            BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
          EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount =
810000.0, cumulative cost = 8.193105E7, id = 398
            EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
8100000.0, cumulative cost = 8.10198E7, id = 396
              EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 389
                BindableTableScan(table=[[default, ITEM]], projects=[[12,
14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
              EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
cumulative cost = 10800.0, id = 394
                EnumerableInterpreter: rowcount = 18000.0, cumulative cost
= 9000.0, id = 392
                  BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
rowcount = 18000.0, cumulative cost = 11.25, id = 364

A simple rewrite of the query as follows where we "factor" out the join
condition to the top level AND does make the plan significantly better
(hash join with complete filter push down to the source).

select  distinct(i_product_name)
 from item i1
 where i_manufact_id between 738 and 738+40
   and (select count(*) as item_cnt
        from item
        where i_manufact = i1.i_manufact and
                   (i_category = 'Women' or i_category = 'Men')) > 0
 order by i_product_name
 limit 100

EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
6523491.28666381, id = 336
  EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
cost = 6523391.28666381, id = 334
    EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative cost
= 6401891.28666381, id = 332
      EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
$t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
$condition=[$t11]): rowcount = 1215000.0, cumulative cost =
6280391.28666381, id = 338
        EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
          EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 321
            BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
          EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount =
1800.0, cumulative cost = 11025.0, id = 326
            EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 324
              BindableTableScan(table=[[default, ITEM]],
filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]): rowcount
= 18000.0, cumulative cost = 11.25, id = 302

Is there any built-in Calcite rule that I can invoke which can do this type
of factorization of common where sub-conditions?

Thanks!

Re: Nested loop joins

Posted by JiaTao Tao <ta...@gmail.com>.

The JIRA is: https://issues.apache.org/jira/browse/CALCITE-4375


Regards!

Aron Tao


JiaTao Tao <ta...@gmail.com> 于2021年3月2日周二 上午11:39写道：

> Hi
>
> I met the same problem and you can take a look at this JIRA.
> I also pull a request, but it does not get merged:
> https://github.com/apache/calcite/pull/2253
> This optimization I think is very common and useful, I'll continue this
> work soon.
>
> cc Julian
>
> Regards!
>
> Aron Tao
>
>
> Priyendra Deshwal <pr...@gmail.com> 于2021年3月1日周一 下午4:52写道：
>
>> That rule was already enabled for me. I also tried enabling a collection
>> of
>> other *_REDUCE_EXPRESSION rules but it did not help. Reading the
>> underlying
>> code, it suggests that these rules help with constant elimination but
>> perhaps do not do the kind of common expression factoring that may be
>> needed for this particular case.
>>
>>
>> On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <li...@gmail.com> wrote:
>>
>> > Hi Priyendra,
>> >
>> > We have FilterReduceExpressionsRule which reduces filter conditions.
>> > It performs the simplification based on
>> org.apache.calcite.rex.RexSimplify.
>> >
>> > Best,
>> > Liya Fan
>> >
>> > On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
>> > wrote:
>> >
>> > > Hello friends,
>> > >
>> > > I am playing around with TPC-DS schema and playing with the following
>> > > simplified variant of query41.
>> > >
>> > > select  distinct(i_product_name)
>> > >  from item i1
>> > >  where i_manufact_id between 738 and 738+40
>> > >    and (select count(*) as item_cnt
>> > >         from item
>> > >         where (i_manufact = i1.i_manufact and i_category = 'Women') or
>> > >                    (i_manufact = i1.i_manufact and i_category =
>> 'Men'))
>> > > 0
>> > >  order by i_product_name
>> > >  limit 100
>> > >
>> > > This results in the following optimized plan. Note that the join
>> > condition
>> > > (i_manufact = i1.i_manufact) is not clearly expressed in this query
>> since
>> > > it is repeated in both OR clauses of the inner query. This results in
>> a
>> > > nested loop join and even the filter on i_category is not pushed all
>> the
>> > > way down to the query.
>> > >
>> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
>> > > 2.929266516286664E9, id = 408
>> > >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7,
>> cumulative
>> > > cost = 2.929266416286664E9, id = 406
>> > >     EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
>> > cost
>> > > = 2.874591416286664E9, id = 404
>> > >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
>> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
>> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
>> > proj#0..4=[{exprs}],
>> > > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
>> > > 2.819916416286664E9, id = 410
>> > >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
>> > > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
>> > >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
>> > > 9000.0, id = 386
>> > >             BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
>> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
>> > >           EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]):
>> rowcount
>> > =
>> > > 810000.0, cumulative cost = 8.193105E7, id = 398
>> > >             EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2),
>> =($0,
>> > > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]):
>> rowcount =
>> > > 8100000.0, cumulative cost = 8.10198E7, id = 396
>> > >               EnumerableInterpreter: rowcount = 18000.0, cumulative
>> cost
>> > =
>> > > 9000.0, id = 389
>> > >                 BindableTableScan(table=[[default, ITEM]],
>> projects=[[12,
>> > > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id =
>> 231
>> > >               EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
>> > > cumulative cost = 10800.0, id = 394
>> > >                 EnumerableInterpreter: rowcount = 18000.0, cumulative
>> > cost
>> > > = 9000.0, id = 392
>> > >                   BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
>> > > rowcount = 18000.0, cumulative cost = 11.25, id = 364
>> > >
>> > > A simple rewrite of the query as follows where we "factor" out the
>> join
>> > > condition to the top level AND does make the plan significantly better
>> > > (hash join with complete filter push down to the source).
>> > >
>> > > select  distinct(i_product_name)
>> > >  from item i1
>> > >  where i_manufact_id between 738 and 738+40
>> > >    and (select count(*) as item_cnt
>> > >         from item
>> > >         where i_manufact = i1.i_manufact and
>> > >                    (i_category = 'Women' or i_category = 'Men')) > 0
>> > >  order by i_product_name
>> > >  limit 100
>> > >
>> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
>> > > 6523491.28666381, id = 336
>> > >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0,
>> cumulative
>> > > cost = 6523391.28666381, id = 334
>> > >     EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
>> > cost
>> > > = 6401891.28666381, id = 332
>> > >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
>> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
>> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
>> > proj#0..4=[{exprs}],
>> > > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
>> > > 6280391.28666381, id = 338
>> > >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
>> > > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
>> > >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
>> > > 9000.0, id = 321
>> > >             BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
>> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
>> > >           EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]):
>> rowcount
>> > =
>> > > 1800.0, cumulative cost = 11025.0, id = 326
>> > >             EnumerableInterpreter: rowcount = 18000.0, cumulative
>> cost =
>> > > 9000.0, id = 324
>> > >               BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
>> > rowcount
>> > > = 18000.0, cumulative cost = 11.25, id = 302
>> > >
>> > > Is there any built-in Calcite rule that I can invoke which can do this
>> > type
>> > > of factorization of common where sub-conditions?
>> > >
>> > > Thanks!
>> > >
>> >
>>
>

Re: Nested loop joins

Posted by JiaTao Tao <ta...@gmail.com>.

In fact, it's a very common optimization that already implemented by other
engines like Spark.


Regards!

Aron Tao


JiaTao Tao <ta...@gmail.com> 于2021年3月2日周二 上午11:39写道：

> Hi
>
> I met the same problem and you can take a look at this JIRA.
> I also pull a request, but it does not get merged:
> https://github.com/apache/calcite/pull/2253
> This optimization I think is very common and useful, I'll continue this
> work soon.
>
> cc Julian
>
> Regards!
>
> Aron Tao
>
>
> Priyendra Deshwal <pr...@gmail.com> 于2021年3月1日周一 下午4:52写道：
>
>> That rule was already enabled for me. I also tried enabling a collection
>> of
>> other *_REDUCE_EXPRESSION rules but it did not help. Reading the
>> underlying
>> code, it suggests that these rules help with constant elimination but
>> perhaps do not do the kind of common expression factoring that may be
>> needed for this particular case.
>>
>>
>> On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <li...@gmail.com> wrote:
>>
>> > Hi Priyendra,
>> >
>> > We have FilterReduceExpressionsRule which reduces filter conditions.
>> > It performs the simplification based on
>> org.apache.calcite.rex.RexSimplify.
>> >
>> > Best,
>> > Liya Fan
>> >
>> > On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
>> > wrote:
>> >
>> > > Hello friends,
>> > >
>> > > I am playing around with TPC-DS schema and playing with the following
>> > > simplified variant of query41.
>> > >
>> > > select  distinct(i_product_name)
>> > >  from item i1
>> > >  where i_manufact_id between 738 and 738+40
>> > >    and (select count(*) as item_cnt
>> > >         from item
>> > >         where (i_manufact = i1.i_manufact and i_category = 'Women') or
>> > >                    (i_manufact = i1.i_manufact and i_category =
>> 'Men'))
>> > > 0
>> > >  order by i_product_name
>> > >  limit 100
>> > >
>> > > This results in the following optimized plan. Note that the join
>> > condition
>> > > (i_manufact = i1.i_manufact) is not clearly expressed in this query
>> since
>> > > it is repeated in both OR clauses of the inner query. This results in
>> a
>> > > nested loop join and even the filter on i_category is not pushed all
>> the
>> > > way down to the query.
>> > >
>> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
>> > > 2.929266516286664E9, id = 408
>> > >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7,
>> cumulative
>> > > cost = 2.929266416286664E9, id = 406
>> > >     EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
>> > cost
>> > > = 2.874591416286664E9, id = 404
>> > >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
>> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
>> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
>> > proj#0..4=[{exprs}],
>> > > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
>> > > 2.819916416286664E9, id = 410
>> > >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
>> > > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
>> > >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
>> > > 9000.0, id = 386
>> > >             BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
>> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
>> > >           EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]):
>> rowcount
>> > =
>> > > 810000.0, cumulative cost = 8.193105E7, id = 398
>> > >             EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2),
>> =($0,
>> > > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]):
>> rowcount =
>> > > 8100000.0, cumulative cost = 8.10198E7, id = 396
>> > >               EnumerableInterpreter: rowcount = 18000.0, cumulative
>> cost
>> > =
>> > > 9000.0, id = 389
>> > >                 BindableTableScan(table=[[default, ITEM]],
>> projects=[[12,
>> > > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id =
>> 231
>> > >               EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
>> > > cumulative cost = 10800.0, id = 394
>> > >                 EnumerableInterpreter: rowcount = 18000.0, cumulative
>> > cost
>> > > = 9000.0, id = 392
>> > >                   BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
>> > > rowcount = 18000.0, cumulative cost = 11.25, id = 364
>> > >
>> > > A simple rewrite of the query as follows where we "factor" out the
>> join
>> > > condition to the top level AND does make the plan significantly better
>> > > (hash join with complete filter push down to the source).
>> > >
>> > > select  distinct(i_product_name)
>> > >  from item i1
>> > >  where i_manufact_id between 738 and 738+40
>> > >    and (select count(*) as item_cnt
>> > >         from item
>> > >         where i_manufact = i1.i_manufact and
>> > >                    (i_category = 'Women' or i_category = 'Men')) > 0
>> > >  order by i_product_name
>> > >  limit 100
>> > >
>> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
>> > > 6523491.28666381, id = 336
>> > >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0,
>> cumulative
>> > > cost = 6523391.28666381, id = 334
>> > >     EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
>> > cost
>> > > = 6401891.28666381, id = 332
>> > >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
>> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
>> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
>> > proj#0..4=[{exprs}],
>> > > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
>> > > 6280391.28666381, id = 338
>> > >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
>> > > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
>> > >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
>> > > 9000.0, id = 321
>> > >             BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
>> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
>> > >           EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]):
>> rowcount
>> > =
>> > > 1800.0, cumulative cost = 11025.0, id = 326
>> > >             EnumerableInterpreter: rowcount = 18000.0, cumulative
>> cost =
>> > > 9000.0, id = 324
>> > >               BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
>> > rowcount
>> > > = 18000.0, cumulative cost = 11.25, id = 302
>> > >
>> > > Is there any built-in Calcite rule that I can invoke which can do this
>> > type
>> > > of factorization of common where sub-conditions?
>> > >
>> > > Thanks!
>> > >
>> >
>>
>

Re: Nested loop joins

Posted by JiaTao Tao <ta...@gmail.com>.

Hi

I met the same problem and you can take a look at this JIRA.
I also pull a request, but it does not get merged:
https://github.com/apache/calcite/pull/2253
This optimization I think is very common and useful, I'll continue this
work soon.

cc Julian

Regards!

Aron Tao


Priyendra Deshwal <pr...@gmail.com> 于2021年3月1日周一 下午4:52写道：

> That rule was already enabled for me. I also tried enabling a collection of
> other *_REDUCE_EXPRESSION rules but it did not help. Reading the underlying
> code, it suggests that these rules help with constant elimination but
> perhaps do not do the kind of common expression factoring that may be
> needed for this particular case.
>
>
> On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <li...@gmail.com> wrote:
>
> > Hi Priyendra,
> >
> > We have FilterReduceExpressionsRule which reduces filter conditions.
> > It performs the simplification based on
> org.apache.calcite.rex.RexSimplify.
> >
> > Best,
> > Liya Fan
> >
> > On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
> > wrote:
> >
> > > Hello friends,
> > >
> > > I am playing around with TPC-DS schema and playing with the following
> > > simplified variant of query41.
> > >
> > > select  distinct(i_product_name)
> > >  from item i1
> > >  where i_manufact_id between 738 and 738+40
> > >    and (select count(*) as item_cnt
> > >         from item
> > >         where (i_manufact = i1.i_manufact and i_category = 'Women') or
> > >                    (i_manufact = i1.i_manufact and i_category = 'Men'))
> > > 0
> > >  order by i_product_name
> > >  limit 100
> > >
> > > This results in the following optimized plan. Note that the join
> > condition
> > > (i_manufact = i1.i_manufact) is not clearly expressed in this query
> since
> > > it is repeated in both OR clauses of the inner query. This results in a
> > > nested loop join and even the filter on i_category is not pushed all
> the
> > > way down to the query.
> > >
> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > > 2.929266516286664E9, id = 408
> > >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7,
> cumulative
> > > cost = 2.929266416286664E9, id = 406
> > >     EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
> > cost
> > > = 2.874591416286664E9, id = 404
> > >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> > proj#0..4=[{exprs}],
> > > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
> > > 2.819916416286664E9, id = 410
> > >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
> > >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > > 9000.0, id = 386
> > >             BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
> > >           EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]):
> rowcount
> > =
> > > 810000.0, cumulative cost = 8.193105E7, id = 398
> > >             EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
> > > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount
> =
> > > 8100000.0, cumulative cost = 8.10198E7, id = 396
> > >               EnumerableInterpreter: rowcount = 18000.0, cumulative
> cost
> > =
> > > 9000.0, id = 389
> > >                 BindableTableScan(table=[[default, ITEM]],
> projects=[[12,
> > > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id =
> 231
> > >               EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
> > > cumulative cost = 10800.0, id = 394
> > >                 EnumerableInterpreter: rowcount = 18000.0, cumulative
> > cost
> > > = 9000.0, id = 392
> > >                   BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
> > > rowcount = 18000.0, cumulative cost = 11.25, id = 364
> > >
> > > A simple rewrite of the query as follows where we "factor" out the join
> > > condition to the top level AND does make the plan significantly better
> > > (hash join with complete filter push down to the source).
> > >
> > > select  distinct(i_product_name)
> > >  from item i1
> > >  where i_manufact_id between 738 and 738+40
> > >    and (select count(*) as item_cnt
> > >         from item
> > >         where i_manufact = i1.i_manufact and
> > >                    (i_category = 'Women' or i_category = 'Men')) > 0
> > >  order by i_product_name
> > >  limit 100
> > >
> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > > 6523491.28666381, id = 336
> > >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0,
> cumulative
> > > cost = 6523391.28666381, id = 334
> > >     EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
> > cost
> > > = 6401891.28666381, id = 332
> > >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> > proj#0..4=[{exprs}],
> > > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
> > > 6280391.28666381, id = 338
> > >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
> > >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > > 9000.0, id = 321
> > >             BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
> > >           EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]):
> rowcount
> > =
> > > 1800.0, cumulative cost = 11025.0, id = 326
> > >             EnumerableInterpreter: rowcount = 18000.0, cumulative cost
> =
> > > 9000.0, id = 324
> > >               BindableTableScan(table=[[default, ITEM]],
> > > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
> > rowcount
> > > = 18000.0, cumulative cost = 11.25, id = 302
> > >
> > > Is there any built-in Calcite rule that I can invoke which can do this
> > type
> > > of factorization of common where sub-conditions?
> > >
> > > Thanks!
> > >
> >
>

Re: Nested loop joins

Posted by Priyendra Deshwal <pr...@gmail.com>.

That rule was already enabled for me. I also tried enabling a collection of
other *_REDUCE_EXPRESSION rules but it did not help. Reading the underlying
code, it suggests that these rules help with constant elimination but
perhaps do not do the kind of common expression factoring that may be
needed for this particular case.


On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <li...@gmail.com> wrote:

> Hi Priyendra,
>
> We have FilterReduceExpressionsRule which reduces filter conditions.
> It performs the simplification based on org.apache.calcite.rex.RexSimplify.
>
> Best,
> Liya Fan
>
> On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
> wrote:
>
> > Hello friends,
> >
> > I am playing around with TPC-DS schema and playing with the following
> > simplified variant of query41.
> >
> > select  distinct(i_product_name)
> >  from item i1
> >  where i_manufact_id between 738 and 738+40
> >    and (select count(*) as item_cnt
> >         from item
> >         where (i_manufact = i1.i_manufact and i_category = 'Women') or
> >                    (i_manufact = i1.i_manufact and i_category = 'Men'))
> > 0
> >  order by i_product_name
> >  limit 100
> >
> > This results in the following optimized plan. Note that the join
> condition
> > (i_manufact = i1.i_manufact) is not clearly expressed in this query since
> > it is repeated in both OR clauses of the inner query. This results in a
> > nested loop join and even the filter on i_category is not pushed all the
> > way down to the query.
> >
> > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > 2.929266516286664E9, id = 408
> >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
> > cost = 2.929266416286664E9, id = 406
> >     EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
> cost
> > = 2.874591416286664E9, id = 404
> >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> proj#0..4=[{exprs}],
> > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
> > 2.819916416286664E9, id = 410
> >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
> >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > 9000.0, id = 386
> >             BindableTableScan(table=[[default, ITEM]],
> > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
> >           EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount
> =
> > 810000.0, cumulative cost = 8.193105E7, id = 398
> >             EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
> > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
> > 8100000.0, cumulative cost = 8.10198E7, id = 396
> >               EnumerableInterpreter: rowcount = 18000.0, cumulative cost
> =
> > 9000.0, id = 389
> >                 BindableTableScan(table=[[default, ITEM]], projects=[[12,
> > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
> >               EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
> > cumulative cost = 10800.0, id = 394
> >                 EnumerableInterpreter: rowcount = 18000.0, cumulative
> cost
> > = 9000.0, id = 392
> >                   BindableTableScan(table=[[default, ITEM]],
> > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
> > rowcount = 18000.0, cumulative cost = 11.25, id = 364
> >
> > A simple rewrite of the query as follows where we "factor" out the join
> > condition to the top level AND does make the plan significantly better
> > (hash join with complete filter push down to the source).
> >
> > select  distinct(i_product_name)
> >  from item i1
> >  where i_manufact_id between 738 and 738+40
> >    and (select count(*) as item_cnt
> >         from item
> >         where i_manufact = i1.i_manufact and
> >                    (i_category = 'Women' or i_category = 'Men')) > 0
> >  order by i_product_name
> >  limit 100
> >
> > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > 6523491.28666381, id = 336
> >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
> > cost = 6523391.28666381, id = 334
> >     EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
> cost
> > = 6401891.28666381, id = 332
> >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> proj#0..4=[{exprs}],
> > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
> > 6280391.28666381, id = 338
> >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
> >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > 9000.0, id = 321
> >             BindableTableScan(table=[[default, ITEM]],
> > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
> >           EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount
> =
> > 1800.0, cumulative cost = 11025.0, id = 326
> >             EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > 9000.0, id = 324
> >               BindableTableScan(table=[[default, ITEM]],
> > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
> rowcount
> > = 18000.0, cumulative cost = 11.25, id = 302
> >
> > Is there any built-in Calcite rule that I can invoke which can do this
> type
> > of factorization of common where sub-conditions?
> >
> > Thanks!
> >
>

Re: Nested loop joins

Posted by Fan Liya <li...@gmail.com>.

Hi Priyendra,

We have FilterReduceExpressionsRule which reduces filter conditions.
It performs the simplification based on org.apache.calcite.rex.RexSimplify.

Best,
Liya Fan

On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
wrote:

> Hello friends,
>
> I am playing around with TPC-DS schema and playing with the following
> simplified variant of query41.
>
> select  distinct(i_product_name)
>  from item i1
>  where i_manufact_id between 738 and 738+40
>    and (select count(*) as item_cnt
>         from item
>         where (i_manufact = i1.i_manufact and i_category = 'Women') or
>                    (i_manufact = i1.i_manufact and i_category = 'Men')) > 0
>  order by i_product_name
>  limit 100
>
> This results in the following optimized plan. Note that the join condition
> (i_manufact = i1.i_manufact) is not clearly expressed in this query since
> it is repeated in both OR clauses of the inner query. This results in a
> nested loop join and even the filter on i_category is not pushed all the
> way down to the query.
>
> EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> 2.929266516286664E9, id = 408
>   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
> cost = 2.929266416286664E9, id = 406
>     EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative cost
> = 2.874591416286664E9, id = 404
>       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
> $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
> 2.819916416286664E9, id = 410
>         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
>           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 386
>             BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
>           EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount =
> 810000.0, cumulative cost = 8.193105E7, id = 398
>             EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
> 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
> 8100000.0, cumulative cost = 8.10198E7, id = 396
>               EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 389
>                 BindableTableScan(table=[[default, ITEM]], projects=[[12,
> 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
>               EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
> cumulative cost = 10800.0, id = 394
>                 EnumerableInterpreter: rowcount = 18000.0, cumulative cost
> = 9000.0, id = 392
>                   BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
> rowcount = 18000.0, cumulative cost = 11.25, id = 364
>
> A simple rewrite of the query as follows where we "factor" out the join
> condition to the top level AND does make the plan significantly better
> (hash join with complete filter push down to the source).
>
> select  distinct(i_product_name)
>  from item i1
>  where i_manufact_id between 738 and 738+40
>    and (select count(*) as item_cnt
>         from item
>         where i_manufact = i1.i_manufact and
>                    (i_category = 'Women' or i_category = 'Men')) > 0
>  order by i_product_name
>  limit 100
>
> EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> 6523491.28666381, id = 336
>   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
> cost = 6523391.28666381, id = 334
>     EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative cost
> = 6401891.28666381, id = 332
>       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
> $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
> 6280391.28666381, id = 338
>         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
>           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 321
>             BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
>           EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount =
> 1800.0, cumulative cost = 11025.0, id = 326
>             EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 324
>               BindableTableScan(table=[[default, ITEM]],
> filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]): rowcount
> = 18000.0, cumulative cost = 11.25, id = 302
>
> Is there any built-in Calcite rule that I can invoke which can do this type
> of factorization of common where sub-conditions?
>
> Thanks!
>