You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Priyendra Deshwal <pr...@gmail.com> on 2021/03/01 05:44:35 UTC
Nested loop joins
Hello friends,
I am playing around with TPC-DS schema and playing with the following
simplified variant of query41.
select distinct(i_product_name)
from item i1
where i_manufact_id between 738 and 738+40
and (select count(*) as item_cnt
from item
where (i_manufact = i1.i_manufact and i_category = 'Women') or
(i_manufact = i1.i_manufact and i_category = 'Men')) > 0
order by i_product_name
limit 100
This results in the following optimized plan. Note that the join condition
(i_manufact = i1.i_manufact) is not clearly expressed in this query since
it is repeated in both OR clauses of the inner query. This results in a
nested loop join and even the filter on i_category is not pushed all the
way down to the query.
EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
2.929266516286664E9, id = 408
EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
cost = 2.929266416286664E9, id = 406
EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative cost
= 2.874591416286664E9, id = 404
EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
$t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
$condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
2.819916416286664E9, id = 410
EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 386
BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount =
810000.0, cumulative cost = 8.193105E7, id = 398
EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
8100000.0, cumulative cost = 8.10198E7, id = 396
EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 389
BindableTableScan(table=[[default, ITEM]], projects=[[12,
14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
cumulative cost = 10800.0, id = 394
EnumerableInterpreter: rowcount = 18000.0, cumulative cost
= 9000.0, id = 392
BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
rowcount = 18000.0, cumulative cost = 11.25, id = 364
A simple rewrite of the query as follows where we "factor" out the join
condition to the top level AND does make the plan significantly better
(hash join with complete filter push down to the source).
select distinct(i_product_name)
from item i1
where i_manufact_id between 738 and 738+40
and (select count(*) as item_cnt
from item
where i_manufact = i1.i_manufact and
(i_category = 'Women' or i_category = 'Men')) > 0
order by i_product_name
limit 100
EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
6523491.28666381, id = 336
EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
cost = 6523391.28666381, id = 334
EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative cost
= 6401891.28666381, id = 332
EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
$t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
$condition=[$t11]): rowcount = 1215000.0, cumulative cost =
6280391.28666381, id = 338
EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 321
BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount =
1800.0, cumulative cost = 11025.0, id = 326
EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 324
BindableTableScan(table=[[default, ITEM]],
filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]): rowcount
= 18000.0, cumulative cost = 11.25, id = 302
Is there any built-in Calcite rule that I can invoke which can do this type
of factorization of common where sub-conditions?
Thanks!
Re: Nested loop joins
Posted by JiaTao Tao <ta...@gmail.com>.
The JIRA is: https://issues.apache.org/jira/browse/CALCITE-4375
Regards!
Aron Tao
JiaTao Tao <ta...@gmail.com> 于2021年3月2日周二 上午11:39写道:
> Hi
>
> I met the same problem and you can take a look at this JIRA.
> I also pull a request, but it does not get merged:
> https://github.com/apache/calcite/pull/2253
> This optimization I think is very common and useful, I'll continue this
> work soon.
>
> cc Julian
>
> Regards!
>
> Aron Tao
>
>
> Priyendra Deshwal <pr...@gmail.com> 于2021年3月1日周一 下午4:52写道:
>
>> That rule was already enabled for me. I also tried enabling a collection
>> of
>> other *_REDUCE_EXPRESSION rules but it did not help. Reading the
>> underlying
>> code, it suggests that these rules help with constant elimination but
>> perhaps do not do the kind of common expression factoring that may be
>> needed for this particular case.
>>
>>
>> On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <li...@gmail.com> wrote:
>>
>> > Hi Priyendra,
>> >
>> > We have FilterReduceExpressionsRule which reduces filter conditions.
>> > It performs the simplification based on
>> org.apache.calcite.rex.RexSimplify.
>> >
>> > Best,
>> > Liya Fan
>> >
>> > On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
>> > wrote:
>> >
>> > > Hello friends,
>> > >
>> > > I am playing around with TPC-DS schema and playing with the following
>> > > simplified variant of query41.
>> > >
>> > > select distinct(i_product_name)
>> > > from item i1
>> > > where i_manufact_id between 738 and 738+40
>> > > and (select count(*) as item_cnt
>> > > from item
>> > > where (i_manufact = i1.i_manufact and i_category = 'Women') or
>> > > (i_manufact = i1.i_manufact and i_category =
>> 'Men'))
>> > > 0
>> > > order by i_product_name
>> > > limit 100
>> > >
>> > > This results in the following optimized plan. Note that the join
>> > condition
>> > > (i_manufact = i1.i_manufact) is not clearly expressed in this query
>> since
>> > > it is repeated in both OR clauses of the inner query. This results in
>> a
>> > > nested loop join and even the filter on i_category is not pushed all
>> the
>> > > way down to the query.
>> > >
>> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
>> > > 2.929266516286664E9, id = 408
>> > > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7,
>> cumulative
>> > > cost = 2.929266416286664E9, id = 406
>> > > EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
>> > cost
>> > > = 2.874591416286664E9, id = 404
>> > > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
>> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
>> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
>> > proj#0..4=[{exprs}],
>> > > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
>> > > 2.819916416286664E9, id = 410
>> > > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
>> > > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
>> > > 9000.0, id = 386
>> > > BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
>> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
>> > > EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]):
>> rowcount
>> > =
>> > > 810000.0, cumulative cost = 8.193105E7, id = 398
>> > > EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2),
>> =($0,
>> > > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]):
>> rowcount =
>> > > 8100000.0, cumulative cost = 8.10198E7, id = 396
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative
>> cost
>> > =
>> > > 9000.0, id = 389
>> > > BindableTableScan(table=[[default, ITEM]],
>> projects=[[12,
>> > > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id =
>> 231
>> > > EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
>> > > cumulative cost = 10800.0, id = 394
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative
>> > cost
>> > > = 9000.0, id = 392
>> > > BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
>> > > rowcount = 18000.0, cumulative cost = 11.25, id = 364
>> > >
>> > > A simple rewrite of the query as follows where we "factor" out the
>> join
>> > > condition to the top level AND does make the plan significantly better
>> > > (hash join with complete filter push down to the source).
>> > >
>> > > select distinct(i_product_name)
>> > > from item i1
>> > > where i_manufact_id between 738 and 738+40
>> > > and (select count(*) as item_cnt
>> > > from item
>> > > where i_manufact = i1.i_manufact and
>> > > (i_category = 'Women' or i_category = 'Men')) > 0
>> > > order by i_product_name
>> > > limit 100
>> > >
>> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
>> > > 6523491.28666381, id = 336
>> > > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0,
>> cumulative
>> > > cost = 6523391.28666381, id = 334
>> > > EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
>> > cost
>> > > = 6401891.28666381, id = 332
>> > > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
>> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
>> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
>> > proj#0..4=[{exprs}],
>> > > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
>> > > 6280391.28666381, id = 338
>> > > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
>> > > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
>> > > 9000.0, id = 321
>> > > BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
>> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
>> > > EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]):
>> rowcount
>> > =
>> > > 1800.0, cumulative cost = 11025.0, id = 326
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative
>> cost =
>> > > 9000.0, id = 324
>> > > BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
>> > rowcount
>> > > = 18000.0, cumulative cost = 11.25, id = 302
>> > >
>> > > Is there any built-in Calcite rule that I can invoke which can do this
>> > type
>> > > of factorization of common where sub-conditions?
>> > >
>> > > Thanks!
>> > >
>> >
>>
>
Re: Nested loop joins
Posted by JiaTao Tao <ta...@gmail.com>.
In fact, it's a very common optimization that already implemented by other
engines like Spark.
Regards!
Aron Tao
JiaTao Tao <ta...@gmail.com> 于2021年3月2日周二 上午11:39写道:
> Hi
>
> I met the same problem and you can take a look at this JIRA.
> I also pull a request, but it does not get merged:
> https://github.com/apache/calcite/pull/2253
> This optimization I think is very common and useful, I'll continue this
> work soon.
>
> cc Julian
>
> Regards!
>
> Aron Tao
>
>
> Priyendra Deshwal <pr...@gmail.com> 于2021年3月1日周一 下午4:52写道:
>
>> That rule was already enabled for me. I also tried enabling a collection
>> of
>> other *_REDUCE_EXPRESSION rules but it did not help. Reading the
>> underlying
>> code, it suggests that these rules help with constant elimination but
>> perhaps do not do the kind of common expression factoring that may be
>> needed for this particular case.
>>
>>
>> On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <li...@gmail.com> wrote:
>>
>> > Hi Priyendra,
>> >
>> > We have FilterReduceExpressionsRule which reduces filter conditions.
>> > It performs the simplification based on
>> org.apache.calcite.rex.RexSimplify.
>> >
>> > Best,
>> > Liya Fan
>> >
>> > On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
>> > wrote:
>> >
>> > > Hello friends,
>> > >
>> > > I am playing around with TPC-DS schema and playing with the following
>> > > simplified variant of query41.
>> > >
>> > > select distinct(i_product_name)
>> > > from item i1
>> > > where i_manufact_id between 738 and 738+40
>> > > and (select count(*) as item_cnt
>> > > from item
>> > > where (i_manufact = i1.i_manufact and i_category = 'Women') or
>> > > (i_manufact = i1.i_manufact and i_category =
>> 'Men'))
>> > > 0
>> > > order by i_product_name
>> > > limit 100
>> > >
>> > > This results in the following optimized plan. Note that the join
>> > condition
>> > > (i_manufact = i1.i_manufact) is not clearly expressed in this query
>> since
>> > > it is repeated in both OR clauses of the inner query. This results in
>> a
>> > > nested loop join and even the filter on i_category is not pushed all
>> the
>> > > way down to the query.
>> > >
>> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
>> > > 2.929266516286664E9, id = 408
>> > > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7,
>> cumulative
>> > > cost = 2.929266416286664E9, id = 406
>> > > EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
>> > cost
>> > > = 2.874591416286664E9, id = 404
>> > > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
>> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
>> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
>> > proj#0..4=[{exprs}],
>> > > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
>> > > 2.819916416286664E9, id = 410
>> > > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
>> > > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
>> > > 9000.0, id = 386
>> > > BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
>> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
>> > > EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]):
>> rowcount
>> > =
>> > > 810000.0, cumulative cost = 8.193105E7, id = 398
>> > > EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2),
>> =($0,
>> > > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]):
>> rowcount =
>> > > 8100000.0, cumulative cost = 8.10198E7, id = 396
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative
>> cost
>> > =
>> > > 9000.0, id = 389
>> > > BindableTableScan(table=[[default, ITEM]],
>> projects=[[12,
>> > > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id =
>> 231
>> > > EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
>> > > cumulative cost = 10800.0, id = 394
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative
>> > cost
>> > > = 9000.0, id = 392
>> > > BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
>> > > rowcount = 18000.0, cumulative cost = 11.25, id = 364
>> > >
>> > > A simple rewrite of the query as follows where we "factor" out the
>> join
>> > > condition to the top level AND does make the plan significantly better
>> > > (hash join with complete filter push down to the source).
>> > >
>> > > select distinct(i_product_name)
>> > > from item i1
>> > > where i_manufact_id between 738 and 738+40
>> > > and (select count(*) as item_cnt
>> > > from item
>> > > where i_manufact = i1.i_manufact and
>> > > (i_category = 'Women' or i_category = 'Men')) > 0
>> > > order by i_product_name
>> > > limit 100
>> > >
>> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
>> > > 6523491.28666381, id = 336
>> > > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0,
>> cumulative
>> > > cost = 6523391.28666381, id = 334
>> > > EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
>> > cost
>> > > = 6401891.28666381, id = 332
>> > > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
>> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
>> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
>> > proj#0..4=[{exprs}],
>> > > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
>> > > 6280391.28666381, id = 338
>> > > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
>> > > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
>> > > 9000.0, id = 321
>> > > BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
>> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
>> > > EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]):
>> rowcount
>> > =
>> > > 1800.0, cumulative cost = 11025.0, id = 326
>> > > EnumerableInterpreter: rowcount = 18000.0, cumulative
>> cost =
>> > > 9000.0, id = 324
>> > > BindableTableScan(table=[[default, ITEM]],
>> > > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
>> > rowcount
>> > > = 18000.0, cumulative cost = 11.25, id = 302
>> > >
>> > > Is there any built-in Calcite rule that I can invoke which can do this
>> > type
>> > > of factorization of common where sub-conditions?
>> > >
>> > > Thanks!
>> > >
>> >
>>
>
Re: Nested loop joins
Posted by JiaTao Tao <ta...@gmail.com>.
Hi
I met the same problem and you can take a look at this JIRA.
I also pull a request, but it does not get merged:
https://github.com/apache/calcite/pull/2253
This optimization I think is very common and useful, I'll continue this
work soon.
cc Julian
Regards!
Aron Tao
Priyendra Deshwal <pr...@gmail.com> 于2021年3月1日周一 下午4:52写道:
> That rule was already enabled for me. I also tried enabling a collection of
> other *_REDUCE_EXPRESSION rules but it did not help. Reading the underlying
> code, it suggests that these rules help with constant elimination but
> perhaps do not do the kind of common expression factoring that may be
> needed for this particular case.
>
>
> On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <li...@gmail.com> wrote:
>
> > Hi Priyendra,
> >
> > We have FilterReduceExpressionsRule which reduces filter conditions.
> > It performs the simplification based on
> org.apache.calcite.rex.RexSimplify.
> >
> > Best,
> > Liya Fan
> >
> > On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
> > wrote:
> >
> > > Hello friends,
> > >
> > > I am playing around with TPC-DS schema and playing with the following
> > > simplified variant of query41.
> > >
> > > select distinct(i_product_name)
> > > from item i1
> > > where i_manufact_id between 738 and 738+40
> > > and (select count(*) as item_cnt
> > > from item
> > > where (i_manufact = i1.i_manufact and i_category = 'Women') or
> > > (i_manufact = i1.i_manufact and i_category = 'Men'))
> > > 0
> > > order by i_product_name
> > > limit 100
> > >
> > > This results in the following optimized plan. Note that the join
> > condition
> > > (i_manufact = i1.i_manufact) is not clearly expressed in this query
> since
> > > it is repeated in both OR clauses of the inner query. This results in a
> > > nested loop join and even the filter on i_category is not pushed all
> the
> > > way down to the query.
> > >
> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > > 2.929266516286664E9, id = 408
> > > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7,
> cumulative
> > > cost = 2.929266416286664E9, id = 406
> > > EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
> > cost
> > > = 2.874591416286664E9, id = 404
> > > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> > proj#0..4=[{exprs}],
> > > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
> > > 2.819916416286664E9, id = 410
> > > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
> > > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > > 9000.0, id = 386
> > > BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
> > > EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]):
> rowcount
> > =
> > > 810000.0, cumulative cost = 8.193105E7, id = 398
> > > EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
> > > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount
> =
> > > 8100000.0, cumulative cost = 8.10198E7, id = 396
> > > EnumerableInterpreter: rowcount = 18000.0, cumulative
> cost
> > =
> > > 9000.0, id = 389
> > > BindableTableScan(table=[[default, ITEM]],
> projects=[[12,
> > > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id =
> 231
> > > EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
> > > cumulative cost = 10800.0, id = 394
> > > EnumerableInterpreter: rowcount = 18000.0, cumulative
> > cost
> > > = 9000.0, id = 392
> > > BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
> > > rowcount = 18000.0, cumulative cost = 11.25, id = 364
> > >
> > > A simple rewrite of the query as follows where we "factor" out the join
> > > condition to the top level AND does make the plan significantly better
> > > (hash join with complete filter push down to the source).
> > >
> > > select distinct(i_product_name)
> > > from item i1
> > > where i_manufact_id between 738 and 738+40
> > > and (select count(*) as item_cnt
> > > from item
> > > where i_manufact = i1.i_manufact and
> > > (i_category = 'Women' or i_category = 'Men')) > 0
> > > order by i_product_name
> > > limit 100
> > >
> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > > 6523491.28666381, id = 336
> > > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0,
> cumulative
> > > cost = 6523391.28666381, id = 334
> > > EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
> > cost
> > > = 6401891.28666381, id = 332
> > > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> > proj#0..4=[{exprs}],
> > > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
> > > 6280391.28666381, id = 338
> > > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
> > > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > > 9000.0, id = 321
> > > BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
> > > EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]):
> rowcount
> > =
> > > 1800.0, cumulative cost = 11025.0, id = 326
> > > EnumerableInterpreter: rowcount = 18000.0, cumulative cost
> =
> > > 9000.0, id = 324
> > > BindableTableScan(table=[[default, ITEM]],
> > > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
> > rowcount
> > > = 18000.0, cumulative cost = 11.25, id = 302
> > >
> > > Is there any built-in Calcite rule that I can invoke which can do this
> > type
> > > of factorization of common where sub-conditions?
> > >
> > > Thanks!
> > >
> >
>
Re: Nested loop joins
Posted by Priyendra Deshwal <pr...@gmail.com>.
That rule was already enabled for me. I also tried enabling a collection of
other *_REDUCE_EXPRESSION rules but it did not help. Reading the underlying
code, it suggests that these rules help with constant elimination but
perhaps do not do the kind of common expression factoring that may be
needed for this particular case.
On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <li...@gmail.com> wrote:
> Hi Priyendra,
>
> We have FilterReduceExpressionsRule which reduces filter conditions.
> It performs the simplification based on org.apache.calcite.rex.RexSimplify.
>
> Best,
> Liya Fan
>
> On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
> wrote:
>
> > Hello friends,
> >
> > I am playing around with TPC-DS schema and playing with the following
> > simplified variant of query41.
> >
> > select distinct(i_product_name)
> > from item i1
> > where i_manufact_id between 738 and 738+40
> > and (select count(*) as item_cnt
> > from item
> > where (i_manufact = i1.i_manufact and i_category = 'Women') or
> > (i_manufact = i1.i_manufact and i_category = 'Men'))
> > 0
> > order by i_product_name
> > limit 100
> >
> > This results in the following optimized plan. Note that the join
> condition
> > (i_manufact = i1.i_manufact) is not clearly expressed in this query since
> > it is repeated in both OR clauses of the inner query. This results in a
> > nested loop join and even the filter on i_category is not pushed all the
> > way down to the query.
> >
> > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > 2.929266516286664E9, id = 408
> > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
> > cost = 2.929266416286664E9, id = 406
> > EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
> cost
> > = 2.874591416286664E9, id = 404
> > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> proj#0..4=[{exprs}],
> > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
> > 2.819916416286664E9, id = 410
> > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
> > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > 9000.0, id = 386
> > BindableTableScan(table=[[default, ITEM]],
> > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
> > EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount
> =
> > 810000.0, cumulative cost = 8.193105E7, id = 398
> > EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
> > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
> > 8100000.0, cumulative cost = 8.10198E7, id = 396
> > EnumerableInterpreter: rowcount = 18000.0, cumulative cost
> =
> > 9000.0, id = 389
> > BindableTableScan(table=[[default, ITEM]], projects=[[12,
> > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
> > EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
> > cumulative cost = 10800.0, id = 394
> > EnumerableInterpreter: rowcount = 18000.0, cumulative
> cost
> > = 9000.0, id = 392
> > BindableTableScan(table=[[default, ITEM]],
> > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
> > rowcount = 18000.0, cumulative cost = 11.25, id = 364
> >
> > A simple rewrite of the query as follows where we "factor" out the join
> > condition to the top level AND does make the plan significantly better
> > (hash join with complete filter push down to the source).
> >
> > select distinct(i_product_name)
> > from item i1
> > where i_manufact_id between 738 and 738+40
> > and (select count(*) as item_cnt
> > from item
> > where i_manufact = i1.i_manufact and
> > (i_category = 'Women' or i_category = 'Men')) > 0
> > order by i_product_name
> > limit 100
> >
> > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > 6523491.28666381, id = 336
> > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
> > cost = 6523391.28666381, id = 334
> > EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
> cost
> > = 6401891.28666381, id = 332
> > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> proj#0..4=[{exprs}],
> > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
> > 6280391.28666381, id = 338
> > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
> > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > 9000.0, id = 321
> > BindableTableScan(table=[[default, ITEM]],
> > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
> > EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount
> =
> > 1800.0, cumulative cost = 11025.0, id = 326
> > EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > 9000.0, id = 324
> > BindableTableScan(table=[[default, ITEM]],
> > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
> rowcount
> > = 18000.0, cumulative cost = 11.25, id = 302
> >
> > Is there any built-in Calcite rule that I can invoke which can do this
> type
> > of factorization of common where sub-conditions?
> >
> > Thanks!
> >
>
Re: Nested loop joins
Posted by Fan Liya <li...@gmail.com>.
Hi Priyendra,
We have FilterReduceExpressionsRule which reduces filter conditions.
It performs the simplification based on org.apache.calcite.rex.RexSimplify.
Best,
Liya Fan
On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <pr...@gmail.com>
wrote:
> Hello friends,
>
> I am playing around with TPC-DS schema and playing with the following
> simplified variant of query41.
>
> select distinct(i_product_name)
> from item i1
> where i_manufact_id between 738 and 738+40
> and (select count(*) as item_cnt
> from item
> where (i_manufact = i1.i_manufact and i_category = 'Women') or
> (i_manufact = i1.i_manufact and i_category = 'Men')) > 0
> order by i_product_name
> limit 100
>
> This results in the following optimized plan. Note that the join condition
> (i_manufact = i1.i_manufact) is not clearly expressed in this query since
> it is repeated in both OR clauses of the inner query. This results in a
> nested loop join and even the filter on i_category is not pushed all the
> way down to the query.
>
> EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> 2.929266516286664E9, id = 408
> EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
> cost = 2.929266416286664E9, id = 406
> EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative cost
> = 2.874591416286664E9, id = 404
> EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
> $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
> 2.819916416286664E9, id = 410
> EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
> EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 386
> BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
> EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount =
> 810000.0, cumulative cost = 8.193105E7, id = 398
> EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
> 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
> 8100000.0, cumulative cost = 8.10198E7, id = 396
> EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 389
> BindableTableScan(table=[[default, ITEM]], projects=[[12,
> 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
> EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
> cumulative cost = 10800.0, id = 394
> EnumerableInterpreter: rowcount = 18000.0, cumulative cost
> = 9000.0, id = 392
> BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
> rowcount = 18000.0, cumulative cost = 11.25, id = 364
>
> A simple rewrite of the query as follows where we "factor" out the join
> condition to the top level AND does make the plan significantly better
> (hash join with complete filter push down to the source).
>
> select distinct(i_product_name)
> from item i1
> where i_manufact_id between 738 and 738+40
> and (select count(*) as item_cnt
> from item
> where i_manufact = i1.i_manufact and
> (i_category = 'Women' or i_category = 'Men')) > 0
> order by i_product_name
> limit 100
>
> EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> 6523491.28666381, id = 336
> EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
> cost = 6523391.28666381, id = 334
> EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative cost
> = 6401891.28666381, id = 332
> EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
> $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
> 6280391.28666381, id = 338
> EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
> EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 321
> BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
> EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount =
> 1800.0, cumulative cost = 11025.0, id = 326
> EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 324
> BindableTableScan(table=[[default, ITEM]],
> filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]): rowcount
> = 18000.0, cumulative cost = 11.25, id = 302
>
> Is there any built-in Calcite rule that I can invoke which can do this type
> of factorization of common where sub-conditions?
>
> Thanks!
>