You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Alex Baden <al...@omnisci.com> on 2020/07/31 05:41:56 UTC

Pushing down aggregates through rhs of a left join

Hi all,

I have a query of the form:

SELECT a.x, SUM(b.y), SUM(b.z) FROM t1 a LEFT JOIN t2 b ON a.join_key
= b.join_key GROUP BY a.x ORDER BY a.x;

If table b has a large number of duplicate keys for `join_key`, the
left join can be very expensive to compute. Instead, we would like to
run:

WITH t2g AS (SELECT b.join_key, SUM(b.y), SUM(b.z) FROM t2 b GROUP BY
b.join_key)
SELECT a.x, SUM(bg.y), SUM(bg.z) FROM t1 a LEFT JOIN t2g bg ON
a.join_key = t2g.join_key GROUP BY a.x ORDER BY a.x;

Essentially, since we are only projecting aggregates from the rhs of
the join, and the aggregate functions are associative, we can group by
the join key to compute the aggregates up front, then join on the
grouped results, and finally aggregate among join matches.

Looking at the comments of the AGGREGATE_JOIN_TRANSPOSE_RULE, I noted
the following:
// OUTER joins are supported for group by without aggregate functions

But based on the above, if we have a left join I believe we can
transpose the aggregate and the join if the following conditions hold:
1) only expressions from the rhs of the join are aggregated
2) all aggregate functions from (1) are associative (can be split)
3) at least one expression from the lhs of the join is grouped

I am interested in implementing this rule (assuming the conditions
above are strong enough to guarantee correctness). Is there interest
in PRing something like this to calcite, either as a new rule or part
of AggregateJoinTranspose?

Thanks,
Alex

Re: Pushing down aggregates through rhs of a left join

Posted by Danny Chan <yu...@gmail.com>.
I didn’t see the derivation formula yet but I believe there is indeed some promotion space for the Agg Join transpose cases, Alex, can you log an issue there ?

Best,
Danny Chan
在 2020年7月31日 +0800 PM1:43,Alex Baden <al...@omnisci.com>,写道:
> Hi all,
>
> I have a query of the form:
>
> SELECT a.x, SUM(b.y), SUM(b.z) FROM t1 a LEFT JOIN t2 b ON a.join_key
> = b.join_key GROUP BY a.x ORDER BY a.x;
>
> If table b has a large number of duplicate keys for `join_key`, the
> left join can be very expensive to compute. Instead, we would like to
> run:
>
> WITH t2g AS (SELECT b.join_key, SUM(b.y), SUM(b.z) FROM t2 b GROUP BY
> b.join_key)
> SELECT a.x, SUM(bg.y), SUM(bg.z) FROM t1 a LEFT JOIN t2g bg ON
> a.join_key = t2g.join_key GROUP BY a.x ORDER BY a.x;
>
> Essentially, since we are only projecting aggregates from the rhs of
> the join, and the aggregate functions are associative, we can group by
> the join key to compute the aggregates up front, then join on the
> grouped results, and finally aggregate among join matches.
>
> Looking at the comments of the AGGREGATE_JOIN_TRANSPOSE_RULE, I noted
> the following:
> // OUTER joins are supported for group by without aggregate functions
>
> But based on the above, if we have a left join I believe we can
> transpose the aggregate and the join if the following conditions hold:
> 1) only expressions from the rhs of the join are aggregated
> 2) all aggregate functions from (1) are associative (can be split)
> 3) at least one expression from the lhs of the join is grouped
>
> I am interested in implementing this rule (assuming the conditions
> above are strong enough to guarantee correctness). Is there interest
> in PRing something like this to calcite, either as a new rule or part
> of AggregateJoinTranspose?
>
> Thanks,
> Alex

Re: Pushing down aggregates through rhs of a left join

Posted by Haisheng Yuan <hy...@apache.org>.
>  Is there interest
>  in PRing something like this to calcite, either as a new rule or part
>  of AggregateJoinTranspose?

Yes. It is better to be part of AggregateJoinTranspose.

On 2020/07/31 05:41:56, Alex Baden <al...@omnisci.com> wrote: 
> Hi all,
> 
> I have a query of the form:
> 
> SELECT a.x, SUM(b.y), SUM(b.z) FROM t1 a LEFT JOIN t2 b ON a.join_key
> = b.join_key GROUP BY a.x ORDER BY a.x;
> 
> If table b has a large number of duplicate keys for `join_key`, the
> left join can be very expensive to compute. Instead, we would like to
> run:
> 
> WITH t2g AS (SELECT b.join_key, SUM(b.y), SUM(b.z) FROM t2 b GROUP BY
> b.join_key)
> SELECT a.x, SUM(bg.y), SUM(bg.z) FROM t1 a LEFT JOIN t2g bg ON
> a.join_key = t2g.join_key GROUP BY a.x ORDER BY a.x;
> 
> Essentially, since we are only projecting aggregates from the rhs of
> the join, and the aggregate functions are associative, we can group by
> the join key to compute the aggregates up front, then join on the
> grouped results, and finally aggregate among join matches.
> 
> Looking at the comments of the AGGREGATE_JOIN_TRANSPOSE_RULE, I noted
> the following:
> // OUTER joins are supported for group by without aggregate functions
> 
> But based on the above, if we have a left join I believe we can
> transpose the aggregate and the join if the following conditions hold:
> 1) only expressions from the rhs of the join are aggregated
> 2) all aggregate functions from (1) are associative (can be split)
> 3) at least one expression from the lhs of the join is grouped
> 
> I am interested in implementing this rule (assuming the conditions
> above are strong enough to guarantee correctness). Is there interest
> in PRing something like this to calcite, either as a new rule or part
> of AggregateJoinTranspose?
> 
> Thanks,
> Alex
>