You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Julian Hyde (JIRA)" <ji...@apache.org> on 2018/03/08 00:16:00 UTC

[jira] [Commented] (CALCITE-2202) Aggregate Join Push-down on a Single Side

    [ https://issues.apache.org/jira/browse/CALCITE-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390494#comment-16390494 ] 

Julian Hyde commented on CALCITE-2202:
--------------------------------------

Your analysis seems mostly correct.

One thing you forgot is "GROUP BY ()". "()" is a 0-tuple that has 1 value. "SELECT ... FROM t GROUP BY ()" has one row, even if t is empty. This can have some surprising effects if you are assuming, for instance, that an aggregate never has more rows than its input relation.

Outer joins are surprisingly common, so it's worth dealing with them properly. I am not convinced that you can apply your partitioning argument to outer-theta-join, for example.

You are correct that "COUNT(*)" counts all rows but I like to think of it as a special case of n-ary count where n = 0. (Usual COUNT is 1-ary, e.g. "COUNT(a)", but you can have 2-ary, "COUNT(a, b)", etc.) "COUNT", like most aggregate functions, ignores a row if any of its arguments are null, and 0-ary count clearly never receives a null argument.

When pushing expressions (filters and projects) through outer joins, we needed to know whether an expression "f(a)" being not-null after the join meant that "a" was not-null before the join. If that reasoning is useful for aggregates, see the {{Strong}} class.

Regarding your statement
{quote}it's apparent that aggregation can be pushed on on a single side (either side), and leave the other side non-aggregated, regardless of whether grouping columns are unique on the other side
{quote}
I don't agree. Duplicates do matter. Consider the following schema where there are duplicate rows with deptno = 10 in both emp and dept tables:
{code:java}
create table dept (deptno int, name char(20));
insert into dept values (10, 'sales');
insert into dept values (10, 'marketing');
insert into dept values (20, 'engineering');
create table emp (deptno int, sal int);
insert into emp values (10, 200);
insert into emp values (10, 300);
insert into emp values (10, 500);
select emp.deptno, sum(sal) as s
from emp join dept on emp.deptno = dept.deptno
group by emp.deptno;;
{code}
returns s = 2000 (200 * 2 + 300 * 2 + 500 * 2). I call the 2 the "cross-multiplier", because there are 2 values with deptno = 10 in dept.

The pushed-down query is
{code:java}
select sum(e.s * d.c)
from (select deptno, sum(sal) as s from emp) as e
join (select deptno, count(*) as c from dept) as d
on e.deptno = d.deptno
group by e.deptno{code}
and you can see the cross-multiplication in {{sum(e.s * d.c)}}.

If you know that {{dept.deptno}} is unique then you know that d.c is always 1, and so {{sum(e.s * d.c)}} becomes just {{sum(e.s)}}. That's how I'd approach it – first get the formula right without assuming uniqueness constraints, then simplify given the uniqueness constraints you have.

> Aggregate Join Push-down on a Single Side
> -----------------------------------------
>
>                 Key: CALCITE-2202
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2202
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: next
>            Reporter: Zhong Yu
>            Assignee: Julian Hyde
>            Priority: Major
>             Fix For: next
>
>
> While investigating https://issues.apache.org/jira/browse/CALCITE-2195, it's apparent that aggregation can be pushed on on a single side (either side), and leave the other side non-aggregated, regardless of whether grouping columns are unique on the other side. My analysis – [http://zhong-j-yu.github.io/aggregate-join-push-down.pdf] .
> This may be useful when the metadata is insufficient; in any case, we may try to provide all 3 possible transformations (aggregate on left only; right only; both sides) to the cost based optimizer, so that the cheapest one can be chosen based on stats. 
> Does this make any sense, anybody? If it sounds good, I'll implement it and offer a PR. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)