You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Sean Broeder <se...@dremio.com> on 2020/07/07 18:57:24 UTC

Correlated subquery losing aggregator operator in inner join?

Hi,
I am following up on JIRA-4100.  I am trying to understand the plan generated by the SQL query select  e.empno, e.sal, e.deptno emp_dept, d.deptno dep_dept  
from emp e 
left join
dept d
    on e.deptno = (
         select max(sal)
         from emp
         where deptno = e.deptno)

The plan produced is as follows
LogicalProject(EMPNO=[$0], SAL=[$5], EMP_DEPT=[$7], DEP_DEPT=[$9])
  LogicalJoin(condition=[=($0, $7)], joinType=[left])
    LogicalTableScan(table=[[CATALOG, SALES, EMP]])
    LogicalTableScan(table=[[CATALOG, SALES, DEPT]])

It seems to me that the MAX operator is still needed in the correlated subquery before the join, but it is dropped in the aggregator evaluation.  

Interestingly, a similar query using EXISTS retains the aggregator operator:
Select * from emp where exists select 1 from dept where emp.deptno=dept.deptno

This results in the following plan
LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8])
  LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8], DEPTNO0=[CAST($9):INTEGER], $f1=[CAST($10):BOOLEAN])
    LogicalJoin(condition=[=($7, $9)], joinType=[inner])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
        LogicalProject(DEPTNO=[$0], $f0=[true])
          LogicalTableScan(table=[[CATALOG, SALES, DEPT]])


Based on the exists example, I expect the logical plan should also contain something like after the scan on dept:
      LogicalAggregate(group=[{0}], agg#0=[MAX($5)])
        LogicalProject(DEPTNO=[$9], $7=[true])

Thanks in advance for your input,
Sean



Re: Correlated subquery losing aggregator operator in inner join?

Posted by Rui Wang <am...@apache.org>.
Since this discussion is moved to email thread, I just copy something
I found and replied in that JIRA for the reference:

SqlToRelConverter ends with creating a RexSubQuery for that subquery,
and leaves it to JOIN condition. The string representation of
RexSubQuery seems just $7. If you check RexSubQuery, it hosts a
RelNode [1] for that subquery. So that operator is not dropped. It
might be just the intention of design, in which engines to implement
LogicalJoin can still access the subquery's RelNode and create the
correct execution plan.

There are some special logic to replace subquery and insert logical
nodes into plan tree (e.g. for IN) [2]. If a change is needed, likely
it should happen in [2].

[1]: https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rex/RexSubQuery.java#L39
[2]: https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L1054


-Rui


On Tue, Jul 7, 2020 at 11:57 AM Sean Broeder <se...@dremio.com> wrote:
>
> Hi,
> I am following up on JIRA-4100.  I am trying to understand the plan generated by the SQL query select  e.empno, e.sal, e.deptno emp_dept, d.deptno dep_dept
> from emp e
> left join
> dept d
>     on e.deptno = (
>          select max(sal)
>          from emp
>          where deptno = e.deptno)
>
> The plan produced is as follows
> LogicalProject(EMPNO=[$0], SAL=[$5], EMP_DEPT=[$7], DEP_DEPT=[$9])
>   LogicalJoin(condition=[=($0, $7)], joinType=[left])
>     LogicalTableScan(table=[[CATALOG, SALES, EMP]])
>     LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
>
> It seems to me that the MAX operator is still needed in the correlated subquery before the join, but it is dropped in the aggregator evaluation.
>
> Interestingly, a similar query using EXISTS retains the aggregator operator:
> Select * from emp where exists select 1 from dept where emp.deptno=dept.deptno
>
> This results in the following plan
> LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8])
>   LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8], DEPTNO0=[CAST($9):INTEGER], $f1=[CAST($10):BOOLEAN])
>     LogicalJoin(condition=[=($7, $9)], joinType=[inner])
>       LogicalTableScan(table=[[CATALOG, SALES, EMP]])
>       LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>         LogicalProject(DEPTNO=[$0], $f0=[true])
>           LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
>
>
> Based on the exists example, I expect the logical plan should also contain something like after the scan on dept:
>       LogicalAggregate(group=[{0}], agg#0=[MAX($5)])
>         LogicalProject(DEPTNO=[$9], $7=[true])
>
> Thanks in advance for your input,
> Sean
>
>