You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/07/15 18:01:00 UTC
[jira] [Commented] (IMPALA-1270) Consider adding distinct aggregation to subqueries as perf optimization

    [ https://issues.apache.org/jira/browse/IMPALA-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158589#comment-17158589 ] 

ASF subversion and git services commented on IMPALA-1270:
---------------------------------------------------------

Commit 63f5e8ec00d089dee7eee9fd47a931e356e2f985 in impala's branch refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=63f5e8e ]

IMPALA-1270: add distinct aggregation to semi joins

When generating plans with left semi/anti joins (typically
resulting from subquery rewrites), the planner now
considers inserting a distinct aggregation on the inner
side of the join. The decision is based on whether that
aggregation would reduce the number of rows by more than
75%. This is fairly conservative and the optimization
might be beneficial for smaller reductions, but the
conservative threshold is chosen to reduce the number
of potential plan regressions.

The aggregation can both reduce the # of rows and the
width of the rows, by projecting out unneeded slots.

ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION query option is
added to allow toggling the optimization.

Tests:
* Add positive and negative planner tests for various
  cases - including semi/anti joins, missing stats,
  broadcast/shuffle, different numbers of join predicates.
* Add some end-to-end tests to verify plans execute correctly.

Change-Id: Icbb955e805d9e764edf11c57b98f341b88a37fcc
Reviewed-on: http://gerrit.cloudera.org:8080/16180
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Consider adding distinct aggregation to subqueries as perf optimization
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-1270
>                 URL: https://issues.apache.org/jira/browse/IMPALA-1270
>             Project: IMPALA
>          Issue Type: Task
>          Components: Frontend
>    Affects Versions: Impala 2.0
>            Reporter: Nong Li
>            Assignee: Tim Armstrong
>            Priority: Minor
>              Labels: planner
>             Fix For: Impala 4.0
>
>
> We should consider other rewrites for exists. For q4, another rewrite is an inner join + distinct:
> {code}
> select
>   o_orderpriority,
>   count(distinct l_orderkey) as order_count
> from lineitem l
> inner join orders o
>   on (o.o_orderkey = l.l_orderkey and
>       l.l_commitdate < l.l_receiptdate)
> where
>   o_orderdate >= '1993-07-01' and
>   o_orderdate < '1993-10-01'
> group by
>   o_orderpriority
> order by
>   o_orderpriority
> {code}
> This can run much faster because we have more flexibility on how we execute the inner join. We get killed partitioning lineitem now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org