Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/08/31 18:35:00 UTC

[jira] [Commented] (IMPALA-5260) Have query optimizer make joined tables distinct to improve performance

    [ https://issues.apache.org/jira/browse/IMPALA-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187946#comment-17187946 ] 

ASF subversion and git services commented on IMPALA-5260:
---------------------------------------------------------

Commit 827070b473c02da480f3a9d77c59f7031f9070c2 in impala's branch refs/heads/master from Shant Hovsepian
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=827070b ]

IMPALA-10099: Push down DISTINCT in Set operations

INTERSECT/EXCEPT are not duplicate-preserving operations. The distinct
aggregation can happen in each operand, in the leftmost operand only, or
after all the operands in a separate aggregation step. Except for a
couple of special cases, we would use the last strategy most often.

This change pushes the distinct aggregation down to the leftmost operand
in cases where there are no analytic functions, or when a distinct or
grouping operation already eliminates duplicates.
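
A minimal sketch of why de-duplicating only the leftmost operand gives
the same rows as a separate final aggregation. It is written in Python,
not taken from the patch. It models INTERSECT as a semi-join and EXCEPT
as an anti-join over lists of rows; that formulation, the toy data, and
every function name below are assumptions made for illustration.

```python
def distinct(rows):
    """Drop duplicate rows, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def semi_join(left, right):
    """Keep every left row (duplicates included) that also occurs in right."""
    right_rows = set(right)
    return [row for row in left if row in right_rows]


def anti_join(left, right):
    """Keep every left row (duplicates included) that does not occur in right."""
    right_rows = set(right)
    return [row for row in left if row not in right_rows]


def intersect_distinct_last(left, right):
    # Combine first, then remove duplicates in a separate final step.
    return distinct(semi_join(left, right))


def intersect_distinct_leftmost(left, right):
    # Remove duplicates from the leftmost operand only, then combine.
    return semi_join(distinct(left), right)


def except_distinct_last(left, right):
    return distinct(anti_join(left, right))


def except_distinct_leftmost(left, right):
    return anti_join(distinct(left), right)


# Hypothetical single-column operands with duplicates on both sides.
left = [(1,), (1,), (2,), (3,), (3,), (3,)]
right = [(1,), (3,), (3,), (4,)]

print(intersect_distinct_last(left, right), intersect_distinct_leftmost(left, right))
print(except_distinct_last(left, right), except_distinct_leftmost(left, right))
```

In this model both placements return the same rows, but the leftmost
placement feeds 3 rows into the join step instead of 6, which is the
kind of saving the change is after.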

In general, DISTINCT placement such as in this case should be done
throughout the entire plan tree in a cost-based manner, as described in
IMPALA-5260.

Testing:
 * TpcdsPlannerTest
 * PlannerTest
 * TPC-DS 30TB Perf run for any affected queries
   - Q14-1 180s -> 150s
   - Q14-2 109s -> 90s
   - Q8 no significant change
 * SetOperation Planner Tests
 * Analyzer tests
 * Tpcds Functional Workload

Change-Id: Ia248f1595df2ab48fbe70c778c7c32bde5c518a5
Reviewed-on: http://gerrit.cloudera.org:8080/16350
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Tim Armstrong <ta...@cloudera.com>


> Have query optimizer make joined tables distinct to improve performance
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-5260
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5260
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
>            Reporter: Michael Sokalski
>            Priority: Minor
>              Labels: perfomance, planner
>
> Consider the following select statement:
> {code:sql}
> select tB.bField, count(tA.aField) ct
> from tableA tA
> join tableB tB using (id)
> where (...)
> group by tB.bField
> order by ct
> {code}
> If tableB has a large number of rows (but still fewer than tableA), this query can be orders of magnitude slower than the equivalent query:
> {code:sql}
> select tB.bField, count(tA.aField) ct
> from tableA tA
> join (select distinct bField, id[, ...] from tableB) tB using (id)
> where (...)
> group by tB.bField
> order by ct
> {code}
> It appears to me that the slower query gets bogged down with shuttling unnecessary data between nodes.
> Is it possible, and beneficial, to make such a query improvement implicit in Impala's query optimizer?
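
A small sketch for checking when the two queries above agree. It runs
them with Python's standard sqlite3 module on toy data, not on Impala.
The elided `where (...)` clause and the optional extra columns are left
out, and the table contents and the `extra` column are hypothetical. In
this sketch the counts match when each (id, bField) pair occurs once in
tableB, and differ when a pair repeats, because the DISTINCT subquery
then removes rows that the plain join would have counted.

```python
import sqlite3

ORIGINAL = """
    SELECT tB.bField, count(tA.aField) AS ct
    FROM tableA tA
    JOIN tableB tB USING (id)
    GROUP BY tB.bField
    ORDER BY ct
"""

REWRITTEN = """
    SELECT tB.bField, count(tA.aField) AS ct
    FROM tableA tA
    JOIN (SELECT DISTINCT bField, id FROM tableB) tB USING (id)
    GROUP BY tB.bField
    ORDER BY ct
"""


def run_both(table_b_rows):
    """Load toy data into an in-memory database and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tableA (id INTEGER, aField TEXT)")
    conn.execute("CREATE TABLE tableB (id INTEGER, bField TEXT, extra TEXT)")
    conn.executemany(
        "INSERT INTO tableA VALUES (?, ?)",
        [(1, "a1"), (1, "a2"), (2, "a3")],
    )
    conn.executemany("INSERT INTO tableB VALUES (?, ?, ?)", table_b_rows)
    original = conn.execute(ORIGINAL).fetchall()
    rewritten = conn.execute(REWRITTEN).fetchall()
    conn.close()
    return original, rewritten


# Each (id, bField) pair appears once: both queries agree.
unique_pairs = [(1, "x", "p"), (2, "y", "q")]
# The pair (1, "x") appears twice: the plain join counts it twice.
repeated_pairs = [(1, "x", "p"), (1, "x", "q"), (2, "y", "r")]

print(run_both(unique_pairs))
print(run_both(repeated_pairs))
```

An optimizer applying this rewrite implicitly would therefore need to
know that the de-duplication cannot change the aggregate, or adjust the
aggregate to compensate.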


