You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/06/24 08:09:42 UTC
[jira] [Created] (DRILL-3352) Extra re-distribution when evaluating
window function after GROUP BY
Aman Sinha created DRILL-3352:
---------------------------------
Summary: Extra re-distribution when evaluating window function after GROUP BY
Key: DRILL-3352
URL: https://issues.apache.org/jira/browse/DRILL-3352
Project: Apache Drill
Issue Type: Bug
Components: Query Planning & Optimization
Affects Versions: 1.0.0
Reporter: Aman Sinha
Assignee: Aman Sinha
Consider the following query and plan:
{code}
explain plan for select min(l_partkey) over (partition by l_suppkey) from lineitem group by l_partkey, l_suppkey limit 1;
00-00 Screen
00-01 Project(EXPR$0=[$0])
00-02 SelectionVectorRemover
00-03 Limit(fetch=[1])
00-04 UnionExchange
01-01 Project($0=[$3])
01-02 Window(window#0=[window(partition {1} order by [] range between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [MIN($0)])])
01-03 SelectionVectorRemover
01-04 Sort(sort0=[$1], dir0=[ASC])
01-05 Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2])
01-06 HashToRandomExchange(dist0=[[$1]])
02-01 UnorderedMuxExchange
03-01 Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
03-02 HashAgg(group=[{0, 1}], agg#0=[MIN($2)])
03-03 Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2])
03-04 HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
04-01 UnorderedMuxExchange
05-01 Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1, hash64AsDouble($0)))])
05-02 HashAgg(group=[{0, 1}], agg#0=[MIN($0)])
05-03 Project(l_partkey=[$1], l_suppkey=[$0])
05-04 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/Users/asinha/data/tpch-sf1/lineitem]], selectionRoot=/Users/asinha/data/tpch-sf1/lineitem, numFiles=1, columns=[`l_partkey`, `l_suppkey`]]])
{code}
Here, we do a distribution for the HashAgg on 2 columns: {l_partkey, l_suppkey}. Subsequently, we re-distribute on {l_suppkey} only since the window function has a partition-by l_suppkey. The second re-distribute could be avoided if the first distribution for the HashAgg was done on l_suppkey only. The reason we do distribution on all grouping columns is to avoid skew problems. However, in many cases especially when a window function is involved, it may make sense to only distribute on 1 column.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)