You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/06/24 08:09:42 UTC

[jira] [Created] (DRILL-3352) Extra re-distribution when evaluating window function after GROUP BY

Aman Sinha created DRILL-3352:
---------------------------------

             Summary: Extra re-distribution when evaluating window function after GROUP BY
                 Key: DRILL-3352
                 URL: https://issues.apache.org/jira/browse/DRILL-3352
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 1.0.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha


Consider the following query and plan: 
{code}
explain plan for select min(l_partkey) over (partition by l_suppkey) from lineitem group by l_partkey, l_suppkey limit 1;

00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        SelectionVectorRemover
00-03          Limit(fetch=[1])
00-04            UnionExchange
01-01              Project($0=[$3])
01-02                Window(window#0=[window(partition {1} order by [] range between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [MIN($0)])])
01-03                  SelectionVectorRemover
01-04                    Sort(sort0=[$1], dir0=[ASC])
01-05                      Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2])
01-06                        HashToRandomExchange(dist0=[[$1]])
02-01                          UnorderedMuxExchange
03-01                            Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
03-02                              HashAgg(group=[{0, 1}], agg#0=[MIN($2)])
03-03                                Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2])
03-04                                  HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
04-01                                    UnorderedMuxExchange
05-01                                      Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1, hash64AsDouble($0)))])
05-02                                        HashAgg(group=[{0, 1}], agg#0=[MIN($0)])
05-03                                          Project(l_partkey=[$1], l_suppkey=[$0])
05-04                                            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/Users/asinha/data/tpch-sf1/lineitem]], selectionRoot=/Users/asinha/data/tpch-sf1/lineitem, numFiles=1, columns=[`l_partkey`, `l_suppkey`]]])
{code}

Here, we do a distribution for the HashAgg on 2 columns: {l_partkey, l_suppkey}.  Subsequently, we re-distribute on {l_suppkey} only since the window function has a partition-by l_suppkey.  The second re-distribute could be avoided if the first distribution for the HashAgg was done on l_suppkey only.   The reason we do distribution on all grouping columns is to avoid skew problems.   However, in many cases especially when a window function is involved, it may make sense to only distribute on 1 column. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)