Posted to dev@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2015/02/25 00:23:04 UTC
[jira] [Resolved] (DRILL-2242) Wrong result (more rows) when outer query groups by subset of columns that inner query groups by

     [ https://issues.apache.org/jira/browse/DRILL-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinfeng Ni resolved DRILL-2242.
-------------------------------
    Resolution: Fixed

Fixed in commit 3c85bd8. 

> Wrong result (more rows) when outer query groups by subset of columns that inner query groups by
> ------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2242
>                 URL: https://issues.apache.org/jira/browse/DRILL-2242
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0
>            Reporter: Aman Sinha
>            Assignee: Jinfeng Ni
>            Priority: Critical
>         Attachments: 0002-DRILL-2242-Propagate-distribution-trait-when-Project.patch
>
>
> The following query has a subquery that groups on 2 columns and an outer query that groups on 1 of those columns.  With slice_target = 1 to force exchanges, it produces an incorrect result: 
> {code}
> alter session set `planner.slice_target` = 1;
> select count(*) from 
>  (select l_partkey from
>    (select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet`
>       group by l_partkey, l_suppkey) 
>    group by l_partkey
>  );
> +------------+
> |   EXPR$0   |
> +------------+
> | 6227       |
> +------------+
> 1 row selected (1.522 seconds)
> {code}
> Correct result (from Postgres): 
> {code}
>  count
> -------
>   2000
> (1 row)
> {code}
> The cause appears to be related to distribution trait propagation.  Here's the EXPLAIN plan: 
> {code}
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
> 00-02        UnionExchange
> 01-01          StreamAgg(group=[{}], EXPR$0=[COUNT()])
> 01-02            Project($f0=[0])
> 01-03              HashAgg(group=[{0}])
> 01-04                Project(l_partkey=[$0])
> 01-05                  HashAgg(group=[{0, 1}])
> 01-06                    HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
> 02-01                      HashAgg(group=[{0, 1}])
> 02-02                        Project(l_partkey=[$1], l_suppkey=[$0])
> 02-03                          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/lineitem.parquet]], selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`, `l_suppkey`]]])
> {code}
> Note that the HashToRandomExchange operator (01-06) distributes on 2 columns, l_partkey and l_suppkey, in order to perform the two-phase aggregation. These are the inner query's group-by columns.  However, the outer query's HashAgg (01-03) does no re-distribution.  It assumes that the data is already hash distributed on l_partkey, which is not correct. 
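
The overcount described above can be reproduced with a toy model. The following is a minimal sketch in Python, not Drill code: the helper names (toy_hash, partition, local_group_count), the 4-way parallelism, and the synthetic rows are all assumptions made for illustration. It hash-distributes distinct (l_partkey, l_suppkey) pairs on both columns, then lets each partition group by l_partkey locally, as the plan does when it skips the re-distribution.

```python
from collections import defaultdict

def toy_hash(key):
    """Deterministic stand-in for the exchange's hash function (hypothetical)."""
    h = 0
    for v in key:
        h = h * 31 + v
    return h

def partition(rows, key_cols, n_parts):
    """Hash-distribute rows into n_parts buckets on the given column indexes."""
    parts = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        parts[toy_hash(key) % n_parts].append(row)
    return parts

def local_group_count(parts, col):
    """Each partition groups by `col` on its own; add up the per-partition group counts."""
    return sum(len({row[col] for row in rows}) for rows in parts.values())

# Synthetic output of the inner GROUP BY: distinct (l_partkey, l_suppkey) pairs,
# 10 part keys with 4 supplier keys each.
rows = [(p, s) for p in range(10) for s in range(4)]

# Distributed on both columns, then grouped by l_partkey without re-distribution:
# the same l_partkey lands in several partitions and is counted once per partition.
overcounted = local_group_count(partition(rows, (0, 1), 4), 0)

# Re-distributed on l_partkey alone: every l_partkey lives in exactly one partition.
correct = local_group_count(partition(rows, (0,), 4), 0)

print(overcounted, correct)
```

In this model, rows sharing an l_partkey are scattered across partitions whenever their l_suppkey values differ, so the summed per-partition group counts exceed the number of distinct l_partkey values. Distributing on l_partkey alone before the outer aggregation removes the duplication.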



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)