You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2015/02/25 00:23:04 UTC
[jira] [Resolved] (DRILL-2242) Wrong result (more rows) when outer
query groups by subset of columns that inner query groups by
[ https://issues.apache.org/jira/browse/DRILL-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jinfeng Ni resolved DRILL-2242.
-------------------------------
Resolution: Fixed
Fixed in commit 3c85bd8.
> Wrong result (more rows) when outer query groups by subset of columns that inner query groups by
> ------------------------------------------------------------------------------------------------
>
> Key: DRILL-2242
> URL: https://issues.apache.org/jira/browse/DRILL-2242
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 0.7.0
> Reporter: Aman Sinha
> Assignee: Jinfeng Ni
> Priority: Critical
> Attachments: 0002-DRILL-2242-Propagate-distribution-trait-when-Project.patch
>
>
> The following query has a subquery that groups on 2 columns and outer query that queries on 1 of those columns. With slice_target = 1 to force exchanges, it produces incorrect result:
> {code}
> alter session set `planner.slice_target` = 1;
> select count(*) from
> (select l_partksy from
> (select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet`
> group by l_partkey, l_suppkey)
> group by l_partkey
> );
> +------------+
> | EXPR$0 |
> +------------+
> | 6227 |
> +------------+
> 1 row selected (1.522 seconds)
> {code}
> Correct result (from Postgres):
> {code}
> count
> -------
> 2000
> (1 row)
> {code}
> The cause appears to be related to distribution trait propagation. Here's the EXPLAIN plan:
> {code}
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
> 00-02 UnionExchange
> 01-01 StreamAgg(group=[{}], EXPR$0=[COUNT()])
> 01-02 Project($f0=[0])
> 01-03 HashAgg(group=[{0}])
> 01-04 Project(l_partkey=[$0])
> 01-05 HashAgg(group=[{0, 1}])
> 01-06 HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
> 02-01 HashAgg(group=[{0, 1}])
> 02-02 Project(l_partkey=[$1], l_suppkey=[$0])
> 02-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/lineitem.parquet]], selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`, `l_suppkey`]]])
> {code}
> Note that the HashExchange operator 06 does a distribute on 2 columns l_partkey and l_suppkey in order to perform the 2phase aggregation. These are the group-by columns. However, in the outer query's HashAgg, there is no re-distribution being done. It assumes that data is already hash distributed on l_partkey which is not correct.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)