You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/02/13 04:13:11 UTC
[jira] [Created] (DRILL-2242) Wrong result (more rows) when outer
query groups by subset of columns that inner query groups by
Aman Sinha created DRILL-2242:
---------------------------------
Summary: Wrong result (more rows) when outer query groups by subset of columns that inner query groups by
Key: DRILL-2242
URL: https://issues.apache.org/jira/browse/DRILL-2242
Project: Apache Drill
Issue Type: Bug
Components: Query Planning & Optimization
Affects Versions: 0.7.0
Reporter: Aman Sinha
Assignee: Aman Sinha
Priority: Critical
The following query has a subquery that groups on 2 columns and outer query that queries on 1 of those columns. With slice_target = 1 to force exchanges, it produces incorrect result:
{code}
alter session set `planner.slice_target` = 1;
select count(*) from
(select l_partksy from
(select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet`
group by l_partkey, l_suppkey)
group by l_partkey
);
+------------+
| EXPR$0 |
+------------+
| 6227 |
+------------+
1 row selected (1.522 seconds)
{code}
Correct result (from Postgres):
{code}
count
-------
2000
(1 row)
{code}
The cause appears to be related to distribution trait propagation. Here's the EXPLAIN plan:
{code}
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
00-02 UnionExchange
01-01 StreamAgg(group=[{}], EXPR$0=[COUNT()])
01-02 Project($f0=[0])
01-03 HashAgg(group=[{0}])
01-04 Project(l_partkey=[$0])
01-05 HashAgg(group=[{0, 1}])
01-06 HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
02-01 HashAgg(group=[{0, 1}])
02-02 Project(l_partkey=[$1], l_suppkey=[$0])
02-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/lineitem.parquet]], selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`, `l_suppkey`]]])
{code}
Note that the HashExchange operator 06 does a distribute on 2 columns l_partkey and l_suppkey in order to perform the 2phase aggregation. These are the group-by columns. However, in the outer query's HashAgg, there is no re-distribution being done. It assumes that data is already hash distributed on l_partkey which is not correct.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)