You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/02/13 04:13:11 UTC

[jira] [Created] (DRILL-2242) Wrong result (more rows) when outer query groups by subset of columns that inner query groups by

Aman Sinha created DRILL-2242:
---------------------------------

             Summary: Wrong result (more rows) when outer query groups by subset of columns that inner query groups by
                 Key: DRILL-2242
                 URL: https://issues.apache.org/jira/browse/DRILL-2242
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.7.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha
            Priority: Critical


The following query has a subquery that groups on 2 columns and outer query that queries on 1 of those columns.  With slice_target = 1 to force exchanges, it produces incorrect result: 
{code}
alter session set `planner.slice_target` = 1;
select count(*) from 
 (select l_partksy from
   (select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet`
      group by l_partkey, l_suppkey) 
   group by l_partkey
 );

+------------+
|   EXPR$0   |
+------------+
| 6227       |
+------------+
1 row selected (1.522 seconds)
{code}

Correct result (from Postgres): 
{code}
 count
-------
  2000
(1 row)
{code}

The cause appears to be related to distribution trait propagation.  Here's the EXPLAIN plan: 
{code}
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
00-02        UnionExchange
01-01          StreamAgg(group=[{}], EXPR$0=[COUNT()])
01-02            Project($f0=[0])
01-03              HashAgg(group=[{0}])
01-04                Project(l_partkey=[$0])
01-05                  HashAgg(group=[{0, 1}])
01-06                    HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
02-01                      HashAgg(group=[{0, 1}])
02-02                        Project(l_partkey=[$1], l_suppkey=[$0])
02-03                          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/lineitem.parquet]], selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`, `l_suppkey`]]])
{code}

Note that the HashExchange operator 06 does a distribute on 2 columns l_partkey and l_suppkey in order to perform the 2phase aggregation. These are the group-by columns.  However, in the outer query's HashAgg, there is no re-distribution being done.  It assumes that data is already hash distributed on l_partkey which is not correct. 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)