Posted to dev@drill.apache.org by "Victoria Markman (JIRA)" <ji...@apache.org> on 2015/01/28 17:30:37 UTC
[jira] [Created] (DRILL-2092) Incorrect result with count distinct and sum aggregates

Victoria Markman created DRILL-2092:
---------------------------------------

             Summary: Incorrect result with count distinct and sum aggregates
                 Key: DRILL-2092
                 URL: https://issues.apache.org/jira/browse/DRILL-2092
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.8.0
            Reporter: Victoria Markman
            Assignee: Jinfeng Ni


Contents of test.json:
{code}
{ "a1" : 10 , "b1" : 10 }
{ "a1" : 20 , "b1" : 20 }
{ "a1" : 20 , "b1" : 20}
{ "a1" : 30 , "b1" : 30 }
{ "a1" : null , "b1": null}
{code}

A query with only count(distinct a1) returns all four groups, including the null group:

{code}
0: jdbc:drill:schema=dfs> select a1, count(distinct a1) from `test.json` group by a1;
+------------+------------+
|     a1     |   EXPR$1   |
+------------+------------+
| 10         | 1          |
| 20         | 1          |
| 30         | 1          |
| null       | 0          |
+------------+------------+
4 rows selected (0.096 seconds)
{code}

If I add sum on the same column, I get a wrong result (the null group is missing):

{code}
0: jdbc:drill:schema=dfs> select a1, count(distinct a1), sum(a1) from `test.json` group by a1;
+------------+------------+------------+
|     a1     |   EXPR$1   |   EXPR$2   |
+------------+------------+------------+
| 10         | 1          | 10         |
| 20         | 1          | 40         |
| 30         | 1          | 30         |
+------------+------------+------------+
3 rows selected (0.137 seconds)
{code}

Non-distinct count works correctly:

{code}
0: jdbc:drill:schema=dfs> select a1, count(a1), sum(a1) from `test.json` group by a1;
+------------+------------+------------+
|     a1     |   EXPR$1   |   EXPR$2   |
+------------+------------+------------+
| 10         | 1          | 10         |
| 20         | 2          | 40         |
| 30         | 1          | 30         |
| null       | 0          | null       |
+------------+------------+------------+
4 rows selected (0.187 seconds)
{code}

Plan for the query with the wrong result:
{code}
00-01      Project(a1=[$0], EXPR$1=[$1], EXPR$2=[$2])
00-02        Project(a1=[$0], EXPR$1=[$3], EXPR$2=[$1])
00-03          HashJoin(condition=[IS NOT DISTINCT FROM($0, $2)], joinType=[inner])
00-05            HashAgg(group=[{0}], EXPR$2=[SUM($0)])
00-07              Scan(groupscan=[EasyGroupScan [selectionRoot=/test.json, numFiles=1, columns=[`a1`], files=[maprfs:/test.json]]])
00-04            Project(a10=[$0], EXPR$1=[$1])
00-06              HashAgg(group=[{0}], EXPR$1=[COUNT($0)])
00-08                HashAgg(group=[{0}])
00-09                  Scan(groupscan=[EasyGroupScan [selectionRoot=/test.json, numFiles=1, columns=[`a1`], files=[maprfs:/test.json]]])
{code}
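
The plan above computes SUM and COUNT(DISTINCT) in two separate aggregations and combines them with an inner HashJoin on the group key, using IS NOT DISTINCT FROM as the condition. The following is a minimal Python sketch of that shape, not Drill code. The helper names (sum_by_group, count_distinct_by_group, join) are hypothetical, and the idea that the join ends up comparing keys with plain equality is an assumption that this report does not confirm. It only shows that a null-safe comparison keeps the null group, while a plain equality comparison reproduces the three-row output reported above.

```python
# Values of a1 from test.json; None stands for JSON null.
rows = [10, 20, 20, 30, None]


def sum_by_group(values):
    """Model of: select a1, sum(a1) ... group by a1.
    SUM over a group that contains only nulls is null."""
    out = {}
    for v in values:
        if v is None:
            out.setdefault(None, None)
        else:
            out[v] = (out.get(v) or 0) + v
    return out


def count_distinct_by_group(values):
    """Model of: select a1, count(distinct a1) ... group by a1.
    Nulls are not counted, so the null group gets 0."""
    return {v: (0 if v is None else 1) for v in set(values)}


def join(left, right, null_safe):
    """Inner join of the two aggregates on the group key.
    null_safe=True  models IS NOT DISTINCT FROM (null matches null).
    null_safe=False models plain '=' (null never matches anything)."""
    result = []
    for lk, total in left.items():
        for rk, cnt in right.items():
            if lk is None or rk is None:
                match = null_safe and lk is None and rk is None
            else:
                match = lk == rk
            if match:
                result.append((lk, cnt, total))
    # Sort with the null group last, as in the sqlline output.
    return sorted(result, key=lambda r: (r[0] is None, r[0] or 0))


sums = sum_by_group(rows)
counts = count_distinct_by_group(rows)

expected = join(sums, counts, null_safe=True)   # four rows, null group kept
observed = join(sums, counts, null_safe=False)  # three rows, null group dropped
```

Under these assumptions, expected contains the row (None, 0, None) in addition to the three non-null groups, which is the result the query should return, while observed matches the three rows in the incorrect output.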
