You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Victoria Markman (JIRA)" <ji...@apache.org> on 2015/04/28 02:11:06 UTC
[jira] [Closed] (DRILL-1977) Wrong result with aggregation on top of UNION ALL operator

     [ https://issues.apache.org/jira/browse/DRILL-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Victoria Markman closed DRILL-1977.
-----------------------------------

> Wrong result with aggregation on top of UNION ALL operator
> ----------------------------------------------------------
>
>                 Key: DRILL-1977
>                 URL: https://issues.apache.org/jira/browse/DRILL-1977
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.8.0
>            Reporter: Victoria Markman
>            Assignee: Sean Hsuan-Yi Chu
>            Priority: Critical
>             Fix For: 0.8.0
>
>
> {code}
> Jan 10 01:17:13 root@~ ] # hadoop fs -cat /test/t.json
> { "a1": 0, "b1": 0, "c1": "true",   "d1" : "2015-01-02"}
> { "a1": 0, "b1": 0, "c1": "false" , "d1" : "2015-01-03"}
> { "a1": 0, "b1": 0, "c1": "false" , "d1" : "2015-01-04"}
> { "a1": 1, "b1": 1, "c1": "true" ,  "d1" : "2015-01-05"}
> { "a1": 1, "b1": 1, "c1": "true" ,  "d1" : "2015-01-06"}
> {code}
> Query below returns wrong result, we should have 4 groups, instead we get only two groups from right side of the union.
> Notice lack of project step above union all (unclear if it has anything to to with the bug)
> {code}
> 0: jdbc:drill:schema=dfs> select calc1, max(a1), min(b1), count(c1) from (select a1+10 as calc1, a1, b1, c1 from `t.json` union all select a1+100 as calc1, a1, b1, c1 from `t.json`) group by calc1;
> +------------+------------+------------+------------+
> |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
> +------------+------------+------------+------------+
> | 100        | 0          | 0          | 3          |
> | 101        | 1          | 1          | 2          |
> +------------+------------+------------+------------+
> 2 rows selected (0.181 seconds)
> 00-01      Project(calc1=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
> 00-02        HashAgg(group=[{0}], EXPR$1=[MAX($1)], EXPR$2=[MIN($2)], EXPR$3=[COUNT($3)])
> 00-03          UnionAll(all=[true])
> 00-05            Project(calc1=[+($2, 10)], a1=[$2], b1=[$1], c1=[$0])
> 00-07              Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
> 00-04            Project(calc1=[+($2, 100)], a1=[$2], b1=[$1], c1=[$0])
> 00-06              Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
> {code}
> Correct result, when order of the columns is different in aggregate functions: "min(b1), max(a1)" instead of "max(a1), min(b1)"
> {code}
> 0: jdbc:drill:schema=dfs> select calc1, min(b1), max(a1), count(c1) from (select a1+10 as calc1, a1, b1, c1 from `t.json` union all select a1+100 as calc1, a1, b1, c1 from `t.json`) group by calc1;
> +------------+------------+------------+------------+
> |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
> +------------+------------+------------+------------+
> | 10         | 0          | 0          | 3          |
> | 11         | 1          | 1          | 2          |
> | 100        | 0          | 0          | 3          |
> | 101        | 1          | 1          | 2          |
> +------------+------------+------------+------------+
> 4 rows selected (0.256 seconds)
> 00-01      Project(calc1=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
> 00-02        HashAgg(group=[{0}], EXPR$1=[MIN($1)], EXPR$2=[MAX($2)], EXPR$3=[COUNT($3)])
> 00-03          Project(calc1=[$0], b1=[$2], a1=[$1], c1=[$3])
> 00-04            UnionAll(all=[true])
> 00-06              Project(calc1=[+($2, 10)], a1=[$2], b1=[$1], c1=[$0])
> 00-08                Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]])
> 00-05              Project(calc1=[+($2, 100)], a1=[$2], b1=[$1], c1=[$0])
> 00-07                Scan(groupscan=[EasyGroupScan [selectionRoot=/test/t.json, numFiles=1, columns=[`a1`, `b1`, `c1`], files=[maprfs:/test/t.json]]]){code}
> Non-union query works correctly:
> {code}
> 0: jdbc:drill:schema=dfs> select a1+10 as calc1, min(a1), max(b1), count(c1) from `t.json` group by a1+10;
> +------------+------------+------------+------------+
> |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
> +------------+------------+------------+------------+
> | 10         | 0          | 0          | 3          |
> | 11         | 1          | 1          | 2          |
> +------------+------------+------------+------------+
> 2 rows selected (0.142 seconds)
> 0: jdbc:drill:schema=dfs> select a1+10 as calc1, min(b1), max(a1), count(c1) from `t.json` group by a1+10;
> +------------+------------+------------+------------+
> |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
> +------------+------------+------------+------------+
> | 10         | 0          | 0          | 3          |
> | 11         | 1          | 1          | 2          |
> +------------+------------+------------+------------+
> 2 rows selected (0.153 seconds)
> {code}
> If I switch legs of a UNION, we still get wrong result, but from the right side:
> {code}
> 0: jdbc:drill:schema=dfs> select calc1, max(a1), min(b1), count(c1) from (select a1+100 as calc1, a1, b1, c1 from `t.json` union all select a1+10 as calc1, a1, b1, c1 from `t.json`) group by calc1;
> +------------+------------+------------+------------+
> |   calc1    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
> +------------+------------+------------+------------+
> | 10         | 0          | 0          | 3          |
> | 11         | 1          | 1          | 2          |
> +------------+------------+------------+------------+
> 2 rows selected (0.19 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)