You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2015/06/11 18:23:00 UTC
[jira] [Commented] (DRILL-3277) SUM(CAST(columns[0] AS INT))
OVER(...) gives wrong results
[ https://issues.apache.org/jira/browse/DRILL-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582176#comment-14582176 ]
Deneche A. Hakim commented on DRILL-3277:
-----------------------------------------
I attached a CSV file that shows the problem. The file contains one single value for column 2 which means window function will only handle one partition.
The following queries should give the same result but they don't:
{noformat}
SELECT SUM(CAST(columns[0] AS INT)) FROM `onecol-sm.csv`;
+--------------+
| EXPR$0 |
+--------------+
| -3216087191 |
+--------------+
{noformat}
{noformat}
SELECT SUM(CAST(columns[0] AS INT)) OVER(PARTITION BY columns[1]) FROM `onecol-sm.csv` LIMIT 1;
+-------------+
| EXPR$0 |
+-------------+
| 1078880105 |
+-------------+
{noformat}
Here is the plan for the window query:
{noformat}
EXPLAIN PLAN FOR SELECT SUM(CAST(columns[0] AS INT)) OVER(PARTITION BY columns[1]) FROM `onecol-sm.csv`;
00-00 Screen
00-01 Project(EXPR$0=[CASE(>($2, 0), CAST($3):INTEGER, null)])
00-02 Window(window#0=[window(partition {1} order by [] range between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [COUNT($0), $SUM0($0)])])
00-03 SelectionVectorRemover
00-04 Sort(sort0=[$1], dir0=[ASC])
00-05 Project($0=[CAST(ITEM($0, 0)):INTEGER], $1=[ITEM($0, 1)])
00-06 Scan(groupscan=[EasyGroupScan [selectionRoot=/Users/hakim/MapR/data/onecol-sm.csv, numFiles=1, columns=[`columns`[0], `columns`[1]], files=[file:/Users/hakim/MapR/data/onecol-sm.csv]]])
{noformat}
Using a hack allowed by DRILL-3210 we can look at the value computed by the window function:
{noformat}
SELECT *, SUM(CAST(columns[0] AS INT)) OVER(PARTITION BY columns[1]) FROM `onecol-sm.csv` LIMIT 1;
+--------------------+-----------+-----+--------+--------------+-------------+-------------+
| T4¦¦columns | $1 | $2 | w0$o0 | w0$o1 | EXPR$1 | EXPR$10 |
+--------------------+-----------+-----+--------+--------------+-------------+-------------+
| ["55445062","AK"] | 55445062 | AK | 29 | -3216087191 | 1078880105 | 1078880105 |
+--------------------+-----------+-----+--------+--------------+-------------+-------------+
{noformat}
We can see that the value computed by the window function (column w04o1) is actually correct,
but then the project 00-01 casts back the value to INTEGER. This value cannot be stored as in int so it overflows.
> SUM(CAST(columns[0] AS INT)) OVER(...) gives wrong results
> ----------------------------------------------------------
>
> Key: DRILL-3277
> URL: https://issues.apache.org/jira/browse/DRILL-3277
> Project: Apache Drill
> Issue Type: Sub-task
> Components: Execution - Flow
> Reporter: Deneche A. Hakim
> Assignee: Deneche A. Hakim
> Labels: window_function
> Fix For: 1.1.0
>
> Attachments: onecol-sm.csv
>
>
> The following query return wrong results, there is a difference in actual vs expected results. Expected results were generated on Postgres 9.3
> {noformat}
> SELECT
> SUM(cast( columns[0] as INT ))
> OVER(PARTITION BY cast( columns[2] as CHAR(2)))
> FROM `allData.csv`;
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)