You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2015/06/11 18:23:00 UTC
[jira] [Commented] (DRILL-3277) SUM(CAST(columns[0] AS INT)) OVER(...) gives wrong results

    [ https://issues.apache.org/jira/browse/DRILL-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582176#comment-14582176 ] 

Deneche A. Hakim commented on DRILL-3277:
-----------------------------------------

I attached a CSV file that shows the problem. The file contains one single value for column 2 which means window function will only handle one partition. 

The following queries should give the same result but they don't:
{noformat}
SELECT SUM(CAST(columns[0] AS INT)) FROM `onecol-sm.csv`;
+--------------+
|    EXPR$0    |
+--------------+
| -3216087191  |
+--------------+
{noformat}

{noformat}
SELECT SUM(CAST(columns[0] AS INT)) OVER(PARTITION BY columns[1]) FROM `onecol-sm.csv` LIMIT 1;
+-------------+
|   EXPR$0    |
+-------------+
| 1078880105  |
+-------------+
{noformat}

Here is the plan for the window query:
{noformat}
EXPLAIN PLAN FOR SELECT SUM(CAST(columns[0] AS INT)) OVER(PARTITION BY columns[1]) FROM `onecol-sm.csv`;
00-00    Screen
00-01      Project(EXPR$0=[CASE(>($2, 0), CAST($3):INTEGER, null)])
00-02        Window(window#0=[window(partition {1} order by [] range between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [COUNT($0), $SUM0($0)])])
00-03          SelectionVectorRemover
00-04            Sort(sort0=[$1], dir0=[ASC])
00-05              Project($0=[CAST(ITEM($0, 0)):INTEGER], $1=[ITEM($0, 1)])
00-06                Scan(groupscan=[EasyGroupScan [selectionRoot=/Users/hakim/MapR/data/onecol-sm.csv, numFiles=1, columns=[`columns`[0], `columns`[1]], files=[file:/Users/hakim/MapR/data/onecol-sm.csv]]])
{noformat}

Using a hack allowed by DRILL-3210 we can look at the value computed by the window function:
{noformat}
SELECT *, SUM(CAST(columns[0] AS INT)) OVER(PARTITION BY columns[1]) FROM `onecol-sm.csv` LIMIT 1;
+--------------------+-----------+-----+--------+--------------+-------------+-------------+
|    T4¦¦columns     |    $1     | $2  | w0$o0  |    w0$o1     |   EXPR$1    |   EXPR$10   |
+--------------------+-----------+-----+--------+--------------+-------------+-------------+
| ["55445062","AK"]  | 55445062  | AK  | 29     | -3216087191  | 1078880105  | 1078880105  |
+--------------------+-----------+-----+--------+--------------+-------------+-------------+
{noformat}

We can see that the value computed by the window function (column w04o1) is actually correct, 
but then the project 00-01 casts back the value to INTEGER. This value cannot be stored as in int so it overflows.

> SUM(CAST(columns[0] AS INT)) OVER(...) gives wrong results
> ----------------------------------------------------------
>
>                 Key: DRILL-3277
>                 URL: https://issues.apache.org/jira/browse/DRILL-3277
>             Project: Apache Drill
>          Issue Type: Sub-task
>          Components: Execution - Flow
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>              Labels: window_function
>             Fix For: 1.1.0
>
>         Attachments: onecol-sm.csv
>
>
> The following query return wrong results, there is a difference in actual vs expected results. Expected results were generated on Postgres 9.3
> {noformat}
> SELECT 
>   SUM(cast( columns[0] as INT )) 
>     OVER(PARTITION BY cast( columns[2] as CHAR(2))) 
> FROM `allData.csv`;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)