You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2014/10/03 20:00:34 UTC
[jira] [Commented] (DRILL-1487) Drill window functions return wrong results

    [ https://issues.apache.org/jira/browse/DRILL-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158251#comment-14158251 ] 

Jinfeng Ni commented on DRILL-1487:
-----------------------------------

Looks like the incorrect results issue is caused by missing the following rows that are equal to the current row w.r.t the ORDER BY CLAUSE in the window frame. Here is the explanation of the concept of window frame in posgresql: 
 
"There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition."

If there is order by clause in the window definition, then, the window aggregate function would be applied to the windows frame, which consists the row fro the start of the partition up through the current row, plus any following rows that are equal to the current row w.r.t ORDER BY clause.  In the original query, since ORDER BY is same as the PARTITION column, the window aggregate function should be applied to all the rows in the partition, while the current implementation seems to miss the rows of "that are equal to the current row according to the ORDER BY clause.".

If re-modify the original query into the following query, we can clearly see that avg_sal is computed over the start row to the current row, but ignore the following rows that are equal to current row accord to the ORDER BY clause.  

{code}

SELECT employee_id,position_id, salary, avg(salary) OVER (PARTITION BY position_id order by position_id) as avg_sal FROM cp.`employee.json` order by position_id limit 50;
+-------------+-------------+------------+------------+
| employee_id | position_id |   salary   |  avg_sal   |
+-------------+-------------+------------+------------+
| 1           | 1           | 80000.0    | 80000.0    |
| 5           | 2           | 35000.0    | 35000.0    |
| 2           | 2           | 40000.0    | 37500.0    |
| 4           | 2           | 40000.0    | 38333.333333333336 |
| 20          | 2           | 30000.0    | 36250.0    |
| 21          | 2           | 35000.0    | 36000.0    |
| 22          | 2           | 35000.0    | 35833.333333333336 |
| 6           | 3           | 25000.0    | 25000.0    |
| 7           | 4           | 15000.0    | 15000.0    |
| 10          | 5           | 50000.0    | 50000.0    |
| 40          | 6           | 10000.0    | 10000.0    |
| 39          | 6           | 10000.0    | 10000.0    |
| 37          | 6           | 6700.0     | 8900.0     |
| 38          | 6           | 8000.0     | 8675.0     |
| 42          | 7           | 5000.0     | 5000.0     |
| 36          | 7           | 45000.0    | 25000.0    |
| 41          | 7           | 8500.0     | 19500.0    |
| 43          | 8           | 6700.0     | 6700.0     |
| 44          | 8           | 5000.0     | 5850.0     |
| 47          | 9           | 6500.0     | 6500.0     |
| 48          | 9           | 7200.0     | 6850.0     |
| 49          | 9           | 5000.0     | 6233.333333333333 |
| 50          | 9           | 5000.0     | 5925.0     |
| 51          | 9           | 5000.0     | 5740.0     |
| 52          | 9           | 5000.0     | 5616.666666666667 |
| 45          | 9           | 6800.0     | 5785.714285714285 |
| 46          | 9           | 6600.0     | 5887.5     |

{code}

The fix seems to be to include the rows with equal ORDER BY columns in the window frame, when compute window aggregate function.


> Drill window functions return wrong results
> -------------------------------------------
>
>                 Key: DRILL-1487
>                 URL: https://issues.apache.org/jira/browse/DRILL-1487
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Neeraja
>
> Executing the following window function with the requirement to see how a given employee salary would compare to the avg(salary) in his/her position. 
> Query executes fine however returns wrong results(expect the avg(salary) to stay same for a given window (i.e position id)
> 0: jdbc:drill:zk=local> SELECT employee_id,position_id, salary, avg(salary) OVER (PARTITION BY position_id order by position_id) FROM cp.`employee.json` order by employee_id;
> +-------------+-------------+------------+------------+
> | employee_id | position_id |   salary   |   EXPR$3   |
> +-------------+-------------+------------+------------+
> | 1           | 1           | 80000.0    | 80000.0    |
> | 2           | 2           | 40000.0    | 37500.0    |
> | 4           | 2           | 40000.0    | 38333.333333333336 |
> | 5           | 2           | 35000.0    | 35000.0    |
> | 6           | 3           | 25000.0    | 25000.0    |
> | 7           | 4           | 15000.0    | 15000.0    |
> | 8           | 11          | 10000.0    | 14333.333333333334 |
> | 9           | 11          | 17000.0    | 17000.0    |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)