You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Takeshi Yamamuro (Jira)" <ji...@apache.org> on 2020/10/12 05:46:00 UTC

[jira] [Resolved] (SPARK-33116) Spark SQL window function with order by cause result incorrect

     [ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-33116.
--------------------------------------
    Resolution: Invalid

> Spark SQL window function with order by cause result incorrect
> --------------------------------------------------------------
>
>                 Key: SPARK-33116
>                 URL: https://issues.apache.org/jira/browse/SPARK-33116
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Will Du
>            Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2 Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min result is ok. Why other results are like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
> | drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
> | drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
> | drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
> | game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
> | game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
> | toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
> | toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
> 7 rows selected (0.436 seconds)
> Does it seem that we can only order by the columns after partition by clause?
> I do not think there are such limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org