You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Will Du (Jira)" <ji...@apache.org> on 2020/10/12 02:52:00 UTC

[jira] [Created] (SPARK-33116) Spark SQL window function with order by cause result incorrect

Will Du created SPARK-33116:
-------------------------------

             Summary: Spark SQL window function with order by cause result incorrect
                 Key: SPARK-33116
                 URL: https://issues.apache.org/jira/browse/SPARK-33116
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.1
            Reporter: Will Du


Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct.
SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
| drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
| drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
| drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
| game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
| game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
| toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
| toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
7 rows selected (0.436 seconds)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org