You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Will Du (Jira)" <ji...@apache.org> on 2020/10/12 02:53:00 UTC

[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

     [ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Du updated SPARK-33116:
----------------------------
    Description: 
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min result is ok. Why other results are like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
| drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
| drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
| drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
| game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
| game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
| toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
| toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
7 rows selected (0.436 seconds)

  was:
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct.
SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
| drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
| drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
| drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
| game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
| game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
| toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
| toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
7 rows selected (0.436 seconds)


> Spark SQL window function with order by cause result incorrect
> --------------------------------------------------------------
>
>                 Key: SPARK-33116
>                 URL: https://issues.apache.org/jira/browse/SPARK-33116
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Will Du
>            Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min result is ok. Why other results are like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
> | drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
> | drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
> | drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
> | game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
> | game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
> | toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
> | toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
> 7 rows selected (0.436 seconds)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org