You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Will Du (Jira)" <ji...@apache.org> on 2020/10/12 13:19:00 UTC
[jira] [Comment Edited] (SPARK-33116) Spark SQL window function
with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359 ]
Will Du edited comment on SPARK-33116 at 10/12/20, 1:18 PM:
------------------------------------------------------------
[~maropu], the statement you mentioned is comparing queries with PARTITION BY and without PARTITION BY. But if you look at the query I provided, both of them have PARTITION BY clause. The only difference is the ORDER BY clause added or not. The expected result I think should be the same on both queries except the orders of rows (by price).
was (Author: willddy):
[~maropu], the statement is comparing query with PARTITION BY and without PARTITION BY. But if you look at the query I provided, both of them have PARTITION BY clause. The only difference is the ORDER BY clause added or not. The expected result I think should be the same on both queries except the orders of rows (by price).
> Spark SQL window function with order by cause result incorrect
> --------------------------------------------------------------
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Will Du
> Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col, the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.463333 | 3 |
> | drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.463333 | 3 |
> | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.463333 | 3 |
> | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2 Query with ORDER BY after PARTITION BY col, the result is NOT correct. Min result is ok. Why other results are like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*) over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90 | 8.90 | 8.90 | 8.90 | 8.900000 | 1 |
> | drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2 |
> | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.463333 | 3 |
> | game | 70.00 | 70.00 | 70.00 | 70.00 | 70.000000 | 1 |
> | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy | 5.90 | 5.90 | 5.90 | 5.90 | 5.900000 | 1 |
> | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.436 seconds)
> Does it seem that we can only order by the columns after partition by clause?
> I do not think there are such limitation in standard SQL.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org