You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Takeshi Yamamuro (Jira)" <ji...@apache.org> on 2020/10/12 05:46:00 UTC
[jira] [Commented] (SPARK-33116) Spark SQL window function with order by cause result incorrect

    [ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212130#comment-17212130 ] 

Takeshi Yamamuro commented on SPARK-33116:
------------------------------------------

In SQL, they are different, I think. The statement below was cited from the PostgreSQL doc:
{code:java}
since there is no ORDER BY in the OVER clause, the window frame is the same as the partition, which for lack of PARTITION BY is the whole table; in other words each sum is taken over the whole table and so we get the same result for each output row. But if we add an ORDER BY clause, we get very different results:...
{code}
[https://www.postgresql.org/docs/current/tutorial-window.html]

> Spark SQL window function with order by cause result incorrect
> --------------------------------------------------------------
>
>                 Key: SPARK-33116
>                 URL: https://issues.apache.org/jira/browse/SPARK-33116
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Will Du
>            Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2 Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min result is ok. Why other results are like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p           || count_w   ||
> | drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
> | drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
> | drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
> | game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
> | game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
> | toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
> | toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
> 7 rows selected (0.436 seconds)
> Does it seem that we can only order by the columns after partition by clause?
> I do not think there are such limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org