Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/09/11 07:18:00 UTC

[jira] [Commented] (SPARK-32096) Improve sorting performance for Spark SQL rank window function

    [ https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194056#comment-17194056 ] 

Apache Spark commented on SPARK-32096:
--------------------------------------

User 'xuzikun2003' has created a pull request for this issue:
https://github.com/apache/spark/pull/29725

> Improve sorting performance for Spark SQL rank window function
> ---------------------------------------------------------------
>
>                 Key: SPARK-32096
>                 URL: https://issues.apache.org/jira/browse/SPARK-32096
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: Any environment that supports Spark.
>            Reporter: Zikun
>            Priority: Major
>         Attachments: windowSortPerf (1).docx
>
>
> The Spark SQL rank window function needs to sort the data in each window partition, and it relies on the execution operator [*_SortExec_*|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala&version=GBsql-2.4&line=37&lineEnd=43&lineStartColumn=1&lineEndColumn=1&lineStyle=plain] to do the sort. During sorting, the window partition key is placed at the front of the sort order, which adds unnecessary comparisons on the partition key. Instead, we can group the rows by partition key first and then sort the rows inside each group without comparing the partition key.
>  
> In Spark SQL, there are two types of sort execution, *_SortExec_* and *_TakeOrderedAndProjectExec_*. *_SortExec_* is a general sorting execution and it does not support top-N sort. *_TakeOrderedAndProjectExec_* is the execution for top-N sort in Spark. The Spark SQL rank window function needs to sort the data locally, and it relies on the execution plan *_SortExec_* to sort the data in each physical data partition. When a filter on the window rank (e.g. rank <= 100) is specified in a user's query, the filter can be pushed down to *_SortExec_* so that *_SortExec_* performs a top-N sort. Right now *_SortExec_* does not support top-N sort, so its capability needs to be extended to support it.
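
The two ideas in the description above (group by partition key before sorting, and stop at the rows that can satisfy a rank filter) can be illustrated outside Spark with a small plain-Python sketch. This is not the code from the pull request and does not use any Spark API. The row layout (partition_key, order_key, payload), the sample data, and the function names are all assumptions made for illustration. The rank filter keeps ties at the cutoff, because rank() gives equal order keys the same rank.

```python
from collections import defaultdict
import heapq

# Hypothetical rows: (partition_key, order_key, payload).
ROWS = [
    ("b", 3, "r1"),
    ("a", 2, "r2"),
    ("a", 1, "r3"),
    ("b", 1, "r4"),
    ("a", 2, "r5"),
    ("b", 2, "r6"),
]


def sort_with_partition_prefix(rows):
    """Baseline: one sort whose key starts with the partition key,
    so every comparison also looks at the partition key."""
    return sorted(rows, key=lambda r: (r[0], r[1]))


def group_rows(rows):
    """Bucket rows by partition key without comparing order keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


def group_then_sort(rows):
    """Group first, then sort each group on the order key only."""
    groups = group_rows(rows)
    for bucket in groups.values():
        bucket.sort(key=lambda r: r[1])
    return dict(groups)


def rank_at_most(rows, n):
    """Per partition, keep the rows whose rank() by order key is <= n.

    The n-th smallest order key is the cutoff. Rows tied with the
    cutoff are kept, so a partition may return more than n rows.
    """
    result = {}
    for key, bucket in group_rows(rows).items():
        smallest = heapq.nsmallest(n, (r[1] for r in bucket))
        if not smallest:
            result[key] = []
            continue
        cutoff = smallest[-1]
        kept = [r for r in bucket if r[1] <= cutoff]
        kept.sort(key=lambda r: r[1])  # only the surviving rows are sorted
        result[key] = kept
    return result
```

In this sketch, group_then_sort(ROWS) yields the same row order as the baseline once the groups are concatenated in partition-key order, and rank_at_most(ROWS, 2) returns three rows for partition "a" because two rows tie at rank 2.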



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
