Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/16 03:01:12 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation

zhengruifeng opened a new pull request, #34367:
URL: https://github.com/apache/spark/pull/34367

   ### What changes were proposed in this pull request?
   Introduce a new node `RankLimit` to filter out unnecessary rows based on the rank computed on a partial dataset.
   
   It supports the following pattern:
   
   ```
    select (... (row_number|rank|dense_rank)() over ( [partition by ...] order by ... ) as rn)
       where rn (==|<|<=) k and other conditions
   ```
   
   For these three rank functions (row_number|rank|dense_rank), the rank of a key computed on a partial dataset is always <= its final rank computed on the whole dataset, so we can safely discard rows with partial rank > `k` anywhere.
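
The property above can be checked with a small plain-Python simulation. This is only a sketch under stated assumptions: the helper names (`row_number`, `rank`, `dense_rank`, `top_k`, `top_k_with_partial_filter`) are hypothetical, rows are plain integers ordered ascending, and nothing here reflects how `RankLimit` is actually implemented in Spark.

```python
import random

def row_number(rows):
    """(value, rank) pairs where each row gets its 1-based sorted position."""
    return [(v, i) for i, v in enumerate(sorted(rows), start=1)]

def rank(rows):
    """(value, rank) pairs where tied rows share the position of the first tie."""
    out = []
    for i, v in enumerate(sorted(rows), start=1):
        r = out[-1][1] if out and out[-1][0] == v else i
        out.append((v, r))
    return out

def dense_rank(rows):
    """(value, rank) pairs where the rank counts distinct values seen so far."""
    out = []
    for v in sorted(rows):
        if not out:
            r = 1
        elif out[-1][0] == v:
            r = out[-1][1]
        else:
            r = out[-1][1] + 1
        out.append((v, r))
    return out

def top_k(rank_fn, rows, k):
    """Values whose rank over `rows` is <= k, in ascending order."""
    return [v for v, r in rank_fn(rows) if r <= k]

def top_k_with_partial_filter(rank_fn, chunks, k):
    """Discard rows with partial rank > k in each chunk, then rank the survivors."""
    survivors = []
    for chunk in chunks:
        survivors.extend(top_k(rank_fn, chunk, k))
    return top_k(rank_fn, survivors, k)

# Hypothetical data: 200 small integers (many ties) split into 4 chunks.
rng = random.Random(0)
data = [rng.randrange(20) for _ in range(200)]
chunks = [data[i::4] for i in range(4)]

for fn in (row_number, rank, dense_rank):
    for k in (1, 3, 10):
        # Filtering early must not change the final top-k result.
        assert top_k_with_partial_filter(fn, chunks, k) == top_k(fn, data, k)
```

The assertions compare the early-filtered result with the result computed over the whole dataset for each of the three functions, with ties included.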
   
   
   
   ### Why are the changes needed?
   1, reduce the shuffle write;
   2, solve the skewed-window problem; a practical case was optimized from 2.5h to 26min.
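
To make the two points above concrete, here is a toy simulation of a skewed window, written in plain Python rather than Spark. It is a sketch under assumptions: the key names, row counts, chunking scheme, and the helpers `local_top_k` and `make_skewed_chunks` are all hypothetical, and only the `row_number`-style case is modelled. It counts how many rows each map-side chunk would hand to the shuffle with and without a local rank filter.

```python
import heapq
from collections import defaultdict

def local_top_k(chunk, k):
    """Keep at most the k smallest values per key within one chunk."""
    by_key = defaultdict(list)
    for key, value in chunk:
        by_key[key].append(value)
    return [(key, v) for key, values in by_key.items()
            for v in heapq.nsmallest(k, values)]

def make_skewed_chunks(num_chunks=4, hot_rows=10_000, cold_keys=10, cold_rows=10):
    """One very large ("hot") key plus a few small keys, dealt round-robin."""
    rows = [("hot", i) for i in range(hot_rows)]
    rows += [(f"cold{c}", i) for c in range(cold_keys) for i in range(cold_rows)]
    return [rows[i::num_chunks] for i in range(num_chunks)]

k = 5
chunks = make_skewed_chunks()

# Without the filter every row is shuffled to the task that owns its key,
# so the single task handling "hot" receives almost the whole dataset.
shuffled_without_filter = sum(len(c) for c in chunks)

# With the filter each chunk forwards at most k rows per key.
filtered = [local_top_k(c, k) for c in chunks]
shuffled_with_filter = sum(len(c) for c in filtered)
hot_rows_after_filter = sum(1 for c in filtered for key, _ in c if key == "hot")

print("rows shuffled without filter:", shuffled_without_filter)
print("rows shuffled with filter:   ", shuffled_with_filter)
print("hot-key rows after filter:   ", hot_rows_after_filter)
```

In this toy setup the hot key contributes at most `k` rows per chunk after filtering, which is the mechanism behind both the smaller shuffle write and the relief for the skewed task.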
   
   
   ### Does this PR introduce _any_ user-facing change?
   A new config is added.
   
   
   ### How was this patch tested?
   
   1, added test suites, and tested practical cases on our production system
   
   2, 10TiB TPC-DS - q67:
   
   Before this PR | After this PR
   --- | ---
   Job Duration=58min|Job Duration=11min
   Stage Duration=50min|Stage Duration=3sec
   Stage Shuffle=58.0 GiB|Stage Shuffle=9.9 MiB
   ![image](https://user-images.githubusercontent.com/7322292/147652153-80890751-1c6d-4c54-8baf-1b036e829ca9.png)|![image](https://user-images.githubusercontent.com/7322292/147652272-128d3013-c2d0-4676-ab79-050d3349d0b2.png)
   ![image](https://user-images.githubusercontent.com/7322292/147808906-ed68e493-d0a3-4134-964a-a037721f4fbb.png)|![image](https://user-images.githubusercontent.com/7322292/147808939-a605f85a-bb31-49fa-9dd9-a9af23ec5df0.png)
   
   
   3, added benchmark:
   
   ```
   
   [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_301-b09 on Linux 5.11.0-41-generic
   [info] Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
   [info] Benchmark Top-K:                                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------------------
   [info] ROW_NUMBER WITHOUT PARTITION                                  10688          11377         664          2.0         509.6       1.0X
   [info] ROW_NUMBER WITHOUT PARTITION (RANKLIMIT Sorting)               2678           2962         137          7.8         127.7       4.0X
   [info] ROW_NUMBER WITHOUT PARTITION (RANKLIMIT TakeOrdered)           1585           1611          19         13.2          75.6       6.7X
   [info] RANK WITHOUT PARTITION                                        11504          12056         406          1.8         548.6       0.9X
   [info] RANK WITHOUT PARTITION (RANKLIMIT)                             3020           3148          89          6.9         144.0       3.5X
   [info] DENSE_RANK WITHOUT PARTITION                                  11728          11915         216          1.8         559.3       0.9X
   [info] DENSE_RANK WITHOUT PARTITION (RANKLIMIT)                       2632           2906         182          8.0         125.5       4.1X
   [info] ROW_NUMBER WITH PARTITION                                     23139          24025         500          0.9        1103.4       0.5X
   [info] ROW_NUMBER WITH PARTITION (RANKLIMIT Sorting)                  7034           7575         361          3.0         335.4       1.5X
   [info] ROW_NUMBER WITH PARTITION (RANKLIMIT TakeOrdered)              5958           6391         311          3.5         284.1       1.8X
   [info] RANK WITH PARTITION                                           24942          26005         795          0.8        1189.4       0.4X
   [info] RANK WITH PARTITION (RANKLIMIT)                                7217           7517         219          2.9         344.1       1.5X
   [info] DENSE_RANK WITH PARTITION                                     24843          26726         221          0.8        1184.6       0.4X
   [info] DENSE_RANK WITH PARTITION (RANKLIMIT)                          7455           7978         560          2.8         355.5       1.4X
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-1261602326

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!



[GitHub] [spark] github-actions[bot] closed pull request #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation
URL: https://github.com/apache/spark/pull/34367



[GitHub] [spark] beliefer commented on pull request #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.

beliefer commented on PR #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-1316464454

   > It has been a long time since I first sent this PR, and I don't have time to work on it. If anyone is interested in this optimization, feel free to take it over. cc @beliefer
   
   OK. Let me see.



[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-1316354088

   It has been a long time since I first sent this PR, and I don't have time to work on it. If anyone is interested in this optimization, feel free to take it over. cc @beliefer



[GitHub] [spark] github-actions[bot] closed pull request #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation
URL: https://github.com/apache/spark/pull/34367

