Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/28 12:53:00 UTC

[GitHub] [spark] cloud-fan commented on pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

cloud-fan commented on PR #38799:
URL: https://github.com/apache/spark/pull/38799#issuecomment-1329040563

   This PR basically adds a per-window-group limit before and after the shuffle to reduce the input data of window processing.
   
   More specifically, the before-shuffle per-window-group limit:
   1. adds an extra local sort to determine window group boundaries
   2. applies a per-group limit to reduce the data size of the shuffle and of all downstream operators.
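   The two steps above can be sketched roughly as follows; this is an illustrative Python simulation of one map-side partition, not Spark's actual operator, and all names are made up:

```python
from itertools import groupby

def map_side_group_limit(partition, group_key, order_key, k):
    """Keep at most k rows per group within one map-side partition.

    Sketch of the before-shuffle step: an extra local sort on
    (group key, order key) establishes group boundaries, then rows
    beyond rank k in each group are dropped before the shuffle.
    This is correct because a group's global top-k rows can only
    come from that group's per-partition top-k rows.
    """
    rows = sorted(partition, key=lambda r: (group_key(r), order_key(r)))
    out = []
    for _, grp in groupby(rows, key=group_key):
        out.extend(list(grp)[:k])
    return out

# One map-side partition, top-2 per group requested:
part = [("a", 3), ("a", 1), ("a", 5), ("b", 1)]
trimmed = map_side_group_limit(part, lambda r: r[0], lambda r: r[1], 2)
# ("a", 5) is dropped locally, shrinking the shuffled data.
```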
   
   This is beneficial when the per-group data size is large; otherwise, the extra local sort is pure overhead.
   
   The after-shuffle per-window-group limit just applies a per-group limit to reduce the input of window processing. This is very cheap, as it only needs to iterate the sorted data once (the window operator needs to sort its input anyway) and do some row comparisons to determine group boundaries and rank values. It would be more efficient to merge it into the window operator, but that's probably not worth it since the overhead is small.
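   A minimal sketch of that single pass over sorted input, again as illustrative Python rather than Spark code (shown with row_number semantics; rank() with ties would also need to compare order keys):

```python
def group_limit_sorted(sorted_rows, same_group, k):
    """One pass over already-sorted input: compare each row with the
    previous one to detect group boundaries, maintain a rank counter,
    and drop a group's rows once the counter exceeds k.
    """
    prev = None
    rank = 0
    for row in sorted_rows:
        if prev is None or not same_group(prev, row):
            rank = 0  # a new group starts here
        rank += 1
        prev = row
        if rank <= k:
            yield row

# Input already sorted by (group key, order key), as after the shuffle sort:
rows = [("a", 1), ("a", 2), ("a", 3), ("b", 2), ("b", 5)]
kept = list(group_limit_sorted(rows, lambda x, y: x[0] == y[0], 2))
```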
   
   I think the key here is to make sure the before-shuffle per-group limit is very unlikely to introduce regressions. It's very hard to know the per-group data size ahead of time (can CBO help?), and we need some heuristics to trigger it. Some thoughts:
   1. shall we add a config as the threshold of the limit? e.g. if we set the config to 10, then `where rn = 11` won't trigger it.
   2. shall we have a special sort that stops early when the input is mostly distinct? That would mean the per-group data size is small, so we shouldn't apply this optimization.
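   Point 1 could look roughly like this; the config name and the decision function are hypothetical, not an existing Spark config:

```python
# Hypothetical threshold config; any real Spark config name would differ.
GROUP_LIMIT_THRESHOLD = 10

def should_apply_group_limit(rank_limit):
    """Trigger the optimization only when the rank filter keeps at most
    the configured number of rows per group; e.g. a limit of 11 derived
    from `where rn = 11` would not trigger with a threshold of 10.
    """
    return rank_limit <= GROUP_LIMIT_THRESHOLD
```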


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

