Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/21 14:08:47 UTC

[GitHub] [spark] beliefer opened a new pull request, #38745: [WIP][SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows

beliefer opened a new pull request, #38745:
URL: https://github.com/apache/spark/pull/38745

   ### What changes were proposed in this pull request?
   Some queries contain a filter that compares the result of a rank-like window function with a number. For example,
   ```
   SELECT *
   FROM
       (SELECT *,
                ROW_NUMBER() OVER(ORDER BY a) AS rn
       FROM Tab1) t
   WHERE rn <= 5
   ```
   We can create a `Limit(5)` and push it down as the child of `Window`.
   ```
   SELECT *,
            ROW_NUMBER() OVER(ORDER BY a) AS rn
   FROM 
       (SELECT *
       FROM Tab1
       ORDER BY a LIMIT 5) t
   ```
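
The equivalence behind this rewrite can be sketched outside Spark. The following plain-Python model is not Spark code; the helper names `row_number`, `filter_after_window`, and `limit_before_window` and the sample rows are assumptions made for illustration. It numbers rows in `a` order and checks that filtering on `rn <= 5` after the window gives the same rows as sorting, keeping the first 5, and only then numbering them.

```python
# Toy model of the rewrite; all names here are hypothetical, not Spark APIs.

def row_number(rows, key):
    """Return (row, rn) pairs with 1-based numbers in `key` order (stable sort)."""
    ordered = sorted(rows, key=key)
    return [(row, rn) for rn, row in enumerate(ordered, start=1)]


def filter_after_window(rows, key, k):
    """Original plan: number every row, then keep those with rn <= k."""
    return [(row, rn) for row, rn in row_number(rows, key) if rn <= k]


def limit_before_window(rows, key, k):
    """Rewritten plan: ORDER BY ... LIMIT k first, then number the survivors."""
    limited = sorted(rows, key=key)[:k]
    return row_number(limited, key)


# Made-up sample data standing in for Tab1.
tab1 = [{"a": v} for v in (42, 7, 19, 3, 88, 7, 56, 21)]
by_a = lambda row: row["a"]

original = filter_after_window(tab1, by_a, 5)
rewritten = limit_before_window(tab1, by_a, 5)
print(original == rewritten)
```

Python's stable sort makes the handling of ties deterministic in this sketch. In SQL the order of tied rows under `ROW_NUMBER()` is generally not guaranteed, so the two plans should be expected to agree only up to the choice among ties.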
   
   In short, it supports the following pattern:
   ```
   SELECT (... (row_number|rank|dense_rank)() OVER (ORDER BY ...) AS rn)
   WHERE rn (==|<|<=) k
       AND other conditions
   ```
   For these three rank functions (row_number|rank|dense_rank), the rank of a row computed on any subset of the data is always <= its rank computed on the whole dataset, so rows with rank > k can be safely discarded at any point in the plan.
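
One way to read the argument above is that a rank computed on part of the data never exceeds the rank computed on all of it. The plain-Python check below is a sketch of that reading, not the PR's implementation. It assumes `rank` is one plus the number of rows that sort strictly earlier and `dense_rank` is one plus the number of distinct keys that sort strictly earlier; the helper names and sample values are invented for illustration.

```python
# Hypothetical helpers modelling the SQL semantics on a list of sort keys.

def rank(values, v):
    """RANK(): 1 + number of rows whose key sorts strictly before v."""
    return 1 + sum(1 for x in values if x < v)


def dense_rank(values, v):
    """DENSE_RANK(): 1 + number of distinct keys sorting strictly before v."""
    return 1 + len({x for x in values if x < v})


whole = [10, 20, 20, 30, 40, 40, 50]   # made-up full dataset
subset = [20, 30, 40, 50]              # e.g. the rows one task happens to see
k = 2

# A row's rank in the subset never exceeds its rank in the whole dataset.
for fn in (rank, dense_rank):
    assert all(fn(subset, v) <= fn(whole, v) for v in subset)

# Dropping rows whose local rank exceeds k therefore never loses a row
# whose global rank is <= k.
kept_locally = [v for v in subset if rank(subset, v) <= k]
needed_globally = [v for v in subset if rank(whole, v) <= k]
assert set(needed_globally) <= set(kept_locally)
print(kept_locally, needed_globally)
```

In this sketch the local pass keeps one row (30) that the global ranking would reject, which suggests that early pruning only reduces the input and the final filter on the full window result is still needed.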
   
   This PR also takes over some functionality from https://github.com/apache/spark/pull/34367.
   
   
   ### Why are the changes needed?
   To improve performance by reducing the rows that are not required before the window function is evaluated.
   
   **Micro Benchmark**
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   This only changes the internal implementation.
   
   
   ### How was this patch tested?
   New tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #38745: [SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows

Posted by GitBox <gi...@apache.org>.
beliefer commented on PR #38745:
URL: https://github.com/apache/spark/pull/38745#issuecomment-1324464202

   ping @zhengruifeng cc @cloud-fan 




[GitHub] [spark] beliefer commented on a diff in pull request #38745: [SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows

Posted by GitBox <gi...@apache.org>.
beliefer commented on code in PR #38745:
URL: https://github.com/apache/spark/pull/38745#discussion_r1031206732


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala:
##########
@@ -87,7 +88,8 @@ case class WindowExec(
     windowExpression: Seq[NamedExpression],
     partitionSpec: Seq[Expression],
     orderSpec: Seq[SortOrder],
-    child: SparkPlan)
+    child: SparkPlan,
+    groupLimitInfo: Option[(Int, Expression)] = None)

Review Comment:
   I think adding a physical node is OK, but it requires somewhat more code, and the filtering and reduction of data happen somewhat later.





[GitHub] [spark] beliefer commented on pull request #38745: [SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows

Posted by GitBox <gi...@apache.org>.
beliefer commented on PR #38745:
URL: https://github.com/apache/spark/pull/38745#issuecomment-1327276041

   This PR has been replaced by https://github.com/apache/spark/pull/38799




[GitHub] [spark] beliefer closed pull request #38745: [SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows

Posted by GitBox <gi...@apache.org>.
beliefer closed pull request #38745: [SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows
URL: https://github.com/apache/spark/pull/38745




[GitHub] [spark] cloud-fan commented on a diff in pull request #38745: [SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38745:
URL: https://github.com/apache/spark/pull/38745#discussion_r1031159165


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala:
##########
@@ -87,7 +88,8 @@ case class WindowExec(
     windowExpression: Seq[NamedExpression],
     partitionSpec: Seq[Expression],
     orderSpec: Seq[SortOrder],
-    child: SparkPlan)
+    child: SparkPlan,
+    groupLimitInfo: Option[(Int, Expression)] = None)

Review Comment:
   I think it's overkill and very risky to make invasive changes to a fundamental physical operator like `WindowExec`. I prefer https://github.com/apache/spark/pull/34367, which adds a new physical node. Can you elaborate on why this is better than https://github.com/apache/spark/pull/34367?


