You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "n-young-db (via GitHub)" <gi...@apache.org> on 2024/01/12 04:15:56 UTC

[PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

n-young-db opened a new pull request, #44699:
URL: https://github.com/apache/spark/pull/44699

   ### What changes were proposed in this pull request?
   
   - Add LocalLimitExec to SparkStrategies in Limit + Offset cases
   - Add UT
   
   
   ### Why are the changes needed?
   
   Originally, `OffsetAndLimit` and `LimitAndOffset` match cases were matching then dropping a LocalLimit node. Adds this LocalLimitExec node to the physical plan to improve efficiency. Note that this was not a correctness bug since not applying LocalLimit only leads to larger intermediate shuffles / nodes.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   UT
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #44699:
URL: https://github.com/apache/spark/pull/44699#issuecomment-1890771531

   Welcome to the Apache Spark community, @n-young-db .
   I added you to the Apache Spark contributor group and assigned SPARK-46693 to you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44699:
URL: https://github.com/apache/spark/pull/44699#issuecomment-1888456760

   also cc @beliefer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "n-young-db (via GitHub)" <gi...@apache.org>.
n-young-db commented on code in PR #44699:
URL: https://github.com/apache/spark/pull/44699#discussion_r1450760704


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala:
##########
@@ -1446,6 +1446,25 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
       }
     }
   }
+
+  test("Limit and offset should not drop LocalLimitExec operator") {
+    val df = sql("SELECT * FROM (SELECT * FROM RANGE(100000) LIMIT 25 OFFSET 3) WHERE id > 1000")

Review Comment:
   @yaooqinn oops; thank you!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #44699: [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset
URL: https://github.com/apache/spark/pull/44699


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #44699:
URL: https://github.com/apache/spark/pull/44699#issuecomment-1890772211

   BTW, @n-young-db , according to the commit log, SPARK-46274 is your contribution, right?
   ```
   [SPARK-46274][SQL] Fix Range operator computeStats() to check long validity before converting
   ```
   
   It seems that it's assigned to someone-else currently. Is there any reason for this contributor mismatch, @n-young-db and @cloud-fan ?
   <img width="1093" alt="Screenshot 2024-01-13 at 1 21 44 PM" src="https://github.com/apache/spark/assets/9700541/7ffcaa29-28ce-41ec-9a41-33b2b0ca97e1">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44699:
URL: https://github.com/apache/spark/pull/44699#issuecomment-1891427238

   @dongjoon-hyun thanks for catching it! it's a mistake, I've updated the assignee.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "n-young-db (via GitHub)" <gi...@apache.org>.
n-young-db commented on PR #44699:
URL: https://github.com/apache/spark/pull/44699#issuecomment-1893000848

   @dongjoon-hyun I hadn't gotten my JIRA account yet so I had someone else create the ticket for me; thank you for catching and fixing this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on code in PR #44699:
URL: https://github.com/apache/spark/pull/44699#discussion_r1449865361


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala:
##########
@@ -1446,6 +1446,25 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
       }
     }
   }
+
+  test("Limit and offset should not drop LocalLimitExec operator") {
+    val df = sql("SELECT * FROM (SELECT * FROM RANGE(100000) LIMIT 25 OFFSET 3) WHERE id > 1000")
+    df.collect()

Review Comment:
   Is the action here even necessary?



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala:
##########
@@ -1446,6 +1446,25 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
       }
     }
   }
+
+  test("Limit and offset should not drop LocalLimitExec operator") {
+    val df = sql("SELECT * FROM (SELECT * FROM RANGE(100000) LIMIT 25 OFFSET 3) WHERE id > 1000")
+    df.collect()
+
+    val planned = df.queryExecution.sparkPlan
+    assert(planned.exists(_.isInstanceOf[GlobalLimitExec]))
+    assert(planned.exists(_.isInstanceOf[LocalLimitExec]))
+  }
+
+  test("Offset and limit should not drop LocalLimitExec operator") {
+    val df = sql("""SELECT * FROM (SELECT * FROM
+      (SELECT * FROM RANGE(100000) LIMIT 25) OFFSET 3) WHERE id > 1000""".stripMargin)
+    df.collect()

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on PR #44699:
URL: https://github.com/apache/spark/pull/44699#issuecomment-1890824400

   late lgtm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on code in PR #44699:
URL: https://github.com/apache/spark/pull/44699#discussion_r1449866973


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala:
##########
@@ -1446,6 +1446,25 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
       }
     }
   }
+
+  test("Limit and offset should not drop LocalLimitExec operator") {
+    val df = sql("SELECT * FROM (SELECT * FROM RANGE(100000) LIMIT 25 OFFSET 3) WHERE id > 1000")

Review Comment:
   Can we minimize RANGE(100000)? AFAIK, this operator is not optimized for skipping data yet



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "n-young-db (via GitHub)" <gi...@apache.org>.
n-young-db commented on code in PR #44699:
URL: https://github.com/apache/spark/pull/44699#discussion_r1449868559


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala:
##########
@@ -1446,6 +1446,25 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
       }
     }
   }
+
+  test("Limit and offset should not drop LocalLimitExec operator") {
+    val df = sql("SELECT * FROM (SELECT * FROM RANGE(100000) LIMIT 25 OFFSET 3) WHERE id > 1000")

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46693][SQL] Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on code in PR #44699:
URL: https://github.com/apache/spark/pull/44699#discussion_r1449875711


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala:
##########
@@ -1446,6 +1446,25 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
       }
     }
   }
+
+  test("Limit and offset should not drop LocalLimitExec operator") {
+    val df = sql("SELECT * FROM (SELECT * FROM RANGE(100000) LIMIT 25 OFFSET 3) WHERE id > 1000")

Review Comment:
   Thank you @n-young-db, can we also adjust the filter condition into a reasonable value in the WHERE clause?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org