You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "wangyum (via GitHub)" <gi...@apache.org> on 2023/11/25 06:35:57 UTC

[PR] [SPARK-46097][SQL] Push down limit 1 though Union and Aggregate [spark]

wangyum opened a new pull request, #44009:
URL: https://github.com/apache/spark/pull/44009

   ### What changes were proposed in this pull request?
   
   This PR enhances `LimitPushDown` to support push down limit 1 though Union and Aggregate.
   
   ### Why are the changes needed?
   
   Eliminate `Aggregate` to improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test and benchmark test.
   
   ```scala
   spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
   spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
   println(spark.sql("SELECT DISTINCT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION ALL SELECT DISTINCT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
   ```
   
   Before this PR | After this PR
   -- | --
   <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/6bca6a7b-440d-49ba-a12a-acb611085862"> | <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/44bec0b7-b709-4cd4-aedb-ae2d8b6317a5">
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46097][SQL] Push down limit 1 though Union and Aggregate [spark]

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on code in PR #44009:
URL: https://github.com/apache/spark/pull/44009#discussion_r1405299999


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -771,6 +771,17 @@ object LimitPushDown extends Rule[LogicalPlan] {
       Limit(le, Project(a.aggregateExpressions, LocalLimit(le, a.child)))
     case Limit(le @ IntegerLiteral(1), p @ Project(_, a: Aggregate)) if a.groupOnly =>
       Limit(le, p.copy(child = Project(a.aggregateExpressions, LocalLimit(le, a.child))))
+    // Push down limit 1 though Union and Aggregate
+    case Limit(le @ IntegerLiteral(1), u: Union) =>
+      val newUnionChildren = u.children.map {
+        case a: Aggregate if a.groupOnly =>

Review Comment:
   We have supported this before:
   https://github.com/apache/spark/blob/ddb36f9e0151d3bff8b9c8a2e3bb14a6f6ece218/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L769-L773



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46097][SQL] Push down limit 1 though Union and Aggregate [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44009:
URL: https://github.com/apache/spark/pull/44009#discussion_r1404844214


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LimitPushdownSuite.scala:
##########
@@ -352,4 +352,21 @@ class LimitPushdownSuite extends PlanTest {
       comparePlans(Optimize.execute(originalQuery2), originalQuery2)
     }
   }
+
+  test("SPARK-46097: Push down limit 1 though Union and Aggregate") {
+    val unionQuery = Union(
+      Union(
+        testRelation.groupBy($"a", $"b")($"a", $"b"),
+        testRelation2.groupBy($"d", $"e")($"d", $"e"),
+      ),
+      testRelation2.groupBy($"e", $"f")($"e", $"f")).limit(1)
+
+    val correctAnswer = Union(
+      Union(
+        LocalLimit(1, testRelation).select($"a", $"b"),
+        LocalLimit(1, testRelation2).select($"d", $"e")).limit(1),
+      LocalLimit(1, testRelation2).select($"e", $"f")).limit(1)
+
+    comparePlans(Optimize.execute(unionQuery.analyze), correctAnswer.analyze)

Review Comment:
   This test case only ensure the logical plan. Could you add a test case to compare the output data?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -771,6 +771,17 @@ object LimitPushDown extends Rule[LogicalPlan] {
       Limit(le, Project(a.aggregateExpressions, LocalLimit(le, a.child)))
     case Limit(le @ IntegerLiteral(1), p @ Project(_, a: Aggregate)) if a.groupOnly =>
       Limit(le, p.copy(child = Project(a.aggregateExpressions, LocalLimit(le, a.child))))
+    // Push down limit 1 though Union and Aggregate
+    case Limit(le @ IntegerLiteral(1), u: Union) =>
+      val newUnionChildren = u.children.map {
+        case a: Aggregate if a.groupOnly =>

Review Comment:
   I doubt that the output will be changed.
   `limit 1 after shuffle` is not the same as `limit 1 before shuffle`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46097][SQL] Push down limit 1 through Union and Aggregate [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44009:
URL: https://github.com/apache/spark/pull/44009#issuecomment-1954636174

   Will this PR continue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org