Posted to reviews@spark.apache.org by "wangyum (via GitHub)" <gi...@apache.org> on 2023/11/01 06:18:06 UTC

[PR] [SPARK-45755][SQL] Push down limit through Dataset.isEmpty() [spark]

wangyum opened a new pull request, #43617:
URL: https://github.com/apache/spark/pull/43617

   ### What changes were proposed in this pull request?
   
   This PR makes `Dataset.isEmpty()` apply a global limit of 1 before executing the query, so that `LimitPushDown` can push the limit down to a lower node and improve query performance. Note that we use a global limit of 1 here because a local limit cannot be pushed down in the group-only case: https://github.com/apache/spark/blob/89ca8b6065e9f690a492c778262080741d50d94d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L766-L770
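   
   For illustration only, the plan shape described above can be approximated with the public `Dataset` API. This is a minimal sketch, not the code changed in this PR: the object name `IsEmptyLimitSketch`, the local session settings, and the small `left`/`right` inputs are all assumptions made for the example. It drops all columns, applies a global limit of 1 on top of a union followed by `distinct()` (a group-only aggregate), and prints the optimized plan so you can check where the limit ends up on your Spark version.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Hypothetical sketch: names and sizes are illustrative, not taken from the PR.
   object IsEmptyLimitSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[2]")
         .appName("isEmpty-limit-sketch")
         .getOrCreate()
   
       // Small stand-ins for two tables combined with UNION (distinct).
       val left = spark.range(1000)
       val right = spark.range(500, 1500)
       val unioned = left.union(right).distinct()
   
       // Approximate the described shape: no columns, then a global limit of 1.
       val limited = unioned.select().limit(1)
   
       // Inspect the optimized plan to see whether the limit moved below the aggregate/union.
       println(limited.queryExecution.optimizedPlan.treeString)
   
       // The emptiness answer must match Dataset.isEmpty regardless of plan shape.
       assert(limited.collect().isEmpty == unioned.isEmpty)
   
       spark.stop()
     }
   }
   ```
   
   Comparing the printed plan with and without `.limit(1)` is a quick way to see which operators the optimizer can cut short.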
   
   ### Why are the changes needed?
   
   Improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Manual testing:
   ```scala
   spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
   spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
   println(spark.sql("SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
   ```
   
   Before this PR | After this PR
   -- | --
   <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/417adc05-4160-4470-b63c-125faac08c9c"> | <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/bdeff231-e725-4c55-9da2-1b4cd59ec8c8">
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1` [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #43617:
URL: https://github.com/apache/spark/pull/43617#issuecomment-1788524697

   I revised the PR title a little, @wangyum. You can change it back if you want.



Re: [PR] [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1` [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer closed pull request #43617: [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1`
URL: https://github.com/apache/spark/pull/43617



Re: [PR] [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1` [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43617:
URL: https://github.com/apache/spark/pull/43617#issuecomment-1788799550

   Merged to master. Thank you, @wangyum.
   Thank you too, @dongjoon-hyun, @HyukjinKwon, and @yaooqinn.



Re: [PR] [SPARK-45755][SQL] Improve `Dataset.isEmpty()` by applying global limit `1` [spark]

Posted by "wangyum (via GitHub)" <gi...@apache.org>.

wangyum commented on PR #43617:
URL: https://github.com/apache/spark/pull/43617#issuecomment-1788529869

   Thank you, @dongjoon-hyun. The new PR title looks better than the previous one.

