
Posted to reviews@spark.apache.org by "ulysses-you (via GitHub)" <gi...@apache.org> on 2023/03/13 04:30:42 UTC

[GitHub] [spark] ulysses-you opened a new pull request, #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

ulysses-you opened a new pull request, #40390:
URL: https://github.com/apache/spark/pull/40390

### What changes were proposed in this pull request?

This PR enables `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` by default.

### Why are the changes needed?

All known issues with enabling cache + AQE have been fixed since SPARK-42101, so there is no longer a reason to skip AQE optimization for cached plans.

### Does this PR introduce _any_ user-facing change?

Yes, the default value of the config changes.
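
For readers who want to check or override the flag in their own session, here is a minimal sketch. The config key comes from this PR; the object name, master, and app name are illustrative assumptions, and the snippet assumes a Spark version in which the key is registered.

```scala
import org.apache.spark.sql.SparkSession

object CachedPlanAqeConfig {
  // Config key taken from the PR description.
  val Key = "spark.sql.optimizer.canChangeCachedPlanOutputPartitioning"

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and appName are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cached-plan-aqe-config")
      .getOrCreate()

    // Print whatever default this Spark version ships with.
    println(s"default: ${spark.conf.get(Key)}")

    // Opt out for this session.
    spark.conf.set(Key, "false")
    println(s"after override: ${spark.conf.get(Key)}")

    spark.stop()
  }
}
```

Setting the value at session level keeps the change local to one application instead of altering cluster-wide defaults.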

### How was this patch tested?

Pass CI

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1300808158


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala:
##########
@@ -395,7 +395,10 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
    */
   private def getOrCloneSessionWithConfigsOff(session: SparkSession): SparkSession = {
     if (session.conf.get(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING)) {
-      session
+      // Bucketed scan only has one time overhead but can have multi-times benefits in cache,
+      // so we always do bucketed scan in a cached plan.
+      SparkSession.getOrCloneSessionWithConfigsOff(
+        session, SQLConf.AUTO_BUCKETED_SCAN_ENABLED :: Nil)

Review Comment:
   Thank you!




[GitHub] [spark] cloud-fan commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1732857566

   The benchmark should either disable the conf, or use AQE-aware utils to collect plans (See `AdaptiveSparkPlanHelper`).
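
A sketch of the second option, under assumptions: it contrasts a plain `TreeNode.collect` with the `collect` provided by `AdaptiveSparkPlanHelper`, which also descends into AQE plans and query stages. The object and method names are hypothetical, and this is not the actual benchmark fix.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

// Hypothetical helper object, not part of Spark or of this PR.
object AqeAwareCollect extends AdaptiveSparkPlanHelper {
  // Plain TreeNode.collect: does not look inside AQE wrapper nodes.
  def plainScans(plan: SparkPlan): Seq[InMemoryTableScanExec] =
    plan.collect { case s: InMemoryTableScanExec => s }

  // AdaptiveSparkPlanHelper.collect: also visits AQE plans and query stages.
  def aqeAwareScans(plan: SparkPlan): Seq[InMemoryTableScanExec] =
    collect(plan) { case s: InMemoryTableScanExec => s }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("aqe-aware-collect")
      .getOrCreate()

    val cached = spark.range(100).cache()
    cached.count() // materialize the cache

    val plan = cached.queryExecution.executedPlan
    println(s"plain: ${plainScans(plan).size}, AQE-aware: ${aqeAwareScans(plan).size}")

    spark.stop()
  }
}
```

Code that indexes into the result of a plain `collect` can hit an empty list when the node it expects sits below an AQE wrapper, which is consistent with the `IndexOutOfBoundsException` reported for the benchmark.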



[GitHub] [spark] ulysses-you commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1465534836

   Thank you @dongjoon-hyun, I'm fine with holding this until the next release. I also want to make sure all the tests pass.



Re: [PR] [SPARK-42768][SQL] Enable cached plan apply AQE by default [spark]

Posted by "revans2 (via GitHub)" <gi...@apache.org>.

revans2 commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1767189848

   I just noticed that the docs for CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING were not updated and still say that it is disabled by default. Should I file a new JIRA for this bug?



[GitHub] [spark] cloud-fan commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1526859862

   Hi @dongjoon-hyun, this config was added in Spark 3.2 and we have fixed all the known regressions. I think it's time to turn it on by default in 3.5 to improve the AQE coverage. [SPARK-42101](https://issues.apache.org/jira/browse/SPARK-42101) was only about the first query access and doesn't matter that much.



Re: [PR] [SPARK-42768][SQL] Enable cached plan apply AQE by default [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1767377052

   Lemme make a quick fix



[GitHub] [spark] ulysses-you commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1527065705

   sure, thank you @dongjoon-hyun @cloud-fan 



[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1186973365


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala:
##########
@@ -241,7 +241,8 @@ abstract class DisableUnnecessaryBucketedScanSuite
 
   test("SPARK-33075: not disable bucketed table scan for cached query") {
     withTable("t1") {
-      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true",
+          SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {

Review Comment:
   can you explain the reason a little bit?




[GitHub] [spark] LuciferYang commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1731860178

   @ulysses-you I found that after this PR is merged, `InMemoryColumnarBenchmark` will fail to execute.
   
   ```
   build/sbt "sql/Test/runMain org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark"
   ```
   
   ```
   [error] Exception in thread "main" java.lang.IndexOutOfBoundsException: 0
   [error] 	at scala.collection.LinearSeqOps.apply(LinearSeq.scala:131)
   [error] 	at scala.collection.LinearSeqOps.apply$(LinearSeq.scala:128)
   [error] 	at scala.collection.immutable.List.apply(List.scala:79)
   [error] 	at org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark$.intCache(InMemoryColumnarBenchmark.scala:47)
   [error] 	at org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark$.$anonfun$runBenchmarkSuite$1(InMemoryColumnarBenchmark.scala:68)
   [error] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
   [error] 	at org.apache.spark.benchmark.BenchmarkBase.runBenchmark(BenchmarkBase.scala:42)
   [error] 	at org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark$.runBenchmarkSuite(InMemoryColumnarBenchmark.scala:68)
   [error] 	at org.apache.spark.benchmark.BenchmarkBase.main(BenchmarkBase.scala:72)
   [error] 	at org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark.main(InMemoryColumnarBenchmark.scala)
   [error] Nonzero exit code returned from runner: 1
   [error] (sql / Test / runMain) Nonzero exit code returned from runner: 1
   ```
   
   Should we run `InMemoryColumnarBenchmark` with the configuration `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning=false`?
   
   



[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1186973232


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantSortsSuite.scala:
##########
@@ -124,6 +124,7 @@ abstract class RemoveRedundantSortsSuiteBase
   test("cached sorted data doesn't need to be re-sorted") {
     withSQLConf(SQLConf.REMOVE_REDUNDANT_SORTS_ENABLED.key -> "true") {
       val df = spark.range(1000).select($"id" as "key").sort($"key".desc).cache()
+      df.collect()

Review Comment:
   Got it, we need to trigger AQE execution to get the final plan if the tests check the physical plan.
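
A minimal sketch of that point, assuming AQE is enabled and a Spark version where `AdaptiveSparkPlanExec` exposes `isFinalPlan` publicly: the plan is inspected before and after an action. The object and method names are hypothetical, and the query is a simplified, uncached variant of the one in the test.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// Hypothetical helper object, not part of Spark or of this PR.
object FinalPlanAfterExecution {
  // Report whether the top-level physical plan is an AQE plan and whether it is final.
  def describe(plan: SparkPlan): String = plan match {
    case a: AdaptiveSparkPlanExec => s"AQE plan, isFinalPlan=${a.isFinalPlan}"
    case other => s"non-AQE plan: ${other.nodeName}"
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("final-plan-after-execution")
      .getOrCreate()
    import spark.implicits._

    val df = spark.range(1000).select($"id".as("key")).sort($"key".desc)

    // Before any action, AQE has not run, so the plan is not final yet.
    println(s"before collect: ${describe(df.queryExecution.executedPlan)}")

    // Running an action drives AQE to completion for this query.
    df.collect()
    println(s"after collect:  ${describe(df.queryExecution.executedPlan)}")

    spark.stop()
  }
}
```

Assertions on the physical plan made before the action would see the initial plan rather than the one AQE finally settles on.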




[GitHub] [spark] ulysses-you commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1186998228


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala:
##########
@@ -241,7 +241,8 @@ abstract class DisableUnnecessaryBucketedScanSuite
 
   test("SPARK-33075: not disable bucketed table scan for cached query") {
     withTable("t1") {
-      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true",
+          SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {

Review Comment:
   This behavior is by design; see
   https://github.com/apache/spark/blob/d157b2d9a71c8370612342b298f0e6757c9b9b78/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L61-L70
   
   If we do not want to affect bucketed scan, we need to split this configuration into two: one for AQE and one for bucketed scan.




[GitHub] [spark] cloud-fan commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1687675169

   thanks, merging to master/3.5!



[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1299551915


##########
sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:
##########
@@ -512,6 +512,9 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
    * Verifies that the plan for `df` contains `expected` number of Exchange operators.
    */
   private def verifyNumExchanges(df: DataFrame, expected: Int): Unit = {
+    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+      df.collect()
+    }

Review Comment:
   Can we add a comment to explain why we need to force shuffle join here?




[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1302533010


##########
sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala:
##########
@@ -279,34 +280,36 @@ class DatasetCacheSuite extends QueryTest
     val df2 = Seq(2 -> 2).toDF("i", "j")
     val df3 = Seq(3 -> 3).toDF("i", "j")
 
-    withClue("positive: union by position") {
-      val unionDf = df1.union(df2).select($"i")
-      unionDf.cache()
-      val finalDf = unionDf.union(df3.select($"i"))
-      assert(finalDf.queryExecution.executedPlan.exists(_.isInstanceOf[InMemoryTableScanExec]))
-    }
+    withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
+      withClue("positive: union by position") {
+        val unionDf = df1.union(df2).select($"i")
+        unionDf.cache()
+        val finalDf = unionDf.union(df3.select($"i"))
+        assert(finalDf.queryExecution.executedPlan.exists(_.isInstanceOf[InMemoryTableScanExec]))

Review Comment:
   shall we add a helper function `def exists` for AQE plan checking, so that we don't need to disable `CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING` here?
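
One possible shape for such a helper, as a sketch only: `existsInAqePlan` and the enclosing object are hypothetical names, built on the `find` method that `AdaptiveSparkPlanHelper` already provides. The usage mirrors the union-by-position case from the test above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

// Hypothetical helper object, not the code that was eventually merged.
object AqePlanChecks extends AdaptiveSparkPlanHelper {
  // True if any node satisfies the predicate, including nodes nested
  // inside AQE plans and query stages.
  def existsInAqePlan(plan: SparkPlan)(f: SparkPlan => Boolean): Boolean =
    find(plan)(f).isDefined

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("aqe-plan-checks")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(1 -> 1).toDF("i", "j")
    val df2 = Seq(2 -> 2).toDF("i", "j")
    val df3 = Seq(3 -> 3).toDF("i", "j")

    val unionDf = df1.union(df2).select($"i")
    unionDf.cache()
    val finalDf = unionDf.union(df3.select($"i"))

    val usesCache = existsInAqePlan(finalDf.queryExecution.executedPlan)(
      _.isInstanceOf[InMemoryTableScanExec])
    println(s"plan reads from the cache: $usesCache")

    spark.stop()
  }
}
```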




[GitHub] [spark] dongjoon-hyun commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1606334030

   Thank you for providing the up-to-date context, @ulysses-you ! 



[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1184634927


##########
sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:
##########
@@ -837,7 +840,7 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
       assert(getNumInMemoryRelations(ds) == 2)
 
       val cachedDs = sql(sqlText).cache()
-      assert(getNumInMemoryTablesRecursively(cachedDs.queryExecution.sparkPlan) == 3)
+      assert(getNumInMemoryTablesRecursively(cachedDs.queryExecution.sparkPlan) == 4)

Review Comment:
   Is it fixable? `TableCacheQueryStage` can add the subqueries from the `AdaptiveSparkPlanExec` below it to the current AQE context.




[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1187016592


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala:
##########
@@ -241,7 +241,8 @@ abstract class DisableUnnecessaryBucketedScanSuite
 
   test("SPARK-33075: not disable bucketed table scan for cached query") {
     withTable("t1") {
-      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true",
+          SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {

Review Comment:
   My proposal is: instead of turning off `AUTO_BUCKETED_SCAN_ENABLED` for cached queries, can we update `DisableUnnecessaryBucketedScan` so that it does not disable bucketed scan for queries inside the table cache stage?




[GitHub] [spark] dongjoon-hyun commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1526962876

   Of course, I agree with you, @cloud-fan . Thank you for pinging me.
   
   To @ulysses-you , could you rebase this PR?



[GitHub] [spark] cloud-fan closed pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default
URL: https://github.com/apache/spark/pull/40390



Re: [PR] [SPARK-42768][SQL] Enable cached plan apply AQE by default [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1767381511

   https://github.com/apache/spark/pull/43411



[GitHub] [spark] dongjoon-hyun commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1526963017

   Also, cc @sunchao too



[GitHub] [spark] ulysses-you commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1180112895


##########
sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:
##########
@@ -512,6 +512,9 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
    * Verifies that the plan for `df` contains `expected` number of Exchange operators.
    */
   private def verifyNumExchanges(df: DataFrame, expected: Int): Unit = {
+    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+      df.collect()
+    }

Review Comment:
   This avoids AQE converting the join to a broadcast hash join (BHJ) on first access.
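
A simplified sketch of the idea, not the test helper itself: with `spark.sql.autoBroadcastJoinThreshold` set to `-1`, size-based broadcast joins are disabled, so the join should stay a shuffle join while the plan is executed. The object name, method names, and input data are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

// Hypothetical helper object, not part of Spark or of this PR.
object ForceShuffleJoin extends AdaptiveSparkPlanHelper {
  // Count shuffle exchanges, including those wrapped in AQE query stages.
  def numShuffleExchanges(plan: SparkPlan): Int =
    collect(plan) { case e: ShuffleExchangeExec => e }.size

  // Check for a broadcast hash join anywhere in the (possibly adaptive) plan.
  def hasBroadcastHashJoin(plan: SparkPlan): Boolean =
    find(plan)(_.isInstanceOf[BroadcastHashJoinExec]).isDefined

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("force-shuffle-join")
      .getOrCreate()

    // Disable size-based broadcast joins for this session.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val left = spark.range(100).toDF("k")
    val right = spark.range(200).toDF("k")
    val joined = left.join(right, "k")
    joined.collect() // run the query so AQE produces its final plan

    val plan = joined.queryExecution.executedPlan
    println(s"shuffle exchanges: ${numShuffleExchanges(plan)}")
    println(s"broadcast hash join present: ${hasBroadcastHashJoin(plan)}")

    spark.stop()
  }
}
```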




[GitHub] [spark] ulysses-you commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1605811101

   Thank you @dongjoon-hyun for the reminder. There is an issue, https://github.com/apache/spark/pull/41100, that comes before this PR. I hope both of them can be shipped in Spark 3.5.0.



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1133456778


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1493,15 +1493,14 @@ object SQLConf {
 
   val CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING =
     buildConf("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning")
-      .internal()

Review Comment:
   BTW, you don't need to expose this. As you can see in most legacy configs, `.internal()` is fine.




[GitHub] [spark] ulysses-you commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1180117341


##########
sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:
##########
@@ -837,7 +840,7 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
       assert(getNumInMemoryRelations(ds) == 2)
 
       val cachedDs = sql(sqlText).cache()
-      assert(getNumInMemoryTablesRecursively(cachedDs.queryExecution.sparkPlan) == 3)
+      assert(getNumInMemoryTablesRecursively(cachedDs.queryExecution.sparkPlan) == 4)

Review Comment:
   This is an interesting case: AQE cannot reuse a subquery across two AQE contexts. It seems to be a real corner case, though.




[GitHub] [spark] ulysses-you commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1186665749


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantSortsSuite.scala:
##########
@@ -124,6 +124,7 @@ abstract class RemoveRedundantSortsSuiteBase
   test("cached sorted data doesn't need to be re-sorted") {
     withSQLConf(SQLConf.REMOVE_REDUNDANT_SORTS_ENABLED.key -> "true") {
       val df = spark.range(1000).select($"id" as "key").sort($"key".desc).cache()
+      df.collect()

Review Comment:
   We propagate output partitioning and ordering only once the AQE plan inside the cache is final (`isFinalPlan`), so we should materialize the cache first. For these tests, it is equivalent to adding `resorted.collect()`.
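
The materialization point can be illustrated with a minimal sketch. It assumes a local `SparkSession` on a Spark build that includes this change; the object name `CachedSortSketch` and its `run` method are hypothetical, and the config keys are the string forms of the `SQLConf` entries used in the test. Per the comment above, the second `explain()` is expected to show no extra `Sort` on top of the cached relation, but that should be checked against your own Spark version rather than taken from this sketch.

```scala
import org.apache.spark.sql.SparkSession

object CachedSortSketch {
  // Hypothetical helper: prints the plan of a re-sort over a cached, sorted
  // DataFrame before and after the cache has been materialized.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    spark.conf.set("spark.sql.execution.removeRedundantSorts", "true")
    spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")

    val df = spark.range(1000).select($"id" as "key").sort($"key".desc).cache()

    // Cache not materialized yet: the AQE plan inside the cache is not final.
    df.sort($"key".desc).explain()

    // Run the cached query once so that the inner AQE plan becomes final.
    val rows = df.collect().length.toLong

    // Build a fresh query over the now-materialized cache and inspect its plan.
    df.sort($"key".desc).explain()

    df.unpersist()
    rows
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cached-sort-sketch")
      .getOrCreate()
    try println(s"rows = ${run(spark)}")
    finally spark.stop()
  }
}
```

Comparing the two printed plans is the whole point of the sketch; the returned row count is only there to confirm that the cached query actually ran.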
   
   
   




[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1186986198


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala:
##########
@@ -241,7 +241,8 @@ abstract class DisableUnnecessaryBucketedScanSuite
 
   test("SPARK-33075: not disable bucketed table scan for cached query") {
     withTable("t1") {
-      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true",
+          SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {

Review Comment:
   I think this does expose an issue. It means bucketed table scan inside table cache is always useless because of `DisableUnnecessaryBucketedScan`. Shall we update `DisableUnnecessaryBucketedScan` to take table cache into account?




[GitHub] [spark] ulysses-you commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1300982280


##########
sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:
##########
@@ -512,6 +512,9 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
    * Verifies that the plan for `df` contains `expected` number of Exchange operators.
    */
   private def verifyNumExchanges(df: DataFrame, expected: Int): Unit = {
+    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+      df.collect()
+    }

Review Comment:
   Oh, I missed this comment. If you look at the next line, you will see that this method only checks `ShuffleExchangeExec`, so we should force a shuffle join here to preserve the original intent of the test.
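
A rough standalone version of this check, outside the test suite, might look like the sketch below. The object `ShuffleCountSketch` and its `withConf` helper are hypothetical stand-ins for the suite's `withSQLConf`; `AdaptiveSparkPlanHelper.collect` is the same helper the suite mixes in, and `spark.sql.autoBroadcastJoinThreshold` is the string key behind `SQLConf.AUTO_BROADCASTJOIN_THRESHOLD`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

object ShuffleCountSketch extends AdaptiveSparkPlanHelper {
  // Hypothetical stand-in for withSQLConf: set a config, run the body, restore it.
  def withConf[T](spark: SparkSession, key: String, value: String)(body: => T): T = {
    val previous = spark.conf.getOption(key)
    spark.conf.set(key, value)
    try {
      body
    } finally {
      previous match {
        case Some(v) => spark.conf.set(key, v)
        case None => spark.conf.unset(key)
      }
    }
  }

  // Executes `df` with broadcast joins disabled, then counts the shuffle
  // exchanges in the plan, looking through the AQE wrapper nodes.
  def numShuffles(spark: SparkSession, df: DataFrame): Int = {
    withConf(spark, "spark.sql.autoBroadcastJoinThreshold", "-1") {
      df.collect()
    }
    collect(df.queryExecution.executedPlan) { case e: ShuffleExchangeExec => e }.size
  }
}
```

Setting the threshold to `-1` before the first execution matters because the plan is built lazily: without it, a small join side could be broadcast, and a counter that only looks for `ShuffleExchangeExec` would no longer measure what the test intends.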




[GitHub] [spark] ulysses-you commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1300804182


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala:
##########
@@ -244,7 +244,8 @@ abstract class DisableUnnecessaryBucketedScanSuite
 
   test("SPARK-33075: not disable bucketed table scan for cached query") {

Review Comment:
   addressed



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala:
##########
@@ -395,7 +395,10 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
    */
   private def getOrCloneSessionWithConfigsOff(session: SparkSession): SparkSession = {
     if (session.conf.get(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING)) {
-      session
+      // Bucketed scan only has one time overhead but can have multi-times benefits in cache,
+      // so we always do bucketed scan in a cached plan.
+      SparkSession.getOrCloneSessionWithConfigsOff(
+        session, SQLConf.AUTO_BUCKETED_SCAN_ENABLED :: Nil)

Review Comment:
   addressed
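
To make the intent of the change easier to follow, here is a toy model of the "get or clone with configs off" idea in plain Scala. It is not Spark's implementation: `Session`, `getOrCloneWithConfigsOff`, and the map-based config store are all hypothetical, and the "return the same session when nothing needs turning off" behavior is an assumption inferred from the helper's name.

```scala
object CloneWithConfigsOffSketch {
  // Hypothetical stand-in for a session: an immutable bag of boolean configs.
  final case class Session(confs: Map[String, Boolean])

  // Returns the same session when none of the listed configs is enabled;
  // otherwise returns a copy with the enabled ones forced to false.
  def getOrCloneWithConfigsOff(session: Session, keys: Seq[String]): Session = {
    val enabled = keys.filter(k => session.confs.getOrElse(k, false))
    if (enabled.isEmpty) session
    else session.copy(confs = session.confs ++ enabled.map(_ -> false))
  }

  def main(args: Array[String]): Unit = {
    val autoBucketedScan = "spark.sql.sources.bucketing.autoBucketedScan.enabled"
    val user = Session(Map(autoBucketedScan -> true))
    val forCache = getOrCloneWithConfigsOff(user, Seq(autoBucketedScan))
    // The user's session is untouched; only the clone used for the cached plan changes.
    println(s"user: ${user.confs(autoBucketedScan)}, cache: ${forCache.confs(autoBucketedScan)}")
  }
}
```

In this model, the session used to plan the cached query has automatic bucketed-scan disabling switched off, which matches the code comment in the diff: the bucketed scan is paid for once and can benefit every later read of the cache.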




[GitHub] [spark] ulysses-you commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1186975958


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala:
##########
@@ -241,7 +241,8 @@ abstract class DisableUnnecessaryBucketedScanSuite
 
   test("SPARK-33075: not disable bucketed table scan for cached query") {
     withTable("t1") {
-      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true",
+          SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {

Review Comment:
   It's not about AQE; it affects the rule `DisableUnnecessaryBucketedScan`. If we enable `CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING`, the rule `DisableUnnecessaryBucketedScan` takes effect and changes the output partitioning (which breaks the bucketed scan).
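
One way to observe this interaction is to cache a bucketed table under each setting and compare the printed plans. The sketch below assumes a local `SparkSession` that can write managed tables to its default warehouse directory; the object `BucketedCacheSketch`, the table name `t1`, and the column name `i` are illustrative, and the config keys are the string forms of the two `SQLConf` entries named in the diff. It makes no claim about what the plans will show on a given Spark version.

```scala
import org.apache.spark.sql.SparkSession

object BucketedCacheSketch {
  // Hypothetical helper: writes a small bucketed table, caches it under the
  // given setting, prints the plan of a scan over it, and returns the row count.
  def run(spark: SparkSession, canChange: Boolean): Long = {
    spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")
    spark.conf.set(
      "spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", canChange.toString)

    spark.sql("DROP TABLE IF EXISTS t1")
    spark.range(100).selectExpr("id AS i")
      .write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
    try {
      // The cached plan is built here, so the configs must be set beforehand.
      spark.catalog.cacheTable("t1")
      val n = spark.table("t1").count()
      // Inspect whether the scan under the cached relation is still bucketed.
      spark.table("t1").explain()
      n
    } finally {
      spark.sql("DROP TABLE IF EXISTS t1")
    }
  }
}
```

Calling `run(spark, canChange = false)` and then `run(spark, canChange = true)` and diffing the two plans shows whether the bucketed scan survives inside the cache in each case.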




[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1186657979


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantSortsSuite.scala:
##########
@@ -124,6 +124,7 @@ abstract class RemoveRedundantSortsSuiteBase
   test("cached sorted data doesn't need to be re-sorted") {
     withSQLConf(SQLConf.REMOVE_REDUNDANT_SORTS_ENABLED.key -> "true") {
       val df = spark.range(1000).select($"id" as "key").sort($"key".desc).cache()
+      df.collect()

Review Comment:
   why do we need to trigger table cache here?




[GitHub] [spark] dongjoon-hyun commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1600948428

   Hi, @cloud-fan and @ulysses-you . What is the current status of this PR for Apache Spark 3.5.0? If you believe it's ready, feel free to fix the conflicts and merge this. I trust your decision.
   
   > this config was added in Spark 3.2 and we have fixed all the known regressions, I think it's time to turn it on by default in 3.5 to improve the AQE coverage. [SPARK-42101](https://issues.apache.org/jira/browse/SPARK-42101) was only for the first query access and doesn't matter that much.
   
   



[GitHub] [spark] dongjoon-hyun commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1688904895

   Thank you, @ulysses-you and @cloud-fan .



[GitHub] [spark] dongjoon-hyun commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1731959966

   Thanks. Ya, I also was tracking that, @LuciferYang .



[GitHub] [spark] cloud-fan commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1299552593


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala:
##########
@@ -244,7 +244,8 @@ abstract class DisableUnnecessaryBucketedScanSuite
 
   test("SPARK-33075: not disable bucketed table scan for cached query") {

Review Comment:
   I think this test is obsolete now, since we purposely go against it. Shall we remove it?




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #40390:
URL: https://github.com/apache/spark/pull/40390#discussion_r1299657447


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala:
##########
@@ -395,7 +395,10 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
    */
   private def getOrCloneSessionWithConfigsOff(session: SparkSession): SparkSession = {
     if (session.conf.get(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING)) {
-      session
+      // Bucketed scan only has one time overhead but can have multi-times benefits in cache,
+      // so we always do bucketed scan in a cached plan.
+      SparkSession.getOrCloneSessionWithConfigsOff(
+        session, SQLConf.AUTO_BUCKETED_SCAN_ENABLED :: Nil)

Review Comment:
   Please update the method description too according to this code change.
   
   https://github.com/apache/spark/blob/a21e19b6e7ac4c4c77b39d93a2da2cbe1c88c4c8/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L394


