You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/23 07:46:04 UTC

[GitHub] [spark] c21 opened a new pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

c21 opened a new pull request #30138:
URL: https://github.com/apache/spark/pull/30138


   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This PR is to enable auto bucketed table scan by default, with exception to only disable for cached query (similar to AQE). The reason why disabling auto scan for cached query is that, the cached query output partitioning can be leveraged later to avoid shuffle and sort when doing join and aggregate.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   Enable auto bucketed table scan by default is useful as it can optimize query automatically under the hood, without users interaction.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Added unit test for cached query in `DisableUnnecessaryBucketedScanSuite.scala`. Also change a bunch of unit tests which should disable auto bucketed scan to make them work.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716419788






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511321486



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -1077,6 +1077,27 @@ object SparkSession extends Logging {
       throw new IllegalStateException("No active or default Spark session found")))
   }
 
+  /**
+   * Returns a cloned SparkSession with all specified configurations disabled, or
+   * the original SparkSession if all configurations are already disabled.
+   *
+   * @since 3.1.0
+   */
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       Oh, `private[sql]` is okay too. I don't think we should expose this as public.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715772704






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716403050


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34867/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716360810


   **[Test build #130270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130270/testReport)** for PR 30138 at commit [`070da00`](https://github.com/apache/spark/commit/070da002a72ec78cca772404806c44cd2e07d6b0).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716381964






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r628695013



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       I feel introducing the configs to enable them (i.e. allow user to enable AQE for cached query, or allow user to enable aoth bucketed scan for cached query) is dangerous, as user can cause correctness bug to their pipeline if using them blindly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r628711441



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       @ulysses-you, @maropu - sorry my bad. This and AQE is for performance only, but not correctness. Then i am find with either adding or not adding another config.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715610076


   **[Test build #130216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130216/testReport)** for PR 30138 at commit [`da54eaa`](https://github.com/apache/spark/commit/da54eaaff9483751f8b856bc9e03ef31ca698619).
    * This patch **fails Scala style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715171289


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34797/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716441272


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34870/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716345290


   **[Test build #130262 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130262/testReport)** for PR 30138 at commit [`9e6aaf4`](https://github.com/apache/spark/commit/9e6aaf4193d762e045f51b866e91cf89da753cca).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715628348






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716441301






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715772704






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r510732445



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(
+      session: SparkSession,
+      configurations: Seq[String]): SparkSession = {

Review comment:
       nit: to be more type safe, how about `Seq[ConfigEntry[Boolean]]`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715611786


   **[Test build #130217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130217/testReport)** for PR 30138 at commit [`09c6ca9`](https://github.com/apache/spark/commit/09c6ca9f70f90f9cd1ddd2187896ca5c485e691d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715830786


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130228/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716352824


   **[Test build #130266 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130266/testReport)** for PR 30138 at commit [`9e6aaf4`](https://github.com/apache/spark/commit/9e6aaf4193d762e045f51b866e91cf89da753cca).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715110635


   **[Test build #130197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130197/testReport)** for PR 30138 at commit [`7315be8`](https://github.com/apache/spark/commit/7315be8831abf07f9e761e13fd0d6812f7b2cd25).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715659061






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715200297


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34797/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716499563






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716427614


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34870/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715313658


   **[Test build #130197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130197/testReport)** for PR 30138 at commit [`7315be8`](https://github.com/apache/spark/commit/7315be8831abf07f9e761e13fd0d6812f7b2cd25).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716507682






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716360810


   **[Test build #130270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130270/testReport)** for PR 30138 at commit [`070da00`](https://github.com/apache/spark/commit/070da002a72ec78cca772404806c44cd2e07d6b0).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716345290


   **[Test build #130262 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130262/testReport)** for PR 30138 at commit [`9e6aaf4`](https://github.com/apache/spark/commit/9e6aaf4193d762e045f51b866e91cf89da753cca).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] marceloamaral commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
marceloamaral commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r537373835



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       I know this is old, but why all the configurations (e.g., AQE) must be disabled for CacheManager?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r628719198



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       Could you move this discussion into https://issues.apache.org/jira/browse/SPARK-35332 ? I thinks the jira ticket is related to this topic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511319323



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -1077,6 +1077,27 @@ object SparkSession extends Logging {
       throw new IllegalStateException("No active or default Spark session found")))
   }
 
+  /**
+   * Returns a cloned SparkSession with all specified configurations disabled, or
+   * the original SparkSession if all configurations are already disabled.
+   *
+   * @since 3.1.0
+   */
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       curious why we need to add this? what's the issue we are preventing? also why `private[spark]` but not `private[sql]`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r628719198



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       Could you join the discussion in https://issues.apache.org/jira/browse/SPARK-35332 ? I thinks the jira ticket is related to this topic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716419761


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34867/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511321724



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -1077,6 +1077,27 @@ object SparkSession extends Logging {
       throw new IllegalStateException("No active or default Spark session found")))
   }
 
+  /**
+   * Returns a cloned SparkSession with all specified configurations disabled, or
+   * the original SparkSession if all configurations are already disabled.
+   *
+   * @since 3.1.0
+   */
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       @viirya - sure, updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716498470


   **[Test build #130266 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130266/testReport)** for PR 30138 at commit [`9e6aaf4`](https://github.com/apache/spark/commit/9e6aaf4193d762e045f51b866e91cf89da753cca).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715620490


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34816/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715830756






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715628348






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715658861


   **[Test build #130217 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130217/testReport)** for PR 30138 at commit [`09c6ca9`](https://github.com/apache/spark/commit/09c6ca9f70f90f9cd1ddd2187896ca5c485e691d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716366922


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34862/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511269228



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -1077,6 +1077,27 @@ object SparkSession extends Logging {
       throw new IllegalStateException("No active or default Spark session found")))
   }
 
+  /**
+   * Returns a cloned SparkSession with all specified configurations disabled, or
+   * the original SparkSession if all configurations are already disabled.
+   *
+   * @since 3.1.0
+   */
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       `private[spark]`. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511747673



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -79,10 +92,11 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
     if (lookupCachedData(planToCache).nonEmpty) {
       logWarning("Asked to cache already cached data.")
     } else {
-      // Turn off AQE so that the outputPartitioning of the underlying plan can be leveraged.
-      val sessionWithAqeOff = getOrCloneSessionWithAqeOff(query.sparkSession)
-      val inMemoryRelation = sessionWithAqeOff.withActive {
-        val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
+      // Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.

Review comment:
       @viirya - removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715749893


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34828/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716350869






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511757506



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -1077,6 +1077,27 @@ object SparkSession extends Logging {
       throw new IllegalStateException("No active or default Spark session found")))
   }
 
+  /**
+   * Returns a cloned SparkSession with all specified configurations disabled, or
+   * the original SparkSession if all configurations are already disabled.
+   *
+   * @since 3.1.0

Review comment:
       @cloud-fan - removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716381949


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34862/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715628339


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34817/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715609569


   **[Test build #130216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130216/testReport)** for PR 30138 at commit [`da54eaa`](https://github.com/apache/spark/commit/da54eaaff9483751f8b856bc9e03ef31ca698619).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716350827


   **[Test build #130262 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130262/testReport)** for PR 30138 at commit [`9e6aaf4`](https://github.com/apache/spark/commit/9e6aaf4193d762e045f51b866e91cf89da753cca).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511746390



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -79,10 +92,11 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
     if (lookupCachedData(planToCache).nonEmpty) {
       logWarning("Asked to cache already cached data.")
     } else {
-      // Turn off AQE so that the outputPartitioning of the underlying plan can be leveraged.
-      val sessionWithAqeOff = getOrCloneSessionWithAqeOff(query.sparkSession)
-      val inMemoryRelation = sessionWithAqeOff.withActive {
-        val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
+      // Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.

Review comment:
       nit: this comment seems duplicated with above.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716352824


   **[Test build #130266 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130266/testReport)** for PR 30138 at commit [`9e6aaf4`](https://github.com/apache/spark/commit/9e6aaf4193d762e045f51b866e91cf89da753cca).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716688999


   Thanks @maropu , @viirya and @cloud-fan for review!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715627260


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34816/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715609569


   **[Test build #130216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130216/testReport)** for PR 30138 at commit [`da54eaa`](https://github.com/apache/spark/commit/da54eaaff9483751f8b856bc9e03ef31ca698619).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716419788






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715200285






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511321486



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -1077,6 +1077,27 @@ object SparkSession extends Logging {
       throw new IllegalStateException("No active or default Spark session found")))
   }
 
+  /**
+   * Returns a cloned SparkSession with all specified configurations disabled, or
+   * the original SparkSession if all configurations are already disabled.
+   *
+   * @since 3.1.0
+   */
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       Oh, `private[sql]` is better. I don't think we should expose this as public.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715200285


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r510773956



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(
+      session: SparkSession,
+      configurations: Seq[String]): SparkSession = {

Review comment:
       +1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716441301






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716340433


   @cloud-fan - wondering do you think if the PR is ready to go? Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r510733262



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala
##########
@@ -218,4 +221,27 @@ abstract class DisableUnnecessaryBucketedScanSuite extends QueryTest with SQLTes
       }
     }
   }
+
+  test("SPARK-33075: not disable bucketed table scan for cached query") {
+    withTable("t1") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+        df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
+        sql("CACHE TABLE tempTable AS SELECT i FROM t1")

Review comment:
       why not just `CACHE TABLE t1`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511167105



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(
+      session: SparkSession,
+      configurations: Seq[String]): SparkSession = {

Review comment:
       sure, updated, it's safer.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -55,6 +56,17 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
   @transient @volatile
   private var cachedData = IndexedSeq[CachedData]()
 
+  /**
+   * Configurations needs to be turned off, to avoid regression for cached query, so that the
+   * outputPartitioning of the underlying cached query plan can be leveraged later.
+   * Configurations include:
+   * 1. AQE
+   * 2. Automatic bucketed table scan
+   */
+  private val configsOff = Seq(

Review comment:
       @maropu - sure, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       Sounds good it makes sense to me, moved to `object SparkSessoin`.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -79,10 +91,11 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
     if (lookupCachedData(planToCache).nonEmpty) {
       logWarning("Asked to cache already cached data.")
     } else {
-      // Turn off AQE so that the outputPartitioning of the underlying plan can be leveraged.
-      val sessionWithAqeOff = getOrCloneSessionWithAqeOff(query.sparkSession)
-      val inMemoryRelation = sessionWithAqeOff.withActive {
-        val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
+      // Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.
+      val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(

Review comment:
       @maropu - updated.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala
##########
@@ -218,4 +221,27 @@ abstract class DisableUnnecessaryBucketedScanSuite extends QueryTest with SQLTes
       }
     }
   }
+
+  test("SPARK-33075: not disable bucketed table scan for cached query") {
+    withTable("t1") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+        df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
+        sql("CACHE TABLE tempTable AS SELECT i FROM t1")

Review comment:
       @cloud-fan - sure, updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715314545






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715611786


   **[Test build #130217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130217/testReport)** for PR 30138 at commit [`09c6ca9`](https://github.com/apache/spark/commit/09c6ca9f70f90f9cd1ddd2187896ca5c485e691d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715610090






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716351287


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r628711183



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       Is this related to correctness? I thought this was performance related because they can change output partitions implicitly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715772657


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34828/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715659061






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715829881


   **[Test build #130228 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130228/testReport)** for PR 30138 at commit [`8c1d11e`](https://github.com/apache/spark/commit/8c1d11e31c815265fd3b9002b3d9ac3bd2cd6804).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715627264






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716507682






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716499563






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r510740768



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala
##########
@@ -218,4 +221,27 @@ abstract class DisableUnnecessaryBucketedScanSuite extends QueryTest with SQLTes
       }
     }
   }
+
+  test("SPARK-33075: not disable bucketed table scan for cached query") {
+    withTable("t1") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+        df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
+        sql("CACHE TABLE tempTable AS SELECT i FROM t1")

Review comment:
       it can also save the `uncache` at the end, as the table will be dropped at the end.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716381964






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716506401


   **[Test build #130270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130270/testReport)** for PR 30138 at commit [`070da00`](https://github.com/apache/spark/commit/070da002a72ec78cca772404806c44cd2e07d6b0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715110635


   **[Test build #130197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130197/testReport)** for PR 30138 at commit [`7315be8`](https://github.com/apache/spark/commit/7315be8831abf07f9e761e13fd0d6812f7b2cd25).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715314545






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r510736242



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala
##########
@@ -218,4 +221,27 @@ abstract class DisableUnnecessaryBucketedScanSuite extends QueryTest with SQLTes
       }
     }
   }
+
+  test("SPARK-33075: not disable bucketed table scan for cached query") {
+    withTable("t1") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+        df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
+        sql("CACHE TABLE tempTable AS SELECT i FROM t1")

Review comment:
       Either way is fine for me, if you think it's too redundant I can also change that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716350869






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715627264






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715610090


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu closed pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
maropu closed pull request #30138:
URL: https://github.com/apache/spark/pull/30138


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r628686816



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       @maropu Sorry for the late. 
   
   Can we make these disable configs more public ? i.e. expose a new SQL config to user, and the exists value as defualt.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715200267


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34797/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r510816024



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -55,6 +56,17 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
   @transient @volatile
   private var cachedData = IndexedSeq[CachedData]()
 
+  /**
+   * Configurations needs to be turned off, to avoid regression for cached query, so that the
+   * outputPartitioning of the underlying cached query plan can be leveraged later.
+   * Configurations include:
+   * 1. AQE
+   * 2. Automatic bucketed table scan
+   */
+  private val configsOff = Seq(

Review comment:
       nit: How about `configsOff` -> `forceDisableConfigs`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -79,10 +91,11 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
     if (lookupCachedData(planToCache).nonEmpty) {
       logWarning("Asked to cache already cached data.")
     } else {
-      // Turn off AQE so that the outputPartitioning of the underlying plan can be leveraged.
-      val sessionWithAqeOff = getOrCloneSessionWithAqeOff(query.sparkSession)
-      val inMemoryRelation = sessionWithAqeOff.withActive {
-        val qe = sessionWithAqeOff.sessionState.executePlan(planToCache)
+      // Turn off configs so that the outputPartitioning of the underlying plan can be leveraged.
+      val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(

Review comment:
       nit:  it seems we don't this line break;
   ```
         val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(query.sparkSession, configsOff)
   ```

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       Since this method is not only for AQE now, could you move this method into a more suitable place, e.g., `object SparkSessoin` or somewhere?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715710242


   **[Test build #130228 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130228/testReport)** for PR 30138 at commit [`8c1d11e`](https://github.com/apache/spark/commit/8c1d11e31c815265fd3b9002b3d9ac3bd2cd6804).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715610098


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130216/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715097663


   cc @cloud-fan , @maropu and @viirya if you guys have time to take a look, thanks. This is the followup from https://github.com/apache/spark/pull/29804 .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715622938


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34817/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-716485477


   Thanks! Merged to master.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r628711132



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       @c21 I have such thought at first, but cann't find a negative case. Can you point out a case that can cause correctness bug ? 
   
   If I don's miss something, it just affect the perfermance about extra shuffle. For correctness, let's assuming a cache plan with AQE enabled:
   * For lazy cache. the AQE framework will ensure the correctness of the new query with the cached plan .
   * For force cache. if the output paritioning or ordering of cached plan has been affected by AQE then Spark will use `EnsureRequirements` to promise the correctness.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r510997364



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       +1 move to other general object.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(
+      session: SparkSession,
+      configurations: Seq[String]): SparkSession = {

Review comment:
       +1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r537488555



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala
##########
@@ -139,15 +138,21 @@ trait AdaptiveSparkPlanHelper {
   }
 
   /**
-   * Returns a cloned [[SparkSession]] with adaptive execution disabled, or the original
-   * [[SparkSession]] if its adaptive execution is already disabled.
+   * Returns a cloned [[SparkSession]] with all specified configurations disabled, or
+   * the original [[SparkSession]] if all configurations are already disabled.
    */
-  def getOrCloneSessionWithAqeOff[T](session: SparkSession): SparkSession = {
-    if (!session.sessionState.conf.adaptiveExecutionEnabled) {
+  def getOrCloneSessionWithConfigsOff(

Review comment:
       That's because performance regression can happen. Could you check the previous discussion, e.g., https://github.com/apache/spark/pull/29804#issuecomment-701110070 ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r511754456



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -1077,6 +1077,27 @@ object SparkSession extends Logging {
       throw new IllegalStateException("No active or default Spark session found")))
   }
 
+  /**
+   * Returns a cloned SparkSession with all specified configurations disabled, or
+   * the original SparkSession if all configurations are already disabled.
+   *
+   * @since 3.1.0

Review comment:
       this is not needed for internal APIs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715830756


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30138:
URL: https://github.com/apache/spark/pull/30138#issuecomment-715710242


   **[Test build #130228 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130228/testReport)** for PR 30138 at commit [`8c1d11e`](https://github.com/apache/spark/commit/8c1d11e31c815265fd3b9002b3d9ac3bd2cd6804).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #30138: [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30138:
URL: https://github.com/apache/spark/pull/30138#discussion_r510740509



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala
##########
@@ -218,4 +221,27 @@ abstract class DisableUnnecessaryBucketedScanSuite extends QueryTest with SQLTes
       }
     }
   }
+
+  test("SPARK-33075: not disable bucketed table scan for cached query") {
+    withTable("t1") {
+      withSQLConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED.key -> "true") {
+        df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
+        sql("CACHE TABLE tempTable AS SELECT i FROM t1")

Review comment:
       yea let's be simpler.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org