You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/05 08:06:36 UTC

[GitHub] [spark] AngersZhuuuu opened a new pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

AngersZhuuuu opened a new pull request #31485:
URL: https://github.com/apache/spark/pull/31485


   
   ### What changes were proposed in this pull request?
   When explain SQL with cost, treeString about subquery won't show it's statistics:
   
   How to reproduce:
   ```
   spark.sql("create table t1 using parquet as select id as a, id as b from range(1000)")
   spark.sql("create table t2 using parquet as select id as c, id as d from range(2000)")
   
   spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
   spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
   spark.sql("set spark.sql.cbo.enabled=true")
   
   spark.sql(
     """
       |WITH max_store_sales AS
       |  (SELECT max(csales) tpcds_cmax
       |  FROM (SELECT
       |    sum(b) csales
       |  FROM t1 WHERE a < 100 ) x),
       |best_ss_customer AS
       |  (SELECT
       |    c
       |  FROM t2
       |  WHERE d > (SELECT * FROM max_store_sales))
       |
       |SELECT c FROM best_ss_customer
       |""".stripMargin).explain("cost")
   ```
   Before this PR's output:
   ```
   == Optimized Logical Plan ==
   Project [c#4263L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
   +- Filter (isnotnull(d#4264L) AND (d#4264L > scalar-subquery#4262 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
      :  +- Aggregate [max(csales#4260L) AS tpcds_cmax#4261L]
      :     +- Aggregate [sum(b#4266L) AS csales#4260L]
      :        +- Project [b#4266L]
      :           +- Filter ((a#4265L < 100) AND isnotnull(a#4265L))
      :              +- Relation default.t1[a#4265L,b#4266L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
      +- Relation default.t2[c#4263L,d#4264L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   Another case is TPC-DS q23a.
   ```
   
   After this pr:
   ```
   == Optimized Logical Plan ==
   Project [c#4481L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
   +- Filter (isnotnull(d#4482L) AND (d#4482L > scalar-subquery#4480 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
      :  +- Aggregate [max(csales#4478L) AS tpcds_cmax#4479L], Statistics(sizeInBytes=16.0 B, rowCount=1)
      :     +- Aggregate [sum(b#4484L) AS csales#4478L], Statistics(sizeInBytes=16.0 B, rowCount=1)
      :        +- Project [b#4484L], Statistics(sizeInBytes=1616.0 B, rowCount=101)
      :           +- Filter (isnotnull(a#4483L) AND (a#4483L < 100)), Statistics(sizeInBytes=2.4 KiB, rowCount=101)
      :              +- Relation[a#4483L,b#4484L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
      +- Relation[c#4481L,d#4482L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   
   ```
   
   ### Why are the changes needed?
   Complete explain treeString's statistics
   
   ### Does this PR introduce _any_ user-facing change?
   When user use explain with cost mode, user can see subquery's statistic too.
   
   
   ### How was this patch tested?
   Working


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775635914


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39638/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774602899


   **[Test build #134978 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134978/testReport)** for PR 31485 at commit [`f0c5c0a`](https://github.com/apache/spark/commit/f0c5c0a79ae76276995372afa3d3362f6a7586b1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774586196


   **[Test build #134968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134968/testReport)** for PR 31485 at commit [`8cbb20a`](https://github.com/apache/spark/commit/8cbb20a6cf3c71b22a7fd84012230d8ca876809a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774647551


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134978/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-773909514


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39506/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571993370



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:
       If it looks too weird, we can move this to `stringWithStats`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775898418


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135067/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775624853


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39638/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775277979


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39618/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775191582


   **[Test build #135031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135031/testReport)** for PR 31485 at commit [`a338b47`](https://github.com/apache/spark/commit/a338b475d3fe1c6770fa192358d5ec845ecee9cb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775219580


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39615/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-773875429






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572123315



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      df.queryExecution.stringWithStats

Review comment:
       > can we create `explain-cbo.sql` and move this test to there?
   
   Explain will carry some local info such as 
   ```
         +- FileScan parquet default.t2[c#xL,d#xL] Batched: true, DataFilters: [isnotnull(d#xL)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/sql/core/spark..., PartitionFilters: [], PushedFilters: [IsNotNull(d)], ReadSchema: struct<c:bigint,d:bigint>
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774632284


   **[Test build #134968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134968/testReport)** for PR 31485 at commit [`8cbb20a`](https://github.com/apache/spark/commit/8cbb20a6cf3c71b22a7fd84012230d8ca876809a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775791227


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39649/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775377002


   **[Test build #135041 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135041/testReport)** for PR 31485 at commit [`29916c8`](https://github.com/apache/spark/commit/29916c899795997befc8ad8944200ceeea8b5a7d).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-773931655


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39506/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775635903


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39638/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774497964


   > Looks good. Is it possible to add a test in `StatisticsCollectionSuite`?
   
   How about current UT?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-773875429


   **[Test build #134921 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134921/testReport)** for PR 31485 at commit [`eacb182`](https://github.com/apache/spark/commit/eacb182ab26373bfd6aab8c8503919e59d208fe7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775635914


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39638/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572000061



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {

Review comment:
       > If we don't care about iteration order, `transformExpressions` should be better.
   
   Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775880547


   **[Test build #135067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135067/testReport)** for PR 31485 at commit [`6afa78b`](https://github.com/apache/spark/commit/6afa78bf5586aaf2ba64ddc94aedbe6d2fa80dfa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571539592



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,15 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        subqueryExpression.plan.stats

Review comment:
       Can you also add a comment that this is to trigger stats propagation?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774615619


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39561/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572125668



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")

Review comment:
       > nit: use `withSQLConf`
   
   DOne




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572535922



##########
File path: sql/core/src/test/resources/sql-tests/results/explain-cbo.sql.out
##########
@@ -0,0 +1,80 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 5
+
+
+-- !query
+CREATE TABLE t1(a INT, b INT) USING PARQUET
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+CREATE TABLE t2(c INT, d INT) USING PARQUET
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+EXPLAIN COST WITH max_store_sales AS
+(
+  SELECT max(csales) tpcds_cmax
+  FROM (
+    SELECT sum(b) csales
+    FROM t1 WHERE a < 100
+  ) x
+),
+best_ss_customer AS
+(
+  SELECT c
+  FROM t2
+  WHERE d > (SELECT * FROM max_store_sales)
+)
+SELECT c FROM best_ss_customer
+-- !query schema
+struct<plan:string>
+-- !query output
+== Optimized Logical Plan ==
+Project [c#x], Statistics(sizeInBytes=1.0 B, rowCount=0)
++- Filter (isnotnull(d#x) AND (cast(d#x as bigint) > scalar-subquery#x [])), Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :  +- Aggregate [max(csales#xL) AS tpcds_cmax#xL], Statistics(sizeInBytes=16.0 B, rowCount=1)
+   :     +- Aggregate [sum(b#x) AS csales#xL], Statistics(sizeInBytes=16.0 B, rowCount=1)
+   :        +- Project [b#x], Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :           +- Filter (isnotnull(a#x) AND (a#x < 100)), Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :              +- Relation[a#x,b#x] parquet, Statistics(sizeInBytes=1.0 B, rowCount=0)
+   +- Relation[c#x,d#x] parquet, Statistics(sizeInBytes=1.0 B, rowCount=0)
+
+== Physical Plan ==
+AdaptiveSparkPlan isFinalPlan=false
++- Project [c#x]
+   +- Filter (isnotnull(d#x) AND (cast(d#x as bigint) > Subquery subquery#x, [id=#x]))
+      :  +- Subquery subquery#x, [id=#x]
+      :     +- AdaptiveSparkPlan isFinalPlan=false
+      :        +- HashAggregate(keys=[], functions=[max(csales#xL)], output=[tpcds_cmax#xL])
+      :           +- HashAggregate(keys=[], functions=[partial_max(csales#xL)], output=[max#xL])
+      :              +- HashAggregate(keys=[], functions=[sum(b#x)], output=[csales#xL])
+      :                 +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#x]
+      :                    +- HashAggregate(keys=[], functions=[partial_sum(b#x)], output=[sum#xL])
+      :                       +- Project [b#x]
+      :                          +- Filter (isnotnull(a#x) AND (a#x < 100))
+      :                             +- FileScan parquet default.t1[a#x,b#x] Batched: true, DataFilters: [isnotnull(a#x), (a#x < 100)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/sql/core/spark..., PartitionFilters: [], PushedFilters: [IsNotNull(a), LessThan(a,100)], ReadSchema: struct<a:int,b:int>

Review comment:
       > The Physical Plan should be normalized.
   
   Hmmm. got the usage, updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775363699


   **[Test build #135031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135031/testReport)** for PR 31485 at commit [`a338b47`](https://github.com/apache/spark/commit/a338b475d3fe1c6770fa192358d5ec845ecee9cb).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774929004


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39587/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775672022






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774901374


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39587/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571988475



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:
       This is weird. Doesn't EXPLAIN trigger the plan stats propagation?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775392322


   **[Test build #135035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135035/testReport)** for PR 31485 at commit [`0285c88`](https://github.com/apache/spark/commit/0285c88c3ee9d9bb9988cfd2c3627e27a52b2f8d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775034585


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135004/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775650148


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39641/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775770518


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39649/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571539632



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +680,50 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      val optimizedPlan = df.queryExecution.optimizedPlan
+      optimizedPlan.stats
+      val subqueryExpression = new ArrayBuffer[SubqueryExpression]()

Review comment:
       nit `ArrayBuffer.empty[SubqueryExpression]`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775735337


   **[Test build #135067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135067/testReport)** for PR 31485 at commit [`6afa78b`](https://github.com/apache/spark/commit/6afa78bf5586aaf2ba64ddc94aedbe6d2fa80dfa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775191582


   **[Test build #135031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135031/testReport)** for PR 31485 at commit [`a338b47`](https://github.com/apache/spark/commit/a338b475d3fe1c6770fa192358d5ec845ecee9cb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775221214


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39615/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775419236






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775791227


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39649/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572518676



##########
File path: sql/core/src/test/resources/sql-tests/inputs/explain-cbo.sql
##########
@@ -0,0 +1,25 @@
+CREATE TABLE t1(a INT, b INT) USING PARQUET;
+CREATE TABLE t2(c INT, d INT) USING PARQUET;
+
+ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS;
+ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS;
+
+SET spark.sql.cbo.enabled=true;

Review comment:
       > We can add `--SET spark.sql.cbo.enabled=true` at the first line of this file.
   
   Updated




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774610085


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39561/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775635016


   **[Test build #135059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135059/testReport)** for PR 31485 at commit [`e6291bc`](https://github.com/apache/spark/commit/e6291bce1dbe0faf5f3836a13eb7530c07d420e7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774026363


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134921/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571987869



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {

Review comment:
       If we don't care about iteration order, `transformExpressions` should be better.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775611880


   **[Test build #135054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135054/testReport)** for PR 31485 at commit [`b29b2dc`](https://github.com/apache/spark/commit/b29b2dc9fb581bb8ad504ae9bcfe4a9ca769808a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775635016


   **[Test build #135059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135059/testReport)** for PR 31485 at commit [`e6291bc`](https://github.com/apache/spark/commit/e6291bce1dbe0faf5f3836a13eb7530c07d420e7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775231660


   **[Test build #135035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135035/testReport)** for PR 31485 at commit [`0285c88`](https://github.com/apache/spark/commit/0285c88c3ee9d9bb9988cfd2c3627e27a52b2f8d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775231660


   **[Test build #135035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135035/testReport)** for PR 31485 at commit [`0285c88`](https://github.com/apache/spark/commit/0285c88c3ee9d9bb9988cfd2c3627e27a52b2f8d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571992802



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:
       > This is weird. Doesn't EXPLAIN trigger the plan stats propagation?
   
   Yea, EXPLAIN trigger plan stats before build string:
   ```
     private def stringWithStats(maxFields: Int, append: String => Unit): Unit = {
       val maxFields = SQLConf.get.maxToStringFields
   
       // trigger to compute stats for logical plans
       try {
         optimizedPlan.stats
       } catch {
         case e: AnalysisException => append(e.toString + "\n")
       }
       // only show optimized logical plan and physical plan
       append("== Optimized Logical Plan ==\n")
       QueryPlan.append(optimizedPlan, append, verbose = true, addSuffix = true, maxFields)
       append("\n== Physical Plan ==\n")
       QueryPlan.append(executedPlan, append, verbose = true, addSuffix = false, maxFields)
       append("\n")
     }
   ```
   
   Have tested that in subquery, all behavior about statistic is right since when it use statistcs,it will call `plan.stats`.
   We do it here just trigger it earlier.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775215932


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39615/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774025410


   **[Test build #134921 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134921/testReport)** for PR 31485 at commit [`eacb182`](https://github.com/apache/spark/commit/eacb182ab26373bfd6aab8c8503919e59d208fe7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774602899


   **[Test build #134978 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134978/testReport)** for PR 31485 at commit [`f0c5c0a`](https://github.com/apache/spark/commit/f0c5c0a79ae76276995372afa3d3362f6a7586b1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu edited a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu edited a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774843305


   > ```sql
   > create table t1 using parquet as select id as a, id as b, id as c from range(1000000);
   > create table t2 using parquet as select id as a, id as b, id as c from range(1000000);
   > create table t3 using parquet as select id as a, id as b, id as c from range(1000000);
   > analyze table t1 compute statistics for all columns;
   > analyze table t2 compute statistics for all columns;
   > analyze table t3 compute statistics for all columns;
   > 
   > set spark.sql.cbo.enabled=true;
   > explain cost
   > select * from t3 where c > (select max(t1.c) as tc from t1 join t2 on t1.a = t2.a and t2.b < 10);
   > ```
   
   
   After current pr the physic plan is 
   ```
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Filter (isnotnull(c#9787L) AND (c#9787L > Subquery subquery#9784, [id=#246]))
      :  +- Subquery subquery#9784, [id=#246]
      :     +- AdaptiveSparkPlan isFinalPlan=false
      :        +- HashAggregate(keys=[], functions=[max(c#9790L)], output=[tc#9783L])
      :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#244]
      :              +- HashAggregate(keys=[], functions=[partial_max(c#9790L)], output=[max#9799L])
      :                 +- Project [c#9790L]
      :                    +- BroadcastHashJoin [a#9788L], [a#9791L], Inner, BuildRight, false
      :                       :- Filter isnotnull(a#9788L)
      :                       :  +- FileScan parquet default.t1[a#9788L,c#9790L] Batched: true, DataFilters: [isnotnull(a#9788L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,c:bigint>
      :                       +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#239]
      :                          +- Project [a#9791L]
      :                             +- Filter ((isnotnull(b#9792L) AND (b#9792L < 10)) AND isnotnull(a#9791L))
      :                                +- FileScan parquet default.t2[a#9791L,b#9792L] Batched: true, DataFilters: [isnotnull(b#9792L), (b#9792L < 10), isnotnull(a#9791L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(b), LessThan(b,10), IsNotNull(a)], ReadSchema: struct<a:bigint,b:bigint>
      +- FileScan parquet default.t3[a#9785L,b#9786L,c#9787L] Batched: true, DataFilters: [isnotnull(c#9787L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c)], ReadSchema: struct<a:bigint,b:bigint,c:bigint>
   ```
   
   Change broadcast threshold
   ```
   set spark.sql.autoBroadcastJoinThreshold=10
   ```
   the plan will be 
   ```
   
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Filter (isnotnull(c#9797L) AND (c#9797L > Subquery subquery#9794, [id=#253]))
      :  +- Subquery subquery#9794, [id=#253]
      :     +- AdaptiveSparkPlan isFinalPlan=false
      :        +- HashAggregate(keys=[], functions=[max(c#9800L)])
      :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#251]
      :              +- HashAggregate(keys=[], functions=[partial_max(c#9800L)])
      :                 +- Project [c#9800L]
      :                    +- SortMergeJoin [a#9798L], [a#9801L], Inner
      :                       :- Sort [a#9798L ASC NULLS FIRST], false, 0
      :                       :  +- Exchange hashpartitioning(a#9798L, 5), ENSURE_REQUIREMENTS, [id=#243]
      :                       :     +- Filter isnotnull(a#9798L)
      :                       :        +- FileScan parquet default.t1[a#9798L,c#9800L] Batched: true, DataFilters: [isnotnull(a#9798L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,c:bigint>
      :                       +- Sort [a#9801L ASC NULLS FIRST], false, 0
      :                          +- Exchange hashpartitioning(a#9801L, 5), ENSURE_REQUIREMENTS, [id=#244]
      :                             +- Project [a#9801L]
      :                                +- Filter ((isnotnull(b#9802L) AND (b#9802L < 10)) AND isnotnull(a#9801L))
      :                                   +- FileScan parquet default.t2[a#9801L,b#9802L] Batched: true, DataFilters: [isnotnull(b#9802L), (b#9802L < 10), isnotnull(a#9801L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(b), LessThan(b,10), IsNotNull(a)], ReadSchema: struct<a:bigint,b:bigint>
      +- FileScan parquet default.t3[a#9795L,b#9796L,c#9797L] Batched: true, DataFilters: [isnotnull(c#9797L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c)], ReadSchema: struct<a:bigint,b:bigint,c:bigint>
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775380294






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572052721



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##########
@@ -253,6 +254,15 @@ class QueryExecution(
 
     // trigger to compute stats for logical plans
     try {
+      optimizedPlan.transform {

Review comment:
       shall we use `foreach` instead of `transform`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775419235






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775898418


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135067/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774619558


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39561/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774910167


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39587/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774797831


   Looks fine too. cc @maryannxue and @cloud-fan FYI


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #31485:
URL: https://github.com/apache/spark/pull/31485


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572014419



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:
       Yea let's move it to `stringWithStats`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571777858



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +680,50 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      val optimizedPlan = df.queryExecution.optimizedPlan
+      optimizedPlan.stats
+      val subqueryExpression = ArrayBuffer.empty[SubqueryExpression]

Review comment:
       nit: `mutable.ArrayBuffer.empty[SubqueryExpression]`
   
   (since we already imported `mutable` package)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571540424



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,15 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        subqueryExpression.plan.stats

Review comment:
       > Can you also add a comment that this is to trigger stats propagation?
   
   Done

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +680,50 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      val optimizedPlan = df.queryExecution.optimizedPlan
+      optimizedPlan.stats
+      val subqueryExpression = new ArrayBuffer[SubqueryExpression]()

Review comment:
       > nit `ArrayBuffer.empty[SubqueryExpression]`
   
   Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775669180


   **[Test build #135059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135059/testReport)** for PR 31485 at commit [`e6291bc`](https://github.com/apache/spark/commit/e6291bce1dbe0faf5f3836a13eb7530c07d420e7).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774590782


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39549/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774878636


   **[Test build #135004 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135004/testReport)** for PR 31485 at commit [`89783c1`](https://github.com/apache/spark/commit/89783c18fdfae87d398a37438975843a4f64274d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774586196


   **[Test build #134968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134968/testReport)** for PR 31485 at commit [`8cbb20a`](https://github.com/apache/spark/commit/8cbb20a6cf3c71b22a7fd84012230d8ca876809a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572073126



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##########
@@ -253,6 +254,15 @@ class QueryExecution(
 
     // trigger to compute stats for logical plans
     try {
+      optimizedPlan.transform {
+        case p => p.transformExpressions {

Review comment:
       > ditto, use `p.expressions.foreach(_.foreach ...)` instead.
   
   Updated, how about current.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774878636


   **[Test build #135004 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135004/testReport)** for PR 31485 at commit [`89783c1`](https://github.com/apache/spark/commit/89783c18fdfae87d398a37438975843a4f64274d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-773875429


   **[Test build #134921 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134921/testReport)** for PR 31485 at commit [`eacb182`](https://github.com/apache/spark/commit/eacb182ab26373bfd6aab8c8503919e59d208fe7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775672022






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774843305


   > ```sql
   > create table t1 using parquet as select id as a, id as b, id as c from range(1000000);
   > create table t2 using parquet as select id as a, id as b, id as c from range(1000000);
   > create table t3 using parquet as select id as a, id as b, id as c from range(1000000);
   > analyze table t1 compute statistics for all columns;
   > analyze table t2 compute statistics for all columns;
   > analyze table t3 compute statistics for all columns;
   > 
   > set spark.sql.cbo.enabled=true;
   > explain cost
   > select * from t3 where c > (select max(t1.c) as tc from t1 join t2 on t1.a = t2.a and t2.b < 10);
   > ```
   
   
   After current pr the physic plan is 
   ```
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Filter (isnotnull(c#9787L) AND (c#9787L > Subquery subquery#9784, [id=#246]))
      :  +- Subquery subquery#9784, [id=#246]
      :     +- AdaptiveSparkPlan isFinalPlan=false
      :        +- HashAggregate(keys=[], functions=[max(c#9790L)], output=[tc#9783L])
      :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#244]
      :              +- HashAggregate(keys=[], functions=[partial_max(c#9790L)], output=[max#9799L])
      :                 +- Project [c#9790L]
      :                    +- BroadcastHashJoin [a#9788L], [a#9791L], Inner, BuildRight, false
      :                       :- Filter isnotnull(a#9788L)
      :                       :  +- FileScan parquet default.t1[a#9788L,c#9790L] Batched: true, DataFilters: [isnotnull(a#9788L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,c:bigint>
      :                       +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#239]
      :                          +- Project [a#9791L]
      :                             +- Filter ((isnotnull(b#9792L) AND (b#9792L < 10)) AND isnotnull(a#9791L))
      :                                +- FileScan parquet default.t2[a#9791L,b#9792L] Batched: true, DataFilters: [isnotnull(b#9792L), (b#9792L < 10), isnotnull(a#9791L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(b), LessThan(b,10), IsNotNull(a)], ReadSchema: struct<a:bigint,b:bigint>
      +- FileScan parquet default.t3[a#9785L,b#9786L,c#9787L] Batched: true, DataFilters: [isnotnull(c#9787L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c)], ReadSchema: struct<a:bigint,b:bigint,c:bigint>
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775034585


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135004/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572051494



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##########
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:
       > Yea let's move it to `stringWithStats`
   
   Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775611880


   **[Test build #135054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135054/testReport)** for PR 31485 at commit [`b29b2dc`](https://github.com/apache/spark/commit/b29b2dc9fb581bb8ad504ae9bcfe4a9ca769808a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775221214


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39615/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774590782


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39549/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774577889


   > Looks good. Is it possible to add a test in `StatisticsCollectionSuite`?
   
   How about current change?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774644604


   **[Test build #134978 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134978/testReport)** for PR 31485 at commit [`f0c5c0a`](https://github.com/apache/spark/commit/f0c5c0a79ae76276995372afa3d3362f6a7586b1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-773892525


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39506/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775265162


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39618/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775401954


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39624/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774026363


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134921/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572163060



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      df.queryExecution.stringWithStats

Review comment:
       AFAIK the test framework erases the location string. Can you try it out?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775735337


   **[Test build #135067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135067/testReport)** for PR 31485 at commit [`6afa78b`](https://github.com/apache/spark/commit/6afa78bf5586aaf2ba64ddc94aedbe6d2fa80dfa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775305023


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39618/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775315610


   **[Test build #135041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135041/testReport)** for PR 31485 at commit [`29916c8`](https://github.com/apache/spark/commit/29916c899795997befc8ad8944200ceeea8b5a7d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774841202


   It may just not be explained correctly, for example:
   ```sql
   create table t1 using parquet as select id as a, id as b, id as c from range(1000000);
   create table t2 using parquet as select id as a, id as b, id as c from range(1000000);
   create table t3 using parquet as select id as a, id as b, id as c from range(1000000);
   analyze table t1 compute statistics for all columns;
   analyze table t2 compute statistics for all columns;
   analyze table t3 compute statistics for all columns;
   
   set spark.sql.cbo.enabled=true;
   explain cost
   select * from t3 where c > (select max(t1.c) as tc from t1 join t2 on t1.a = t2.a and t2.b < 10);
   
   
   == Optimized Logical Plan ==
   Filter (isnotnull(c#35782L) AND (c#35782L > scalar-subquery#35774 [])), Statistics(sizeInBytes=30.5 MiB, rowCount=1.00E+6)
   :  +- Aggregate [max(c#35785L) AS tc#35773L]
   :     +- Project [c#35785L]
   :        +- Join Inner, (a#35783L = a#35786L)
   :           :- Project [a#35783L, c#35785L]
   :           :  +- Filter isnotnull(a#35783L)
   :           :     +- Relation[a#35783L,b#35784L,c#35785L] parquet
   :           +- Project [a#35786L]
   :              +- Filter ((isnotnull(b#35787L) AND (b#35787L < 10)) AND isnotnull(a#35786L))
   :                 +- Relation[a#35786L,b#35787L,c#35788L] parquet
   +- Relation[a#35780L,b#35781L,c#35782L] parquet, Statistics(sizeInBytes=30.5 MiB, rowCount=1.00E+6)
   
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Filter (isnotnull(c#35782L) AND (c#35782L > Subquery subquery#35774, [id=#201]))
      :  +- Subquery subquery#35774, [id=#201]
      :     +- AdaptiveSparkPlan isFinalPlan=false
      :        +- HashAggregate(keys=[], functions=[max(c#35785L)], output=[tc#35773L])
      :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#199]
      :              +- HashAggregate(keys=[], functions=[partial_max(c#35785L)], output=[max#35791L])
      :                 +- Project [c#35785L]
      :                    +- BroadcastHashJoin [a#35783L], [a#35786L], Inner, BuildRight, false
      :                       :- Filter isnotnull(a#35783L)
      :                       :  +- FileScan parquet default.t1[a#35783L,c#35785L] Batched: true, DataFilters: [isnotnull(a#35783L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/root/opensource/apache-spark/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,c:bigint>
      :                       +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#194]
      :                          +- Project [a#35786L]
      :                             +- Filter ((isnotnull(b#35787L) AND (b#35787L < 10)) AND isnotnull(a#35786L))
      :                                +- FileScan parquet default.t2[a#35786L,b#35787L] Batched: true, DataFilters: [isnotnull(b#35787L), (b#35787L < 10), isnotnull(a#35786L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/root/opensource/apache-spark/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(b), LessThan(b,10), IsNotNull(a)], ReadSchema: struct<a:bigint,b:bigint>
      +- FileScan parquet default.t3[a#35780L,b#35781L,c#35782L] Batched: true, DataFilters: [isnotnull(c#35782L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/root/opensource/apache-spark/spark-warehouse/t3], PartitionFilters: [], PushedFilters: [IsNotNull(c)], ReadSchema: struct<a:bigint,b:bigint,c:bigint>
   ```
   
   It should be SMJ if stats incorrect.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572208986



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      df.queryExecution.stringWithStats

Review comment:
       > AFAIK the test framework erases the location string. Can you try it out?
   
   Hmmmm, updated




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775305023


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39618/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-776410817


   thanks, merging to master!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572094831



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      df.queryExecution.stringWithStats

Review comment:
       can we create `explain-cbo.sql` and move this test to there?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774632576


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134968/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571780104



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +680,50 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      val optimizedPlan = df.queryExecution.optimizedPlan
+      optimizedPlan.stats
+      val subqueryExpression = ArrayBuffer.empty[SubqueryExpression]

Review comment:
       > nit: `mutable.ArrayBuffer.empty[SubqueryExpression]`
   > 
   > (since we already imported `mutable` package)
   
   Updated




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775024827


   **[Test build #135004 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135004/testReport)** for PR 31485 at commit [`89783c1`](https://github.com/apache/spark/commit/89783c18fdfae87d398a37438975843a4f64274d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774619558


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39561/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572524056



##########
File path: sql/core/src/test/resources/sql-tests/results/explain-cbo.sql.out
##########
@@ -0,0 +1,80 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 5
+
+
+-- !query
+CREATE TABLE t1(a INT, b INT) USING PARQUET
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+CREATE TABLE t2(c INT, d INT) USING PARQUET
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+EXPLAIN COST WITH max_store_sales AS
+(
+  SELECT max(csales) tpcds_cmax
+  FROM (
+    SELECT sum(b) csales
+    FROM t1 WHERE a < 100
+  ) x
+),
+best_ss_customer AS
+(
+  SELECT c
+  FROM t2
+  WHERE d > (SELECT * FROM max_store_sales)
+)
+SELECT c FROM best_ss_customer
+-- !query schema
+struct<plan:string>
+-- !query output
+== Optimized Logical Plan ==
+Project [c#x], Statistics(sizeInBytes=1.0 B, rowCount=0)
++- Filter (isnotnull(d#x) AND (cast(d#x as bigint) > scalar-subquery#x [])), Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :  +- Aggregate [max(csales#xL) AS tpcds_cmax#xL], Statistics(sizeInBytes=16.0 B, rowCount=1)
+   :     +- Aggregate [sum(b#x) AS csales#xL], Statistics(sizeInBytes=16.0 B, rowCount=1)
+   :        +- Project [b#x], Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :           +- Filter (isnotnull(a#x) AND (a#x < 100)), Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :              +- Relation[a#x,b#x] parquet, Statistics(sizeInBytes=1.0 B, rowCount=0)
+   +- Relation[c#x,d#x] parquet, Statistics(sizeInBytes=1.0 B, rowCount=0)
+
+== Physical Plan ==
+AdaptiveSparkPlan isFinalPlan=false
++- Project [c#x]
+   +- Filter (isnotnull(d#x) AND (cast(d#x as bigint) > Subquery subquery#x, [id=#x]))
+      :  +- Subquery subquery#x, [id=#x]
+      :     +- AdaptiveSparkPlan isFinalPlan=false
+      :        +- HashAggregate(keys=[], functions=[max(csales#xL)], output=[tpcds_cmax#xL])
+      :           +- HashAggregate(keys=[], functions=[partial_max(csales#xL)], output=[max#xL])
+      :              +- HashAggregate(keys=[], functions=[sum(b#x)], output=[csales#xL])
+      :                 +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#x]
+      :                    +- HashAggregate(keys=[], functions=[partial_sum(b#x)], output=[sum#xL])
+      :                       +- Project [b#x]
+      :                          +- Filter (isnotnull(a#x) AND (a#x < 100))
+      :                             +- FileScan parquet default.t1[a#x,b#x] Batched: true, DataFilters: [isnotnull(a#x), (a#x < 100)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/sql/core/spark..., PartitionFilters: [], PushedFilters: [IsNotNull(a), LessThan(a,100)], ReadSchema: struct<a:int,b:int>

Review comment:
       The Physical Plan should be normalized.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775380294






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572092844



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##########
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")

Review comment:
       nit: use `withSQLConf`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775756835


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39649/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775670159


   **[Test build #135054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135054/testReport)** for PR 31485 at commit [`b29b2dc`](https://github.com/apache/spark/commit/b29b2dc9fb581bb8ad504ae9bcfe4a9ca769808a).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774632576


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134968/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774929004


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39587/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572053175



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##########
@@ -253,6 +254,15 @@ class QueryExecution(
 
     // trigger to compute stats for logical plans
     try {
+      optimizedPlan.transform {
+        case p => p.transformExpressions {

Review comment:
       ditto, use `p.expressions.foreach(_.foreach ...)` instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775346548


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39624/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-SQL][34137] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-773931655


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39506/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572074133



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##########
@@ -253,6 +254,15 @@ class QueryExecution(
 
     // trigger to compute stats for logical plans
     try {
+      optimizedPlan.transform {

Review comment:
       > shall we use `foreach` instead of `transform`?
   
   Yea, seems foreach more better in here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775660961


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39641/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774497964


   > Looks good. Is it possible to add a test in `StatisticsCollectionSuite`?
   
   How about current UT?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-775315610


   **[Test build #135041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135041/testReport)** for PR 31485 at commit [`29916c8`](https://github.com/apache/spark/commit/29916c899795997befc8ad8944200ceeea8b5a7d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31485:
URL: https://github.com/apache/spark/pull/31485#issuecomment-774647551


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134978/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572226095



##########
File path: sql/core/src/test/resources/sql-tests/inputs/explain-cbo.sql
##########
@@ -0,0 +1,25 @@
+CREATE TABLE t1(a INT, b INT) USING PARQUET;
+CREATE TABLE t2(c INT, d INT) USING PARQUET;
+
+ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS;
+ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS;
+
+SET spark.sql.cbo.enabled=true;

Review comment:
       We can add  `--SET spark.sql.cbo.enabled=true`  at the first line of this file.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org