
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/01/14 08:42:54 UTC

[GitHub] [spark] AngersZhuuuu opened a new pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

AngersZhuuuu opened a new pull request #31179:
URL: https://github.com/apache/spark/pull/31179


   ### What changes were proposed in this pull request?
   In the current code, inserting into a Hive table only updates `sizeInBytes`; it does not update `rowCount` or the partition-level statistics.
   
   This PR collects per-partition statistics and total statistics while writing data, then uses them to update the statistics stored in the metadata.
   
   ### Why are the changes needed?
   Up-to-date statistics are useful.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811754099


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136790/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811659269


   **[Test build #136795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136795/testReport)** for PR 31179 at commit [`70e7425`](https://github.com/apache/spark/commit/70e74254acc17ca02ba90c70a7b097b39308ee65).



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811660012


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41376/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760091789


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38632/
   



[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773151680


   > how big is the overhead? I had an impression that auto stats update is very expensive and not many people are using it...
   
   In the original approach:
   
   1. We only update `sizeInBytes`, not `rowCount`, so getting `rowCount` requires re-running the analyze command, which adds overhead.
   2. To update `sizeInBytes`, the original logic needs to fetch the status of every file under the target directory. Since Spark often produces many small files, this is slow, especially when the user's HDFS is slow.
   
   In the current approach:
   
   1. We collect row count information while writing data and return it to the driver through `BasicWriteJobStatsTracker`. As discussed earlier in https://github.com/apache/spark/pull/30026#issuecomment-709868109, returning partition information through `BasicWriteJobStatsTracker` is not a concern, so carrying a `PartitionsStats` in this PR should not be a concern either.
   2. We just use the metric data from `BasicWriteJobStatsTracker` to update the statistics metadata; no other behavior changes. This should be faster than the original approach since we don't need to fetch the status of every file.
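
The idea of merging per-partition write metrics on the driver, instead of listing files afterwards, can be sketched without Spark. This is only an illustration under stated assumptions: `WriteStats`, `WriteStatsSketch`, `mergeTaskStats` and `total` are hypothetical names, not the `BasicWriteJobStatsTracker` or `PartitionsStats` API from the PR.

```scala
// Hypothetical, Spark-free model of the statistics one task reports for a partition.
final case class WriteStats(numBytes: Long, numRows: Long) {
  def +(other: WriteStats): WriteStats =
    WriteStats(numBytes + other.numBytes, numRows + other.numRows)
}

object WriteStatsSketch {
  // Each task reports stats keyed by the partition path it wrote to.
  // The driver merges the reports by summing bytes and rows per partition.
  def mergeTaskStats(taskReports: Seq[Map[String, WriteStats]]): Map[String, WriteStats] =
    taskReports.flatten.groupBy(_._1).map { case (partition, entries) =>
      partition -> entries.map(_._2).reduce(_ + _)
    }

  // Table-level totals are the sum over all partitions, so no file listing is needed.
  def total(perPartition: Map[String, WriteStats]): WriteStats =
    perPartition.values.foldLeft(WriteStats(0L, 0L))(_ + _)
}
```

In this sketch the cost of computing the statistics grows with the number of task reports rather than with the number of files under the target directory, which is the trade-off argued for in the comment.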



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772230432


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39394/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811635011


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41373/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811659972







[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811685509


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41378/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808871039


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41194/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115053


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139856/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811634186


   **[Test build #136791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136791/testReport)** for PR 31179 at commit [`be37e31`](https://github.com/apache/spark/commit/be37e3153fbf07e81f0536a83b9214063fa9704e).



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862229797


   **[Test build #139871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139871/testReport)** for PR 31179 at commit [`7fdf7d0`](https://github.com/apache/spark/commit/7fdf7d0ef79d78bb015eb92cc78bc0f7df607208).



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772270200


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39394/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808897476


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136612/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r569154926



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##########
@@ -2581,6 +2581,59 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
       }
     }
   }
+
+  test("xxx") {

Review comment:
       Looks like we should have a proper test name btw





[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r655843657



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {

Review comment:
       > I think the question was why we should separate them.
   
   Because for a single partition, it returns the same statistics data as an insert into a non-partitioned table.
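
The partition-spec classification in the quoted hunk can be tried in isolation. The following is a minimal sketch, assuming the two predicates behave exactly as the expressions in the diff; the object name `PartitionSpecSketch` and the `Spec` alias are hypothetical and only exist for this illustration.

```scala
object PartitionSpecSketch {
  // Partition column name -> optional static value (None means a dynamic partition column).
  type Spec = Map[String, Option[String]]

  // Every partition column has a static value, so exactly one partition is written.
  def isSinglePartition(spec: Spec): Boolean =
    spec.nonEmpty && spec.values.forall(_.nonEmpty)

  // A mix of static and dynamic partition columns.
  def isPartialPartitions(spec: Spec): Boolean =
    spec.nonEmpty && spec.values.exists(_.isEmpty) && spec.values.exists(_.nonEmpty)
}
```

For example, a spec such as `Map("dt" -> Some("2021-01-14"), "hr" -> None)` is classified as partial, while an empty spec (a non-partitioned table) matches neither predicate.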





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862226856


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44385/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811791013


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136791/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811660012


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41376/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808871039


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41194/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771734983


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39367/
   



[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-809020307


   Any more suggestion?



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862214361


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44385/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811619794


   **[Test build #136790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136790/testReport)** for PR 31179 at commit [`31821ff`](https://github.com/apache/spark/commit/31821ffe46ab1b95a536a1a65727448a9cd47941).



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811633429


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41373/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862108939


   **[Test build #139856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)** for PR 31179 at commit [`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772345392


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39402/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760030667


   **[Test build #134046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134046/testReport)** for PR 31179 at commit [`175619c`](https://github.com/apache/spark/commit/175619c1c6193082aca4c3db0521dbda3cec8358).



[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773113604







[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811685523


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41378/
   



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r655843657



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {

Review comment:
       > I think the question was why we should separate them.
   
   Because for a single partition, it returns the same statistics as an insert into a non-partitioned table.
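
   The distinction drawn in the diff above can be sketched in plain Scala. This is only an illustration under stated assumptions: the two predicates are copied from the hunk, while `PartitionSpecKinds`, `SpecKind`, its case objects, and `classify` are hypothetical names that do not appear in the PR.

```scala
object PartitionSpecKinds {
  // Hypothetical labels for the cases the hunk distinguishes; not part of the PR.
  sealed trait SpecKind
  case object NonPartitioned extends SpecKind     // empty partition spec
  case object SinglePartition extends SpecKind    // every partition column has a static value
  case object PartialPartitions extends SpecKind  // mix of static and dynamic columns
  case object DynamicPartitions extends SpecKind  // every partition column is dynamic

  def classify(partitionSpec: Map[String, Option[String]]): SpecKind = {
    // These two predicates are taken verbatim from the diff.
    val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
    val isPartialPartitions = partitionSpec.nonEmpty &&
      partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)

    if (partitionSpec.isEmpty) NonPartitioned
    else if (isSinglePartition) SinglePartition
    else if (isPartialPartitions) PartialPartitions
    else DynamicPartitions
  }
}
```

   For example, a spec such as `Map("dt" -> Some("2021-01-14"), "hr" -> None)` would fall into the partial case, while a spec with every value defined names exactly one partition.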





[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772281834


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39394/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811681712


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41378/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772454882


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134814/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772326648


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39402/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808865653


   **[Test build #136612 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136612/testReport)** for PR 31179 at commit [`d0e27d0`](https://github.com/apache/spark/commit/d0e27d0031a095bd66ebe13ae16ba29440818191).



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811634996


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41373/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811634186


   **[Test build #136791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136791/testReport)** for PR 31179 at commit [`be37e31`](https://github.com/apache/spark/commit/be37e3153fbf07e81f0536a83b9214063fa9704e).



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r605371258



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
##########
@@ -251,7 +252,18 @@ class DynamicPartitionDataWriter(
       // See a new partition or bucket - write to a new partition dir (or a new bucket file).
       if (isPartitioned && currentPartitionValues != nextPartitionValues) {
         currentPartitionValues = Some(nextPartitionValues.get.copy())
-        statsTrackers.foreach(_.newPartition(currentPartitionValues.get))
+        val partitionSpec: Map[String, String] = description.partitionColumns.map(attr => {
+          val proj = UnsafeProjection.create(Seq(attr), description.partitionColumns)

Review comment:
       > These `UnsafeProjection` objects can be created and reused?
   
   It seems we don't need this; I have removed it now.





[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811822932


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136795/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771660086


   **[Test build #134785 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134785/testReport)** for PR 31179 at commit [`fc82c38`](https://github.com/apache/spark/commit/fc82c3885f32a12ca33c96f7c98e330b73558dec).



[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773113604


   Gentle ping @cloud-fan @viirya 



[GitHub] [spark] cloud-fan commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773131159


   How big is the overhead? I had the impression that auto stats update is very expensive and not many people are using it...



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811822932


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136795/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811619794


   **[Test build #136790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136790/testReport)** for PR 31179 at commit [`31821ff`](https://github.com/apache/spark/commit/31821ffe46ab1b95a536a1a65727448a9cd47941).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811635011


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41373/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772210554


   **[Test build #134806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134806/testReport)** for PR 31179 at commit [`fc82c38`](https://github.com/apache/spark/commit/fc82c3885f32a12ca33c96f7c98e330b73558dec).



[GitHub] [spark] github-actions[bot] commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-931793103


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772281834


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39394/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772433966


   **[Test build #134814 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134814/testReport)** for PR 31179 at commit [`68e0256`](https://github.com/apache/spark/commit/68e025600fcea3a5ee65206c4f62c71effd5acdd).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808865653


   **[Test build #136612 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136612/testReport)** for PR 31179 at commit [`d0e27d0`](https://github.com/apache/spark/commit/d0e27d0031a095bd66ebe13ae16ba29440818191).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811791013


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136791/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771690647







[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-818634558


   Any more suggestions?



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772284221


   **[Test build #134814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134814/testReport)** for PR 31179 at commit [`68e0256`](https://github.com/apache/spark/commit/68e025600fcea3a5ee65206c4f62c71effd5acdd).



[GitHub] [spark] cloud-fan commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-810305078


   The idea looks good. @viirya what do you think?



[GitHub] [spark] viirya commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r604562571



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
##########
@@ -251,7 +252,18 @@ class DynamicPartitionDataWriter(
       // See a new partition or bucket - write to a new partition dir (or a new bucket file).
       if (isPartitioned && currentPartitionValues != nextPartitionValues) {
         currentPartitionValues = Some(nextPartitionValues.get.copy())
-        statsTrackers.foreach(_.newPartition(currentPartitionValues.get))
+        val partitionSpec: Map[String, String] = description.partitionColumns.map(attr => {
+          val proj = UnsafeProjection.create(Seq(attr), description.partitionColumns)

Review comment:
       These `UnsafeProjection` objects can be created and reused?
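
   The general idea behind this question, building a costly helper once and reusing it on every partition change instead of rebuilding it each time, can be sketched in plain Scala. This is an assumption-laden illustration, not Spark code: `ColumnExtractor` is a hypothetical stand-in for an expensive-to-build object (it is not `UnsafeProjection`), and `ReuseSketch`, `constructed`, and `partitionSpec` are likewise invented for the example.

```scala
object ReuseSketch {
  // Counts how many extractors were built, to make the reuse visible.
  var constructed = 0

  // Hypothetical stand-in for an object that is costly to construct.
  final class ColumnExtractor(val column: String, schema: Seq[String]) {
    constructed += 1
    private val index = schema.indexOf(column)
    def apply(row: Seq[String]): String = row(index)
  }

  val partitionColumns: Seq[String] = Seq("dt", "hr")

  // Built once (lazily) and then reused for every partition change.
  lazy val extractors: Seq[ColumnExtractor] =
    partitionColumns.map(c => new ColumnExtractor(c, partitionColumns))

  // Derives the column -> value map for one row of partition values.
  def partitionSpec(row: Seq[String]): Map[String, String] =
    extractors.map(e => e.column -> e(row)).toMap
}
```

   Calling `ReuseSketch.partitionSpec` for many rows constructs only one extractor per partition column, whereas creating the extractors inside `partitionSpec` would rebuild them on every call.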





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760111168







[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772454882


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134814/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760096307


   **[Test build #134046 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134046/testReport)** for PR 31179 at commit [`175619c`](https://github.com/apache/spark/commit/175619c1c6193082aca4c3db0521dbda3cec8358).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `case class PartitionStats(var numFiles: Int = 0, var numBytes: Long = 0, var numRows: Long = 0) `



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772350624


   **[Test build #134806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134806/testReport)** for PR 31179 at commit [`fc82c38`](https://github.com/apache/spark/commit/fc82c3885f32a12ca33c96f7c98e330b73558dec).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r569161195



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##########
@@ -2581,6 +2581,59 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
       }
     }
   }
+
+  test("xxx") {

Review comment:
       > Looks like we should have a proper test name btw
   
   Hmmmm, updated ==





[GitHub] [spark] github-actions[bot] closed pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #31179:
URL: https://github.com/apache/spark/pull/31179


   



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862226856


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44385/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862108939


   **[Test build #139856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)** for PR 31179 at commit [`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771734983


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39367/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772345392


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39402/
   



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r604554605



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
##########
@@ -251,7 +252,18 @@ class DynamicPartitionDataWriter(
       // See a new partition or bucket - write to a new partition dir (or a new bucket file).
       if (isPartitioned && currentPartitionValues != nextPartitionValues) {
         currentPartitionValues = Some(nextPartitionValues.get.copy())
-        statsTrackers.foreach(_.newPartition(currentPartitionValues.get))
+        val partitionSpec: Map[String, String] = description.partitionColumns.map(attr => {
+          val proj = UnsafeProjection.create(Seq(attr), description.partitionColumns)
+          val attrRow = proj(currentPartitionValues.get)
+          val value = if (attrRow.isNullAt(0)) {
+            null
+          } else {
+            Cast(Literal(attrRow.get(0, attr.dataType), attr.dataType),
+              StringType, Some(SQLConf.get.sessionLocalTimeZone)).eval().toString
+          }
+          attr.name -> value

Review comment:
       > This additional projection does not look good for performance.
   
   So should we skip the projection here and do it on the driver side when updating the metrics?





[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772204354


   retest this please



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808893689


   **[Test build #136612 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136612/testReport)** for PR 31179 at commit [`d0e27d0`](https://github.com/apache/spark/commit/d0e27d0031a095bd66ebe13ae16ba29440818191).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772352785


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134806/
   



[GitHub] [spark] viirya commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r604274317



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
##########
@@ -251,7 +252,18 @@ class DynamicPartitionDataWriter(
       // See a new partition or bucket - write to a new partition dir (or a new bucket file).
       if (isPartitioned && currentPartitionValues != nextPartitionValues) {
         currentPartitionValues = Some(nextPartitionValues.get.copy())
-        statsTrackers.foreach(_.newPartition(currentPartitionValues.get))
+        val partitionSpec: Map[String, String] = description.partitionColumns.map(attr => {
+          val proj = UnsafeProjection.create(Seq(attr), description.partitionColumns)
+          val attrRow = proj(currentPartitionValues.get)
+          val value = if (attrRow.isNullAt(0)) {
+            null
+          } else {
+            Cast(Literal(attrRow.get(0, attr.dataType), attr.dataType),
+              StringType, Some(SQLConf.get.sessionLocalTimeZone)).eval().toString
+          }
+          attr.name -> value

Review comment:
       This additional projection does not look good for performance.





[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862224107


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44388/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771690642







[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r605358714



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {

Review comment:
       > Why do we need to handle the single partition case and the non-single partition case separately?
   
   yes





[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771707903


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39367/
   



[GitHub] [spark] AngersZhuuuu removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-818634558


   Any more suggestions?



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760111170







[GitHub] [spark] cloud-fan commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773131159


   How big is the overhead? I had the impression that auto stats update is very expensive and that not many people are using it...



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115008


   **[Test build #139856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)** for PR 31179 at commit [`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] maropu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r604513642



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>
+            key -> value.get
+          }
+          val partition = catalog.listPartitions(table.identifier, Some(spec))
+          val newTableStats = CommandUtils.mergeNewStats(
+            newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+          val newPartitions = partition.flatten { part =>
+            val newStates = if (part.stats.isDefined && part.stats.get.rowCount.isDefined) {
+              CommandUtils.mergeNewStats(

Review comment:
       ditto

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -199,6 +312,17 @@ object CommandUtils extends Logging {
     newStats
   }
 
+  def mergeNewStats(

Review comment:
       private

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>
+            key -> value.get
+          }
+          val partition = catalog.listPartitions(table.identifier, Some(spec))
+          val newTableStats = CommandUtils.mergeNewStats(

Review comment:
       `CommandUtils.mergeNewStats` -> `mergeNewStats`

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>

Review comment:
       `val spec = partitionSpec.mapValues(_.get)`
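
For reference, the two forms can be compared in plain Scala. This is a minimal sketch, not code from the PR: the object name and the sample spec are hypothetical, and it assumes every value in the spec is defined, which is what the `isSinglePartition` guard in the diff checks. One caveat comes from the Scala standard library rather than from the review: `mapValues` returns a lazy, non-serializable view on Scala 2.12 and a deprecated `MapView` on 2.13, so the sketch calls `.toMap` to obtain a strict `Map`.

```scala
object MapValuesSketch {
  // Hypothetical static partition spec: every partition column has a value.
  val partitionSpec: Map[String, Option[String]] =
    Map("year" -> Some("2021"), "month" -> Some("01"))

  // Form used in the patch under review.
  val viaMap: Map[String, String] =
    partitionSpec.map { case (key, value) => key -> value.get }

  // Form suggested in the review. `.toMap` forces a strict Map so the result
  // does not stay a lazy view (2.12) or a MapView (2.13).
  val viaMapValues: Map[String, String] =
    partitionSpec.mapValues(_.get).toMap

  def main(args: Array[String]): Unit = {
    // Both forms produce the same key/value pairs.
    assert(viaMap == viaMapValues)
  }
}
```

Both forms throw `NoSuchElementException` if any value is `None`, so either one is only safe behind the `isSinglePartition` check.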

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>
+            key -> value.get
+          }
+          val partition = catalog.listPartitions(table.identifier, Some(spec))
+          val newTableStats = CommandUtils.mergeNewStats(
+            newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+          val newPartitions = partition.flatten { part =>
+            val newStates = if (part.stats.isDefined && part.stats.get.rowCount.isDefined) {
+              CommandUtils.mergeNewStats(
+                part.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+            } else {
+              CommandUtils.compareAndGetNewStats(
+                part.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+            }
+            newStates.map(_ => part.copy(stats = newStates))
+          }
+          if (newTableStats.isDefined) {
+            catalog.alterTableStats(table.identifier, newTableStats)
+          }
+          if (newPartitions.nonEmpty) {
+            catalog.alterPartitions(table.identifier, newPartitions)
+          }
+        } else {
+          // update all partitions statistics
+          val partitions = statsTracker.partitionsStats.map { case (part, stats) =>
+            val partition = catalog.getPartition(table.identifier, part)
+            val newStats = Some(CatalogStatistics(
+              sizeInBytes = stats.numBytes, rowCount = Some(stats.numRows)))
+            partition.copy(stats = newStats)
+          }.toSeq
+          if (partitions.nonEmpty) {
+            catalog.alterPartitions(table.identifier, partitions)
+          }
+
+          if (isPartialPartitions) {
+            val newStats = CommandUtils.mergeNewStats(
+              newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+            if (newStats.isDefined) {
+              catalog.alterTableStats(table.identifier, newStats)
+            }
+          } else {
+            val newStats = CommandUtils.compareAndGetNewStats(
+              newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+            if (newStats.isDefined) {
+              catalog.alterTableStats(table.identifier, newStats)
+            }
+          }
+        }
+      } else {
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>

Review comment:
       `val spec = partitionSpec.mapValues(_.get)`

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&

Review comment:
       Could you move `isPartialPartitions` into L102?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -38,25 +39,40 @@ import org.apache.spark.util.SerializableConfiguration
  * These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
  */
 case class BasicWriteTaskStats(
-    partitions: Seq[InternalRow],
-    numFiles: Int,
-    numBytes: Long,
-    numRows: Long)
+    partitionsStats: mutable.Map[TablePartitionSpec, PartitionStats],

Review comment:
       partitionsStats -> partitionSpecWithStats

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {

Review comment:
       Why do we need to handle the single partition case and the non-single partition case separately?
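
To make the cases under discussion concrete, here is a small sketch of the shapes a `partitionSpec` can take, mirroring the `isSinglePartition` and `isPartialPartitions` predicates in the diff above. The `Kind` hierarchy and `classify` are hypothetical names for illustration only; the sketch does not answer why the cases need separate handling.

```scala
object PartitionSpecKind {
  sealed trait Kind
  case object NoSpec extends Kind      // empty spec: no PARTITION clause given
  case object AllStatic extends Kind   // PARTITION (a = 1, b = 2); `isSinglePartition`
  case object Partial extends Kind     // PARTITION (a = 1, b); `isPartialPartitions`
  case object AllDynamic extends Kind  // PARTITION (a, b)

  def classify(partitionSpec: Map[String, Option[String]]): Kind = {
    val values = partitionSpec.values
    if (partitionSpec.isEmpty) NoSpec
    else if (values.forall(_.nonEmpty)) AllStatic
    else if (values.exists(_.nonEmpty)) Partial
    else AllDynamic
  }
}
```

Only `AllStatic` identifies exactly one partition up front; in the other non-empty cases the set of written partitions is known only after the write.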

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>
+            key -> value.get
+          }
+          val partition = catalog.listPartitions(table.identifier, Some(spec))
+          val newTableStats = CommandUtils.mergeNewStats(
+            newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+          val newPartitions = partition.flatten { part =>

Review comment:
       This block seems to be the same as the `overwrite=false` case? https://github.com/apache/spark/pull/31179/files#diff-6309057f8f41f20f8de513ab67d7755aae5fb30d7441fc21000999c9e8e8e0bfR125-R140





[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862114525


   **[Test build #139860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139860/testReport)** for PR 31179 at commit [`7fdf7d0`](https://github.com/apache/spark/commit/7fdf7d0ef79d78bb015eb92cc78bc0f7df607208).



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811685523


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41378/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772284221


   **[Test build #134814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134814/testReport)** for PR 31179 at commit [`68e0256`](https://github.com/apache/spark/commit/68e025600fcea3a5ee65206c4f62c71effd5acdd).



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760030667


   **[Test build #134046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134046/testReport)** for PR 31179 at commit [`175619c`](https://github.com/apache/spark/commit/175619c1c6193082aca4c3db0521dbda3cec8358).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115053


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139856/
   



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r557215501



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -135,30 +163,35 @@ class BasicWriteJobStatsTracker(
     @transient val metrics: Map[String, SQLMetric])
   extends WriteJobStatsTracker {
 
+  @transient val partitionsStats: mutable.Map[TablePartitionSpec, PartitionStats] =
+    mutable.Map.empty
+  @transient var numFiles: Long = 0L
+  @transient var totalNumBytes: Long = 0L
+  @transient var totalNumOutput: Long = 0L

Review comment:
       It seems we need to use an accumulator here?
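
One alternative to accumulators, sketched here under stated assumptions: each task returns its own per-partition stats, and the driver folds them together once the job finishes (in Spark, `WriteJobStatsTracker.processStats` receives the per-task `WriteTaskStats` on the driver). `PartStats`, `TaskStats`, `JobTotals` and `aggregate` are hypothetical stand-ins, not the classes in this PR.

```scala
object JobStatsSketch {
  type PartitionSpec = Map[String, String]

  // Hypothetical per-partition counters reported by a single task.
  final case class PartStats(numFiles: Int, numBytes: Long, numRows: Long) {
    def +(other: PartStats): PartStats =
      PartStats(numFiles + other.numFiles, numBytes + other.numBytes, numRows + other.numRows)
  }

  // Hypothetical result sent back from one task to the driver.
  final case class TaskStats(partitions: Map[PartitionSpec, PartStats])

  // Job-level view built on the driver.
  final case class JobTotals(
      partitions: Map[PartitionSpec, PartStats],
      totalNumBytes: Long,
      totalNumOutput: Long)

  // Fold all task results into per-partition and job-wide totals.
  def aggregate(tasks: Seq[TaskStats]): JobTotals = {
    val merged = tasks.foldLeft(Map.empty[PartitionSpec, PartStats]) { (acc, task) =>
      task.partitions.foldLeft(acc) { case (m, (spec, stats)) =>
        m.updated(spec, m.get(spec).map(_ + stats).getOrElse(stats))
      }
    }
    JobTotals(merged, merged.values.map(_.numBytes).sum, merged.values.map(_.numRows).sum)
  }
}
```

With this shape the driver-side fields are written only on the driver, so they would not need to be shared mutable state across executors.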





[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771685532


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39367/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808897476


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136612/
   



[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862214478


   retest this please



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811659269


   **[Test build #136795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136795/testReport)** for PR 31179 at commit [`70e7425`](https://github.com/apache/spark/commit/70e74254acc17ca02ba90c70a7b097b39308ee65).



[GitHub] [spark] HyukjinKwon commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r655838827



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {

Review comment:
       I think the question was why we should separate them.





[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760077095


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38632/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772352785


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134806/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811754099


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136790/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811742515


   **[Test build #136790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136790/testReport)** for PR 31179 at commit [`31821ff`](https://github.com/apache/spark/commit/31821ffe46ab1b95a536a1a65727448a9cd47941).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772316870


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39402/
   



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811816383


   **[Test build #136795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136795/testReport)** for PR 31179 at commit [`70e7425`](https://github.com/apache/spark/commit/70e74254acc17ca02ba90c70a7b097b39308ee65).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808869217


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41194/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772210554


   **[Test build #134806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134806/testReport)** for PR 31179 at commit [`fc82c38`](https://github.com/apache/spark/commit/fc82c3885f32a12ca33c96f7c98e330b73558dec).



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811786741


   **[Test build #136791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136791/testReport)** for PR 31179 at commit [`be37e31`](https://github.com/apache/spark/commit/be37e3153fbf07e81f0536a83b9214063fa9704e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r605370937



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>
+            key -> value.get
+          }
+          val partition = catalog.listPartitions(table.identifier, Some(spec))
+          val newTableStats = CommandUtils.mergeNewStats(
+            newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+          val newPartitions = partition.flatten { part =>

Review comment:
       > This block seems to be the same as the `overwrite=false` case? https://github.com/apache/spark/pull/31179/files#diff-6309057f8f41f20f8de513ab67d7755aae5fb30d7441fc21000999c9e8e8e0bfR125-R140
   
   Done

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&

Review comment:
       > Could you move `isPartialPartitions` into L102?
   
   Done

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>
+            key -> value.get
+          }
+          val partition = catalog.listPartitions(table.identifier, Some(spec))
+          val newTableStats = CommandUtils.mergeNewStats(
+            newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+          val newPartitions = partition.flatten { part =>
+            val newStates = if (part.stats.isDefined && part.stats.get.rowCount.isDefined) {
+              CommandUtils.mergeNewStats(
+                part.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+            } else {
+              CommandUtils.compareAndGetNewStats(
+                part.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+            }
+            newStates.map(_ => part.copy(stats = newStates))
+          }
+          if (newTableStats.isDefined) {
+            catalog.alterTableStats(table.identifier, newTableStats)
+          }
+          if (newPartitions.nonEmpty) {
+            catalog.alterPartitions(table.identifier, newPartitions)
+          }
+        } else {
+          // update all partitions statistics
+          val partitions = statsTracker.partitionsStats.map { case (part, stats) =>
+            val partition = catalog.getPartition(table.identifier, part)
+            val newStats = Some(CatalogStatistics(
+              sizeInBytes = stats.numBytes, rowCount = Some(stats.numRows)))
+            partition.copy(stats = newStats)
+          }.toSeq
+          if (partitions.nonEmpty) {
+            catalog.alterPartitions(table.identifier, partitions)
+          }
+
+          if (isPartialPartitions) {
+            val newStats = CommandUtils.mergeNewStats(
+              newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+            if (newStats.isDefined) {
+              catalog.alterTableStats(table.identifier, newStats)
+            }
+          } else {
+            val newStats = CommandUtils.compareAndGetNewStats(
+              newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+            if (newStats.isDefined) {
+              catalog.alterTableStats(table.identifier, newStats)
+            }
+          }
+        }
+      } else {
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>

Review comment:
       > `val spec = partitionSpec.mapValues(_.get)`
   
   Done

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>

Review comment:
       > `val spec = partitionSpec.mapValues(_.get)`
   
   Done
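
   As a minimal sketch of the suggestion above, the explicit `map { case (key, value) => key -> value.get }` and `mapValues(_.get)` produce the same entries. The object `SpecUnwrap` and its method names are hypothetical. One caveat, stated as general Scala behaviour rather than anything from the PR: `mapValues` returns a lazy view (and is deprecated since Scala 2.13), so the sketch calls `.toMap` to get a strict map. Either form throws on a `None` value, which is why the diff only reaches this code when `isSinglePartition` holds.

```scala
object SpecUnwrap {
  // Form used in the original diff: an explicit pattern match on each entry.
  def unwrapWithMap(spec: Map[String, Option[String]]): Map[String, String] =
    spec.map { case (key, value) => key -> value.get }

  // Form suggested in review; .toMap forces the lazy view into a strict Map.
  def unwrapWithMapValues(spec: Map[String, Option[String]]): Map[String, String] =
    spec.mapValues(_.get).toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical fully specified spec, i.e. the isSinglePartition case.
    val spec: Map[String, Option[String]] = Map("dt" -> Some("2021-01-14"), "hr" -> Some("08"))
    println(unwrapWithMap(spec) == unwrapWithMapValues(spec))
  }
}
```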

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -199,6 +312,17 @@ object CommandUtils extends Logging {
     newStats
   }
 
+  def mergeNewStats(

Review comment:
       > private
   
   Done

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>
+            key -> value.get
+          }
+          val partition = catalog.listPartitions(table.identifier, Some(spec))
+          val newTableStats = CommandUtils.mergeNewStats(
+            newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+          val newPartitions = partition.flatten { part =>
+            val newStates = if (part.stats.isDefined && part.stats.get.rowCount.isDefined) {
+              CommandUtils.mergeNewStats(

Review comment:
       > ditto
   
   Done

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 
 object CommandUtils extends Logging {
 
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+          partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
+          val spec = partitionSpec.map { case (key, value) =>
+            key -> value.get
+          }
+          val partition = catalog.listPartitions(table.identifier, Some(spec))
+          val newTableStats = CommandUtils.mergeNewStats(

Review comment:
       > `CommandUtils.mergeNewStats` -> `mergeNewStats`
   
   Done



