You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/03 10:19:58 UTC

[GitHub] [spark] wankunde opened a new pull request, #38496: [SPARK-40708] Auto update table statistics based on write metrics

wankunde opened a new pull request, #38496:
URL: https://github.com/apache/spark/pull/38496

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   
   Update table size and rowCount based on spark write metrics
   
   ### Why are the changes needed?
   
   Auto update table stats after write job finished.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Add UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jackylee-ch commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
jackylee-ch commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1337252351

   > Support partition statistics?
   
   @melin I'm working on the supporting of partition statistics update, it relies on workers to return detailed partition statistics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] melin commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by "melin (via GitHub)" <gi...@apache.org>.
melin commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1521072841

   > @melin I have opened another pr #39114 for this, and unfortunately that one was closed because of no feedback for a long time. If necessary, I can consider reopening it.
   
   reopen it,thank you
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1520984846

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] melin commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
melin commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1337210022

   Support partition statistics?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38496:
URL: https://github.com/apache/spark/pull/38496#discussion_r1016516581


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2456,6 +2456,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val AUTO_UPDATE_BASED_ON_METRICS_ENABLED =

Review Comment:
   `AUTO_UPDATE_STATS_BASED_ON_METRICS_ENABLED`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a diff in pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
wankunde commented on code in PR #38496:
URL: https://github.com/apache/spark/pull/38496#discussion_r1017597979


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##########
@@ -53,19 +53,33 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 object CommandUtils extends Logging {
 
   /** Change statistics after changing data by commands. */
-  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
-    val catalog = sparkSession.sessionState.catalog
-    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      wroteStats: Option[WriteStats] = None): Unit = {
+    val sessionState = sparkSession.sessionState
+    val catalog = sessionState.catalog
+    if (sessionState.conf.autoSizeUpdateEnabled) {
       val newTable = catalog.getTableMetadata(table.identifier)
       val (newSize, newPartitions) = CommandUtils.calculateTotalSize(sparkSession, newTable)
-      val isNewStats = newTable.stats.map(newSize != _.sizeInBytes).getOrElse(true)
+      val isNewStats = newTable.stats.forall(newSize != _.sizeInBytes)

Review Comment:
   Revert this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jackylee-ch commented on a diff in pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
jackylee-ch commented on code in PR #38496:
URL: https://github.com/apache/spark/pull/38496#discussion_r1014966247


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##########
@@ -53,19 +53,41 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 object CommandUtils extends Logging {
 
   /** Change statistics after changing data by commands. */
-  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
-    val catalog = sparkSession.sessionState.catalog
-    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      wroteStats: Option[WriteStats] = None): Unit = {
+    val sessionState = sparkSession.sessionState
+    val catalog = sessionState.catalog
+    if (sessionState.conf.autoSizeUpdateEnabled) {
       val newTable = catalog.getTableMetadata(table.identifier)
       val (newSize, newPartitions) = CommandUtils.calculateTotalSize(sparkSession, newTable)
-      val isNewStats = newTable.stats.map(newSize != _.sizeInBytes).getOrElse(true)
+      val isNewStats = newTable.stats.forall(newSize != _.sizeInBytes)
       if (isNewStats) {
         val newStats = CatalogStatistics(sizeInBytes = newSize)
         catalog.alterTableStats(table.identifier, Some(newStats))
       }
       if (newPartitions.nonEmpty) {
         catalog.alterPartitions(table.identifier, newPartitions)
       }
+    } else if (sessionState.conf.autoUpdateBasedOnMetricsEnabled && wroteStats.nonEmpty) {
+      val stat = wroteStats.get
+      stat.mode match {
+        case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
+          catalog.alterTableStats(table.identifier,
+            Some(CatalogStatistics(stat.numBytes, stat.numRows)))

Review Comment:
   Hm, we should consider about partition Statistics here. If we overwrite the part of the partitions, it would get wrong table statistcs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] melin commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
melin commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1339572472

   > > Support partition statistics?
   > 
   > @melin I'm working on the supporting of partition statistics update, it relies on workers to return detailed partition statistics.
   
   Can consider the table or partition statistics released, the user can listen to these statistics, convenient display, before the magic code to obtain statistics, not very standard.
   ![image](https://user-images.githubusercontent.com/1145830/205956341-9fd93383-6316-4d0b-b56d-a5a92309d19a.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jackylee-ch commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by "jackylee-ch (via GitHub)" <gi...@apache.org>.
jackylee-ch commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1521041727

   @melin I have opened another pr #39114 for this, and unfortunately that one was closed because of no feedback for a long time. If necessary, I can consider reopening it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics
URL: https://github.com/apache/spark/pull/38496


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
wankunde commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1356027243

   Hi, @jackylee-ch @melin any update ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a diff in pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
wankunde commented on code in PR #38496:
URL: https://github.com/apache/spark/pull/38496#discussion_r1017615989


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##########
@@ -53,19 +53,41 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 object CommandUtils extends Logging {
 
   /** Change statistics after changing data by commands. */
-  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
-    val catalog = sparkSession.sessionState.catalog
-    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      wroteStats: Option[WriteStats] = None): Unit = {
+    val sessionState = sparkSession.sessionState
+    val catalog = sessionState.catalog
+    if (sessionState.conf.autoSizeUpdateEnabled) {
       val newTable = catalog.getTableMetadata(table.identifier)
       val (newSize, newPartitions) = CommandUtils.calculateTotalSize(sparkSession, newTable)
-      val isNewStats = newTable.stats.map(newSize != _.sizeInBytes).getOrElse(true)
+      val isNewStats = newTable.stats.forall(newSize != _.sizeInBytes)
       if (isNewStats) {
         val newStats = CatalogStatistics(sizeInBytes = newSize)
         catalog.alterTableStats(table.identifier, Some(newStats))
       }
       if (newPartitions.nonEmpty) {
         catalog.alterPartitions(table.identifier, newPartitions)
       }
+    } else if (sessionState.conf.autoUpdateBasedOnMetricsEnabled && wroteStats.nonEmpty) {
+      val stat = wroteStats.get
+      stat.mode match {
+        case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
+          catalog.alterTableStats(table.identifier,
+            Some(CatalogStatistics(stat.numBytes, stat.numRows)))

Review Comment:
   @LuciferYang  Good idea,  update the code,  prefer to use wrote stats to update non-partition table statistics if possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1303256436

   cc @wangyum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1302809709

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
wankunde commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1324001070

   Retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
wankunde commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1379972144

   @LuciferYang @jackylee-ch Could you help to review this PR again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1381349599

   @wankunde cloud you resolve the conflicts?
   
   @wangyum Does this feature need to be finished in Spark 3.4.0?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a diff in pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
wankunde commented on code in PR #38496:
URL: https://github.com/apache/spark/pull/38496#discussion_r1015410975


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##########
@@ -53,19 +53,41 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 object CommandUtils extends Logging {
 
   /** Change statistics after changing data by commands. */
-  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
-    val catalog = sparkSession.sessionState.catalog
-    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      wroteStats: Option[WriteStats] = None): Unit = {
+    val sessionState = sparkSession.sessionState
+    val catalog = sessionState.catalog
+    if (sessionState.conf.autoSizeUpdateEnabled) {
       val newTable = catalog.getTableMetadata(table.identifier)
       val (newSize, newPartitions) = CommandUtils.calculateTotalSize(sparkSession, newTable)
-      val isNewStats = newTable.stats.map(newSize != _.sizeInBytes).getOrElse(true)
+      val isNewStats = newTable.stats.forall(newSize != _.sizeInBytes)
       if (isNewStats) {
         val newStats = CatalogStatistics(sizeInBytes = newSize)
         catalog.alterTableStats(table.identifier, Some(newStats))
       }
       if (newPartitions.nonEmpty) {
         catalog.alterPartitions(table.identifier, newPartitions)
       }
+    } else if (sessionState.conf.autoUpdateBasedOnMetricsEnabled && wroteStats.nonEmpty) {
+      val stat = wroteStats.get
+      stat.mode match {
+        case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
+          catalog.alterTableStats(table.identifier,
+            Some(CatalogStatistics(stat.numBytes, stat.numRows)))

Review Comment:
   Hi, @jackylee-ch  Thanks for your review.  It seems we can only update stats for overwriting non-partition table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38496:
URL: https://github.com/apache/spark/pull/38496#discussion_r1016538164


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##########
@@ -53,19 +53,41 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 object CommandUtils extends Logging {
 
   /** Change statistics after changing data by commands. */
-  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
-    val catalog = sparkSession.sessionState.catalog
-    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      wroteStats: Option[WriteStats] = None): Unit = {
+    val sessionState = sparkSession.sessionState
+    val catalog = sessionState.catalog
+    if (sessionState.conf.autoSizeUpdateEnabled) {
       val newTable = catalog.getTableMetadata(table.identifier)
       val (newSize, newPartitions) = CommandUtils.calculateTotalSize(sparkSession, newTable)
-      val isNewStats = newTable.stats.map(newSize != _.sizeInBytes).getOrElse(true)
+      val isNewStats = newTable.stats.forall(newSize != _.sizeInBytes)
       if (isNewStats) {
         val newStats = CatalogStatistics(sizeInBytes = newSize)
         catalog.alterTableStats(table.identifier, Some(newStats))
       }
       if (newPartitions.nonEmpty) {
         catalog.alterPartitions(table.identifier, newPartitions)
       }
+    } else if (sessionState.conf.autoUpdateBasedOnMetricsEnabled && wroteStats.nonEmpty) {
+      val stat = wroteStats.get
+      stat.mode match {
+        case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
+          catalog.alterTableStats(table.identifier,
+            Some(CatalogStatistics(stat.numBytes, stat.numRows)))

Review Comment:
   hmm... so for `overwriting non-partition table`,  if `autoSizeUpdateEnabled` is true, we cannot use `wroteStats` to update `statistics`?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38496:
URL: https://github.com/apache/spark/pull/38496#discussion_r1016539534


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##########
@@ -53,19 +53,33 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 object CommandUtils extends Logging {
 
   /** Change statistics after changing data by commands. */
-  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
-    val catalog = sparkSession.sessionState.catalog
-    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      wroteStats: Option[WriteStats] = None): Unit = {
+    val sessionState = sparkSession.sessionState
+    val catalog = sessionState.catalog
+    if (sessionState.conf.autoSizeUpdateEnabled) {
       val newTable = catalog.getTableMetadata(table.identifier)
       val (newSize, newPartitions) = CommandUtils.calculateTotalSize(sparkSession, newTable)
-      val isNewStats = newTable.stats.map(newSize != _.sizeInBytes).getOrElse(true)
+      val isNewStats = newTable.stats.forall(newSize != _.sizeInBytes)

Review Comment:
   Why change this line?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] melin commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

Posted by "melin (via GitHub)" <gi...@apache.org>.
melin commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1521033489

   > > Support partition statistics?
   > 
   > @melin I'm working on the supporting of partition statistics update, it relies on workers to return detailed partition statistics.
   
   Is it done?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org