Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/01/11 19:55:50 UTC

[GitHub] [spark] MaxGekk opened a new pull request #31135: [SPARK-34074][SQL] Update stats when table size changes

MaxGekk opened a new pull request #31135:
URL: https://github.com/apache/spark/pull/31135


   ### What changes were proposed in this pull request?
   Do not alter table stats if they are the same as those in the catalog (at least as of the most recent retrieval).
   
   ### Why are the changes needed?
   The changes reduce the number of calls to the Hive external catalog.
   
   ### Does this PR introduce _any_ user-facing change?
   Should not.
   
   ### How was this patch tested?
   By running the modified test suites:
   ```
   $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
   $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
   ```
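
The idea described above (skip the catalog write when the stats have not changed) can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `TableStats`, `CountingCatalog`, and `updateStatsIfChanged` are hypothetical names invented for illustration, and the toy catalog only counts calls so that the saving is visible.

```scala
// Hypothetical stand-ins for illustration; these are not Spark's classes.
final case class TableStats(sizeInBytes: BigInt)

// A toy catalog that counts how many calls reach it.
final class CountingCatalog(initial: Option[TableStats]) {
  private var stats: Option[TableStats] = initial
  var calls: Int = 0

  def getStats: Option[TableStats] = { calls += 1; stats }
  def alterStats(newStats: Option[TableStats]): Unit = { calls += 1; stats = newStats }
}

object StatsUpdateSketch {
  // Write stats back only when the freshly computed size differs from
  // the size already stored in the catalog. Returns true if a write happened.
  def updateStatsIfChanged(catalog: CountingCatalog, newSize: BigInt): Boolean = {
    val current = catalog.getStats
    if (current.exists(_.sizeInBytes == newSize)) {
      false // unchanged: skip the alter call
    } else {
      catalog.alterStats(Some(TableStats(newSize)))
      true
    }
  }
}
```

In this sketch an unchanged size costs one read and no write, which is the kind of saving the description is after when the catalog sits behind a network hop.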


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31135:
URL: https://github.com/apache/spark/pull/31135#discussion_r555487286



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableDropPartitionSuite.scala
##########
@@ -30,17 +30,24 @@ class AlterTableDropPartitionSuite
   with CommandSuiteBase {
 
   test("hive client calls") {
-    withNamespaceAndTable("ns", "tbl") { t =>
-      sql(s"CREATE TABLE $t (id int, part int) $defaultUsing PARTITIONED BY (part)")
-      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
-      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
-
-      checkHiveClientCalls(expected = 19) {
-        sql(s"ALTER TABLE $t DROP PARTITION (part=0)")
-      }
-      sql(s"CACHE TABLE $t")
-      checkHiveClientCalls(expected = 22) {
-        sql(s"ALTER TABLE $t DROP PARTITION (part=1)")
+    Seq(false, true).foreach { statsOn =>
+      withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> statsOn.toString) {
+        withNamespaceAndTable("ns", "tbl") { t =>
+          sql(s"CREATE TABLE $t (id int, part int) $defaultUsing PARTITIONED BY (part)")
+          sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+          sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+          sql(s"ALTER TABLE $t ADD PARTITION (part=2)") // empty partition
+          checkHiveClientCalls(expected = if (statsOn) 27 else 19) {
+            sql(s"ALTER TABLE $t DROP PARTITION (part=2)")
+          }
+          checkHiveClientCalls(expected = if (statsOn) 32 else 19) {

Review comment:
       Yeah, we should improve it.






[GitHub] [spark] AmplabJenkins commented on pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758332525


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133939/
   




[GitHub] [spark] SparkQA commented on pull request #31135: [SPARK-34074][SQL] Update stats when table size changes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758230661


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38527/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758204767


   **[Test build #133939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133939/testReport)** for PR 31135 at commit [`749593d`](https://github.com/apache/spark/commit/749593da199b36b4ab591468bf685f3c14baef60).




[GitHub] [spark] SparkQA commented on pull request #31135: [SPARK-34074][SQL] Update stats when table size changes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758204767


   **[Test build #133939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133939/testReport)** for PR 31135 at commit [`749593d`](https://github.com/apache/spark/commit/749593da199b36b4ab591468bf685f3c14baef60).




[GitHub] [spark] MaxGekk commented on a change in pull request #31135: [SPARK-34074][SQL] Update stats when table size changes

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31135:
URL: https://github.com/apache/spark/pull/31135#discussion_r555307842



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableDropPartitionSuite.scala
##########
@@ -30,17 +30,24 @@ class AlterTableDropPartitionSuite
   with CommandSuiteBase {
 
   test("hive client calls") {
-    withNamespaceAndTable("ns", "tbl") { t =>
-      sql(s"CREATE TABLE $t (id int, part int) $defaultUsing PARTITIONED BY (part)")
-      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
-      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
-
-      checkHiveClientCalls(expected = 19) {
-        sql(s"ALTER TABLE $t DROP PARTITION (part=0)")
-      }
-      sql(s"CACHE TABLE $t")
-      checkHiveClientCalls(expected = 22) {
-        sql(s"ALTER TABLE $t DROP PARTITION (part=1)")
+    Seq(false, true).foreach { statsOn =>
+      withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> statsOn.toString) {
+        withNamespaceAndTable("ns", "tbl") { t =>
+          sql(s"CREATE TABLE $t (id int, part int) $defaultUsing PARTITIONED BY (part)")
+          sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+          sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+          sql(s"ALTER TABLE $t ADD PARTITION (part=2)") // empty partition
+          checkHiveClientCalls(expected = if (statsOn) 27 else 19) {
+            sql(s"ALTER TABLE $t DROP PARTITION (part=2)")
+          }
+          checkHiveClientCalls(expected = if (statsOn) 32 else 19) {

Review comment:
       I understand that we don't care that much about DDL command performance, but 32 calls to the catalog (calls that can go over the network to an external DBMS)? @cloud-fan @HyukjinKwon @dongjoon-hyun Isn't that too much?






[GitHub] [spark] SparkQA commented on pull request #31135: [SPARK-34074][SQL] Update stats when table size changes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758245914


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38527/
   




[GitHub] [spark] cloud-fan commented on pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758371089


   thanks, merging to master!




[GitHub] [spark] MaxGekk commented on a change in pull request #31135: [SPARK-34074][SQL] Update stats when table size changes

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31135:
URL: https://github.com/apache/spark/pull/31135#discussion_r555302962



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLCommandTestUtils.scala
##########
@@ -91,4 +91,20 @@ trait DDLCommandTestUtils extends SQLTestUtils {
   }
 
   protected def checkLocation(t: String, spec: TablePartitionSpec, expected: String): Unit
+
+  // Getting the total table size in the filesystem in bytes
+  def getTableSize(tableName: String): Int = {

Review comment:
       This PR shares the function with https://github.com/apache/spark/pull/31131 . As soon as one of them is merged, I will rebase.
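
For readers without the diff at hand, here is a rough sketch of what "total table size in the filesystem in bytes" can mean. It is not the `getTableSize` helper from the PR, whose body is not shown in this thread: `TableSizeSketch.directorySize` is a hypothetical name, it takes a directory rather than a table name, and it returns `Long` as a choice made for this sketch only.

```scala
import java.io.File

object TableSizeSketch {
  // Recursively sum the lengths of all regular files under `root`.
  // A missing or unreadable path contributes 0 bytes.
  def directorySize(root: File): Long = {
    if (root.isFile) root.length()
    else Option(root.listFiles()).map(_.map(directorySize).sum).getOrElse(0L)
  }
}
```

A test could compare such a value before and after adding or dropping a partition to decide whether the table size really changed.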






[GitHub] [spark] AmplabJenkins commented on pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758265242


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38527/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758332525


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133939/
   




[GitHub] [spark] cloud-fan closed pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #31135:
URL: https://github.com/apache/spark/pull/31135


   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758265242


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38527/
   




[GitHub] [spark] SparkQA commented on pull request #31135: [SPARK-34074][SQL] Update stats only when table size changes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31135:
URL: https://github.com/apache/spark/pull/31135#issuecomment-758320090


   **[Test build #133939 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133939/testReport)** for PR 31135 at commit [`749593d`](https://github.com/apache/spark/commit/749593da199b36b4ab591468bf685f3c14baef60).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

