
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/13 09:32:21 UTC

[GitHub] [spark] LuciferYang opened a new pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

LuciferYang opened a new pull request #30026:
URL: https://github.com/apache/spark/pull/30026


   ### What changes were proposed in this pull request?
   
   This PR resolves SPARK-32978.
   
   The root cause of the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` reports only the number of newly added partitions for each task, which makes it impossible to remove duplicates on the driver side.
   
   This PR changes the tracker to report `partitionValues` to the driver and de-duplicates them there, so that the number of dynamic part metric is correct.
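   
   A minimal sketch of why per-task counts over-count while reported values can be de-duplicated. It is not the code in this PR: `DedupPartitionsSketch`, its methods, and the string-encoded partition values are hypothetical stand-ins for the tracker classes and `InternalRow`.
   
   ```scala
   // Hypothetical stand-in: each inner Seq holds the partition values one task wrote,
   // modelled as plain strings instead of Spark's InternalRow.
   object DedupPartitionsSketch {
     val perTask: Seq[Seq[String]] = Seq(
       Seq("part=0", "part=1"),
       Seq("part=1", "part=2"),
       Seq("part=0", "part=2"))
   
     // Count-only reporting: the driver can only add up what each task saw.
     def sumOfCounts(tasks: Seq[Seq[String]]): Int =
       tasks.map(_.size).sum
   
     // Value reporting: the driver merges the values and drops duplicates.
     def distinctAcrossTasks(tasks: Seq[Seq[String]]): Int =
       tasks.flatten.toSet.size
   }
   ```
   
   With the sample data, summing counts gives 6 while only 3 distinct partitions exist, which is the kind of mismatch SPARK-32978 describes.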
    
   ### Why are the changes needed?
   The number of dynamic part metric displayed on the UI should be correct.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added a new test case based on the scenario described in SPARK-32978.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709234551







[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714492579







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711456094


   **[Test build #129971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129971/testReport)** for PR 30026 at commit [`d630653`](https://github.com/apache/spark/commit/d6306530ff512678bdf5e683417f70250c07fca5).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709234551


   Merged build finished. Test FAILed.



[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709408879


   **[Test build #129843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129843/testReport)** for PR 30026 at commit [`9edf8ad`](https://github.com/apache/spark/commit/9edf8ad99b32ea4b198d13ab197330ff875ba70c).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709363138


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129828/
   Test FAILed.



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506446756



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
 
 
   override def newPartition(partitionValues: InternalRow): Unit = {
-    numPartitions += 1
+    partitions = partitions :+ partitionValues

Review comment:
       Addressed in 6d80788: changed to use `ArrayBuffer` to store `partitions`, and changed this line to `partitions.appended(partitionValues)`.
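   
   For comparison, a small sketch of the two ways of accumulating values discussed in this thread. `AppendSketch` and its members are hypothetical and use strings instead of `InternalRow`. The sketch uses `+=`, which appends in place on `ArrayBuffer` in both Scala 2.12 and 2.13. To my knowledge `appended` exists only from Scala 2.13 and returns a new collection instead of mutating the buffer, so treat the method name quoted above with care.
   
   ```scala
   import scala.collection.mutable.ArrayBuffer
   
   object AppendSketch {
     // Immutable Seq held in a var: `:+` builds a new collection on every call.
     var immutablePartitions: Seq[String] = Seq.empty
     def addImmutable(p: String): Unit = {
       immutablePartitions = immutablePartitions :+ p
     }
   
     // Mutable ArrayBuffer: `+=` appends in place (amortized constant time).
     val bufferedPartitions: ArrayBuffer[String] = ArrayBuffer.empty[String]
     def addBuffered(p: String): Unit = {
       bufferedPartitions += p
     }
   }
   ```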





[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507352925



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+  test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+    try {
+      val partitions = "50"
+      spark = SparkSession.builder().master("local[4]").getOrCreate()
+      val statusStore = spark.sharedState.statusStore
+      val oldExecutionsSize = statusStore.executionsList().size
+
+      spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+        "using parquet partitioned by (part)").collect()
+      spark.sql("insert overwrite table dynamic_partition partition(part) " +
+        s"select id, id % $partitions as part from range(10000)").collect()
+
+      // Wait for listener to finish computing the metrics for the executions.
+      while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+        statusStore.executionsList().last.metricValues == null) {
+        Thread.sleep(100)
+      }
+
+      // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,

Review comment:
       > why there are 4? is it because of collect?
   
   Yes, without `.collect` there would be 2.
   
   > BTW can we call val oldExecutionsSize = statusStore.executionsList().size after create table? then we just need to wait for one SQLExecutionUIData.
   
   Addressed in 15c7519.
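   
   The idea of taking the baseline after `create table` and then waiting for a single new execution can be sketched as below. This is an illustration only, not the code in 15c7519: `WaitForExecutionsSketch`, `waitForNew`, and the timeout parameter are hypothetical, and the size callback stands in for `statusStore.executionsList().size`.
   
   ```scala
   object WaitForExecutionsSketch {
     // Polls `currentSize` until it has grown by at least `expectedNew` over `baseline`,
     // or until `timeoutMs` has elapsed. Returns true if the growth was observed.
     def waitForNew(baseline: Int, expectedNew: Int, timeoutMs: Long)(
         currentSize: () => Int): Boolean = {
       val deadline = System.currentTimeMillis() + timeoutMs
       while (currentSize() - baseline < expectedNew &&
           System.currentTimeMillis() < deadline) {
         Thread.sleep(10)
       }
       currentSize() - baseline >= expectedNew
     }
   }
   ```
   
   Bounding the wait with a deadline is a design choice of this sketch; the loop in the diff above waits without a timeout.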





[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506291180



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
 
 
   override def newPartition(partitionValues: InternalRow): Unit = {
-    numPartitions += 1
+    partitions = partitions :+ partitionValues

Review comment:
       Or `add`? Doesn't Scala 2.12 have a Java-like method to add elements?





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714575551


   **[Test build #130152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130152/testReport)** for PR 30026 at commit [`4f47e20`](https://github.com/apache/spark/commit/4f47e205f2d130cf432b1a6c42589f919f2852e5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710010756


   **[Test build #129885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129885/testReport)** for PR 30026 at commit [`5769222`](https://github.com/apache/spark/commit/57692222be0ec45243c9fc574dd6ff06c87f9024).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506487800



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+    data.write.saveAsTable(sourceTable)
+    data.count()
+  }
+
+  def writeOnePartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part) " +

Review comment:
       nit: this can be merged into the previous line

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+    data.write.saveAsTable(sourceTable)
+    data.count()
+  }
+
+  def writeOnePartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part) " +
+        "select id, part1 as part from sourceTable")
+    }
+  }
+
+  def writeTwoPartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part1, part2) " +

Review comment:
       ditto





[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509930936



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {

Review comment:
       The `withTable` method in `DataSourceWriteBenchmark` is used (line 91) to clean up the created tables after the benchmark.






[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277


   Addressed in 724eee6: added a simple microbenchmark.
   
   EDIT: Addressed in 9edf8ad: refactored the microbenchmark to test more dynamic partition counts, run with JVM options `-Xmx4g -Xms4g`.
   
   **With this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark, totalRows = 200000
   Running case: one partition column, 100 partitions
   Stopped after 2 iterations, 10421 ms
   Running case: two partition columns, 500 partitions
   Stopped after 2 iterations, 49308 ms
   Running case: three partition columns, 2000 partitions
   Stopped after 2 iterations, 173533 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark, totalRows = 200000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------
   one partition column, 100 partitions                         4946           5211         374          0.0       24731.0       1.0X
   two partition columns, 500 partitions                       22929          24654        2440          0.0      114645.4       0.2X
   three partition columns, 2000 partitions                    82092          86767        2609          0.0      410461.3       0.1X
   
   
   ```
   
   **Without this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark, totalRows = 200000
     Running case: one partition column, 100 partitions
     Stopped after 2 iterations, 10252 ms
     Running case: two partition columns, 500 partitions
     Stopped after 2 iterations, 45089 ms
     Running case: three partition columns, 2000 partitions
     Stopped after 2 iterations, 198925 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark, totalRows = 200000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------
   one partition column, 100 partitions                         4840           5126         404          0.0       24201.4       1.0X
   two partition columns, 500 partitions                       20978          22545        2215          0.0      104892.0       0.2X
   three partition columns, 2000 partitions                    86858          99463        2043          0.0      434288.8       0.1X
   
   ```
   
   cc @cloud-fan There seems to be no essential difference; it looks better than expected.
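   
   To put numbers on "no essential difference", the relative change in Best Time(ms) between the two tables above can be computed as in this sketch. `BenchmarkDeltaSketch` is a hypothetical helper and not part of the PR; the figures are copied from the results above.
   
   ```scala
   object BenchmarkDeltaSketch {
     // (case name, Best Time(ms) with this PR, Best Time(ms) without this PR)
     val bestTimesMs: Seq[(String, Double, Double)] = Seq(
       ("one partition column, 100 partitions", 4946.0, 4840.0),
       ("two partition columns, 500 partitions", 22929.0, 20978.0),
       ("three partition columns, 2000 partitions", 82092.0, 86858.0))
   
     // Relative change of "with PR" against "without PR"; positive means slower.
     def relativeChange(withPr: Double, withoutPr: Double): Double =
       (withPr - withoutPr) / withoutPr
   
     def main(args: Array[String]): Unit =
       bestTimesMs.foreach { case (name, w, wo) =>
         println(f"$name%-42s ${relativeChange(w, wo) * 100}%+.1f%%")
       }
   }
   ```
   
   This gives roughly +2%, +9% and -5% for the three cases. Each case ran only 2 iterations and the reported Stdev(ms) values are of similar or larger size than these gaps, so the differences are hard to distinguish from noise.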



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710060401


   **[Test build #129898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129898/testReport)** for PR 30026 at commit [`6d80788`](https://github.com/apache/spark/commit/6d8078814886e94bd54d0058174a47e5ff607723).



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714437334


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34759/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711472782


   Merged build finished. Test FAILed.



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709932689


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34490/
   



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710011549







[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r503951004



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
 import org.apache.spark.util.SerializableConfiguration
 
 
+
 /**
  * Simple metrics collected during an instance of [[FileFormatDataWriter]].
  * These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
  */
 case class BasicWriteTaskStats(
-    numPartitions: Int,
+    partitions: Seq[InternalRow],

Review comment:
       A benchmark for latency or memory usage? I can try it ~





[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709449144







[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710303576


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129898/
   Test FAILed.



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707655719


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34347/
   



[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707652650


   @AngersZhuuuu That's what I'm worried about :( , but I haven't thought of a better way for the moment.
   
   



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506489291



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+  test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+    try {
+      val partitions = "50"
+      spark = SparkSession.builder().master("local[4]").getOrCreate()
+      val statusStore = spark.sharedState.statusStore
+      val oldExecutionsSize = statusStore.executionsList().size
+
+      spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+        "using parquet partitioned by (part)").collect()
+      spark.sql("insert overwrite table dynamic_partition partition(part) " +
+        s"select id, id % $partitions as part from range(10000)").collect()

Review comment:
       nit: we don't need to call `.collect` to trigger DDL/DML commands. `sql(...)` already does the job.





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714290148


   **[Test build #130137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130137/testReport)** for PR 30026 at commit [`fa01951`](https://github.com/apache/spark/commit/fa019515b88c0571394a4207226f33631aa04855).



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709217970


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34434/
   



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506151340



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -139,20 +142,22 @@ class BasicWriteJobStatsTracker(
 
   override def processStats(stats: Seq[WriteTaskStats]): Unit = {
     val sparkContext = SparkContext.getActive.get
-    var numPartitions: Long = 0L
+    var partitionsSet: mutable.Set[InternalRow] = mutable.HashSet.empty
     var numFiles: Long = 0L
     var totalNumBytes: Long = 0L
     var totalNumOutput: Long = 0L
 
     val basicStats = stats.map(_.asInstanceOf[BasicWriteTaskStats])
 
     basicStats.foreach { summary =>
-      numPartitions += summary.numPartitions
+      partitionsSet ++= summary.partitions
       numFiles += summary.numFiles
       totalNumBytes += summary.numBytes
       totalNumOutput += summary.numRows
     }
 
+    val numPartitions: Long = partitionsSet.size

Review comment:
       Addressed in 5769222.





[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506112571



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -139,20 +142,22 @@ class BasicWriteJobStatsTracker(
 
   override def processStats(stats: Seq[WriteTaskStats]): Unit = {
     val sparkContext = SparkContext.getActive.get
-    var numPartitions: Long = 0L
+    var partitionsSet: mutable.Set[InternalRow] = mutable.HashSet.empty
     var numFiles: Long = 0L
     var totalNumBytes: Long = 0L
     var totalNumOutput: Long = 0L
 
     val basicStats = stats.map(_.asInstanceOf[BasicWriteTaskStats])
 
     basicStats.foreach { summary =>
-      numPartitions += summary.numPartitions
+      partitionsSet ++= summary.partitions
       numFiles += summary.numFiles
       totalNumBytes += summary.numBytes
       totalNumOutput += summary.numRows
     }
 
+    val numPartitions: Long = partitionsSet.size

Review comment:
       nit: it's only used once, we can inline it.
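
       To make the change in this hunk concrete, here is a minimal sketch (not the actual Spark code) of why summing per-task counts over-counts and why a driver-side set fixes it. `PartitionCountSketch`, `TaskStats`, `summedCount` and `distinctCount` are hypothetical names, and `Seq[Int]` stands in for `InternalRow` so the sketch runs without Spark.

       ```scala
       // Hypothetical stand-ins: each task reports the partition values it wrote.
       // Seq[Int] replaces InternalRow so the sketch has no Spark dependency.
       object PartitionCountSketch {
         type PartitionValues = Seq[Int]

         final case class TaskStats(partitions: Seq[PartitionValues])

         // Old behaviour: add up per-task partition counts.
         // A partition written by two tasks is counted twice.
         def summedCount(stats: Seq[TaskStats]): Long =
           stats.map(_.partitions.size.toLong).sum

         // New behaviour: union the reported values on the driver, then count.
         def distinctCount(stats: Seq[TaskStats]): Long =
           stats.flatMap(_.partitions).toSet.size.toLong

         def main(args: Array[String]): Unit = {
           // Both tasks write to partition Seq(1).
           val stats = Seq(
             TaskStats(Seq(Seq(0), Seq(1))),
             TaskStats(Seq(Seq(1), Seq(2))))
           println(s"summed = ${summedCount(stats)}, distinct = ${distinctCount(stats)}")
         }
       }
       ```

       With the size taken directly from the set, as `distinctCount` does here, no intermediate `numPartitions` value is needed.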





[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711715975







[GitHub] [spark] cloud-fan commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714515351


   thanks, merging to master!



[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277


   Addressed in 724eee6: added a simple microbenchmark.
   
   **With this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark
     Running case: insert table with 10000 rows, one partition column, 10 partitions
     Stopped after 2 iterations, 2134 ms
     Running case: insert table with 10000 rows, one partition column, 50 partitions
     Stopped after 2 iterations, 7206 ms
     Running case: insert table with 10000 rows, one partition column, 100 partitions
     Stopped after 2 iterations, 9105 ms
     Running case: insert table with 10000 rows, one partition column, 200 partitions
     Stopped after 2 iterations, 14778 ms
     Running case: insert table with 10000 rows, one partition column, 500 partitions
     Stopped after 2 iterations, 42992 ms
     Running case: insert table with 10000 rows, two partition columns, 10 partitions
     Stopped after 2 iterations, 2331 ms
     Running case: insert table with 10000 rows, two partition columns, 50 partitions
     Stopped after 2 iterations, 6768 ms
     Running case: insert table with 10000 rows, two partition columns, 100 partitions
     Stopped after 2 iterations, 9274 ms
     Running case: insert table with 10000 rows, two partition columns, 200 partitions
     Stopped after 2 iterations, 17487 ms
     Running case: insert table with 10000 rows, two partition columns, 500 partitions
     Stopped after 2 iterations, 54044 ms
     Running case: insert table with 10000 rows, three partition columns, 10 partitions
     Stopped after 2 iterations, 2368 ms
     Running case: insert table with 10000 rows, three partition columns, 50 partitions
     Stopped after 2 iterations, 5538 ms
     Running case: insert table with 10000 rows, three partition columns, 100 partitions
     Stopped after 2 iterations, 11687 ms
     Running case: insert table with 10000 rows, three partition columns, 200 partitions
     Stopped after 2 iterations, 22371 ms
     Running case: insert table with 10000 rows, three partition columns, 500 partitions
     Stopped after 2 iterations, 55828 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark:                                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------------
   insert table with 10000 rows, one partition column, 10 partitions                922           1067         206          0.0       92182.2       1.0X
   insert table with 10000 rows, one partition column, 50 partitions               3265           3603         478          0.0      326535.4       0.3X
   insert table with 10000 rows, one partition column, 100 partitions              4390           4553         230          0.0      438988.3       0.2X
   insert table with 10000 rows, one partition column, 200 partitions              6585           7389        1137          0.0      658477.7       0.1X
   insert table with 10000 rows, one partition column, 500 partitions             20220          21496        1805          0.0     2022011.5       0.0X
   insert table with 10000 rows, two partition columns, 10 partitions              1114           1166          72          0.0      111432.2       0.8X
   insert table with 10000 rows, two partition columns, 50 partitions              2467           3384        1297          0.0      246670.3       0.4X
   insert table with 10000 rows, two partition columns, 100 partitions             4559           4637         110          0.0      455904.3       0.2X
   insert table with 10000 rows, two partition columns, 200 partitions             8631           8744         159          0.0      863130.8       0.1X
   insert table with 10000 rows, two partition columns, 500 partitions            23806          27022        1498          0.0     2380574.6       0.0X
   insert table with 10000 rows, three partition columns, 10 partitions            1096           1184         125          0.0      109639.4       0.8X
   insert table with 10000 rows, three partition columns, 50 partitions            2694           2769         107          0.0      269364.4       0.3X
   insert table with 10000 rows, three partition columns, 100 partitions           5701           5844         202          0.0      570137.3       0.2X
   insert table with 10000 rows, three partition columns, 200 partitions          11105          11186         115          0.0     1110452.3       0.1X
   insert table with 10000 rows, three partition columns, 500 partitions          26978          27914        1324          0.0     2697786.6       0.0X
   ```
   
   **Without this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark
     Running case: insert table with 10000 rows, one partition column, 10 partitions
     Stopped after 3 iterations, 2356 ms
     Running case: insert table with 10000 rows, one partition column, 50 partitions
     Stopped after 2 iterations, 6328 ms
     Running case: insert table with 10000 rows, one partition column, 100 partitions
     Stopped after 2 iterations, 8942 ms
     Running case: insert table with 10000 rows, one partition column, 200 partitions
     Stopped after 2 iterations, 17401 ms
     Running case: insert table with 10000 rows, one partition column, 500 partitions
     Stopped after 2 iterations, 43009 ms
     Running case: insert table with 10000 rows, two partition columns, 10 partitions
     Stopped after 3 iterations, 2024 ms
     Running case: insert table with 10000 rows, two partition columns, 50 partitions
     Stopped after 2 iterations, 4862 ms
     Running case: insert table with 10000 rows, two partition columns, 100 partitions
     Stopped after 2 iterations, 11229 ms
     Running case: insert table with 10000 rows, two partition columns, 200 partitions
     Stopped after 2 iterations, 18244 ms
     Running case: insert table with 10000 rows, two partition columns, 500 partitions
     Stopped after 2 iterations, 54922 ms
     Running case: insert table with 10000 rows, three partition columns, 10 partitions
     Stopped after 3 iterations, 2173 ms
     Running case: insert table with 10000 rows, three partition columns, 50 partitions
     Stopped after 2 iterations, 5660 ms
     Running case: insert table with 10000 rows, three partition columns, 100 partitions
     Stopped after 2 iterations, 14925 ms
     Running case: insert table with 10000 rows, three partition columns, 200 partitions
     Stopped after 2 iterations, 28378 ms
     Running case: insert table with 10000 rows, three partition columns, 500 partitions
     Stopped after 2 iterations, 59941 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark:                                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------------
   insert table with 10000 rows, one partition column, 10 partitions                715            785          98          0.0       71450.9       1.0X
   insert table with 10000 rows, one partition column, 50 partitions               3132           3164          46          0.0      313201.2       0.2X
   insert table with 10000 rows, one partition column, 100 partitions              4375           4471         136          0.0      437546.1       0.2X
   insert table with 10000 rows, one partition column, 200 partitions              7968           8701        1035          0.0      796846.9       0.1X
   insert table with 10000 rows, one partition column, 500 partitions             19208          21505         NaN          0.0     1920778.4       0.0X
   insert table with 10000 rows, two partition columns, 10 partitions               600            675          66          0.0       60027.5       1.2X
   insert table with 10000 rows, two partition columns, 50 partitions              2372           2431          83          0.0      237244.2       0.3X
   insert table with 10000 rows, two partition columns, 100 partitions             5385           5615         325          0.0      538471.6       0.1X
   insert table with 10000 rows, two partition columns, 200 partitions             8496           9122         885          0.0      849591.8       0.1X
   insert table with 10000 rows, two partition columns, 500 partitions            25747          27461        2424          0.0     2574722.7       0.0X
   insert table with 10000 rows, three partition columns, 10 partitions             687            725          35          0.0       68748.8       1.0X
   insert table with 10000 rows, three partition columns, 50 partitions            2757           2830         104          0.0      275692.0       0.3X
   insert table with 10000 rows, three partition columns, 100 partitions           6336           7463        1594          0.0      633568.3       0.1X
   insert table with 10000 rows, three partition columns, 200 partitions          14046          14189         202          0.0     1404645.4       0.1X
   insert table with 10000 rows, three partition columns, 500 partitions          26749          29971        1520          0.0     2674929.0       0.0X
   
   ```
   
   
   cc @cloud-fan There seems to be no essential difference.
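
   One rough way to read the two tables side by side is the ratio of best times per case. The sketch below does that arithmetic for the "one partition column" rows only; the numbers are copied from the `Best Time(ms)` columns above, and `BenchmarkRatioSketch` and its fields are hypothetical names. With only two or three iterations per case, these ratios are indicative at best.

   ```scala
   object BenchmarkRatioSketch {
     // Best Time(ms) for the "one partition column" rows, copied from the tables above.
     val partitions: Seq[Int] = Seq(10, 50, 100, 200, 500)
     val withPr: Seq[Double] = Seq(922.0, 3265.0, 4390.0, 6585.0, 20220.0)
     val withoutPr: Seq[Double] = Seq(715.0, 3132.0, 4375.0, 7968.0, 19208.0)

     // A ratio above 1 means the run with this PR was slower for that case.
     val ratios: Seq[Double] = withPr.zip(withoutPr).map { case (w, wo) => w / wo }

     def main(args: Array[String]): Unit =
       partitions.zip(ratios).foreach { case (p, r) =>
         println(f"$p%4d partitions: with/without = $r%.2f")
       }
   }
   ```

   The ratios fall on both sides of 1 across the five cases, which fits the reading that the difference is within run-to-run noise.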



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710079144


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34503/
   



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507311420



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+  test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+    try {
+      val partitions = "50"
+      spark = SparkSession.builder().master("local[4]").getOrCreate()
+      val statusStore = spark.sharedState.statusStore
+      val oldExecutionsSize = statusStore.executionsList().size
+
+      spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+        "using parquet partitioned by (part)").collect()
+      spark.sql("insert overwrite table dynamic_partition partition(part) " +
+        s"select id, id % $partitions as part from range(10000)").collect()

Review comment:
       There are some other problems when `.collect` is not called; let me re-check this.





[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711472782







[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509930936



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {

Review comment:
       `withTable` in `DataSourceWriteBenchmark` is used to clean up the created table resources after the benchmark.
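
The cleanup idea behind `withTable` can be sketched as a loan pattern: run the body, then drop the named tables even if the body throws. This is only an illustration under stated assumptions. `TableCleanupSketch` and its in-memory `catalog` are hypothetical stand-ins and are not Spark's API, and the real helper in `DataSourceWriteBenchmark` may differ in detail.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a table catalog; not Spark's API.
object TableCleanupSketch {
  val catalog: mutable.Set[String] = mutable.Set.empty[String]

  // Runs `body`, then drops every named table, even if `body` throws.
  def withTable(tableNames: String*)(body: => Unit): Unit = {
    try body
    finally tableNames.foreach(name => catalog -= name)
  }
}
```

A caller would create its tables inside the block, for example `withTable("t1") { ... }`, and could rely on `t1` being gone afterwards.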





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707663493


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34347/
   



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714491110


   **[Test build #130137 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130137/testReport)** for PR 30026 at commit [`fa01951`](https://github.com/apache/spark/commit/fa019515b88c0571394a4207226f33631aa04855).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277


   Addressed in 724eee6: added a simple microbenchmark.
   
   **With this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark
     Running case: insert table with 10000 rows, one partition column, 10 partitions
     Stopped after 2 iterations, 2134 ms
     Running case: insert table with 10000 rows, one partition column, 50 partitions
     Stopped after 2 iterations, 7206 ms
     Running case: insert table with 10000 rows, one partition column, 100 partitions
     Stopped after 2 iterations, 9105 ms
     Running case: insert table with 10000 rows, one partition column, 200 partitions
     Stopped after 2 iterations, 14778 ms
     Running case: insert table with 10000 rows, one partition column, 500 partitions
     Stopped after 2 iterations, 42992 ms
     Running case: insert table with 10000 rows, two partition columns, 10 partitions
     Stopped after 2 iterations, 2331 ms
     Running case: insert table with 10000 rows, two partition columns, 50 partitions
     Stopped after 2 iterations, 6768 ms
     Running case: insert table with 10000 rows, two partition columns, 100 partitions
     Stopped after 2 iterations, 9274 ms
     Running case: insert table with 10000 rows, two partition columns, 200 partitions
     Stopped after 2 iterations, 17487 ms
     Running case: insert table with 10000 rows, two partition columns, 500 partitions
     Stopped after 2 iterations, 54044 ms
     Running case: insert table with 10000 rows, three partition columns, 10 partitions
     Stopped after 2 iterations, 2368 ms
     Running case: insert table with 10000 rows, three partition columns, 50 partitions
     Stopped after 2 iterations, 5538 ms
     Running case: insert table with 10000 rows, three partition columns, 100 partitions
     Stopped after 2 iterations, 11687 ms
     Running case: insert table with 10000 rows, three partition columns, 200 partitions
     Stopped after 2 iterations, 22371 ms
     Running case: insert table with 10000 rows, three partition columns, 500 partitions
     Stopped after 2 iterations, 55828 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark:                                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------------
   insert table with 10000 rows, one partition column, 10 partitions                922           1067         206          0.0       92182.2       1.0X
   insert table with 10000 rows, one partition column, 50 partitions               3265           3603         478          0.0      326535.4       0.3X
   insert table with 10000 rows, one partition column, 100 partitions              4390           4553         230          0.0      438988.3       0.2X
   insert table with 10000 rows, one partition column, 200 partitions              6585           7389        1137          0.0      658477.7       0.1X
   insert table with 10000 rows, one partition column, 500 partitions             20220          21496        1805          0.0     2022011.5       0.0X
   insert table with 10000 rows, two partition columns, 10 partitions              1114           1166          72          0.0      111432.2       0.8X
   insert table with 10000 rows, two partition columns, 50 partitions              2467           3384        1297          0.0      246670.3       0.4X
   insert table with 10000 rows, two partition columns, 100 partitions             4559           4637         110          0.0      455904.3       0.2X
   insert table with 10000 rows, two partition columns, 200 partitions             8631           8744         159          0.0      863130.8       0.1X
   insert table with 10000 rows, two partition columns, 500 partitions            23806          27022        1498          0.0     2380574.6       0.0X
   insert table with 10000 rows, three partition columns, 10 partitions            1096           1184         125          0.0      109639.4       0.8X
   insert table with 10000 rows, three partition columns, 50 partitions            2694           2769         107          0.0      269364.4       0.3X
   insert table with 10000 rows, three partition columns, 100 partitions           5701           5844         202          0.0      570137.3       0.2X
   insert table with 10000 rows, three partition columns, 200 partitions          11105          11186         115          0.0     1110452.3       0.1X
   insert table with 10000 rows, three partition columns, 500 partitions          26978          27914        1324          0.0     2697786.6       0.0X
   ```
   
   **Without this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark
     Running case: insert table with 10000 rows, one partition column, 10 partitions
     Stopped after 3 iterations, 2356 ms
     Running case: insert table with 10000 rows, one partition column, 50 partitions
     Stopped after 2 iterations, 6328 ms
     Running case: insert table with 10000 rows, one partition column, 100 partitions
     Stopped after 2 iterations, 8942 ms
     Running case: insert table with 10000 rows, one partition column, 200 partitions
     Stopped after 2 iterations, 17401 ms
     Running case: insert table with 10000 rows, one partition column, 500 partitions
     Stopped after 2 iterations, 43009 ms
     Running case: insert table with 10000 rows, two partition columns, 10 partitions
     Stopped after 3 iterations, 2024 ms
     Running case: insert table with 10000 rows, two partition columns, 50 partitions
     Stopped after 2 iterations, 4862 ms
     Running case: insert table with 10000 rows, two partition columns, 100 partitions
     Stopped after 2 iterations, 11229 ms
     Running case: insert table with 10000 rows, two partition columns, 200 partitions
     Stopped after 2 iterations, 18244 ms
     Running case: insert table with 10000 rows, two partition columns, 500 partitions
     Stopped after 2 iterations, 54922 ms
     Running case: insert table with 10000 rows, three partition columns, 10 partitions
     Stopped after 3 iterations, 2173 ms
     Running case: insert table with 10000 rows, three partition columns, 50 partitions
     Stopped after 2 iterations, 5660 ms
     Running case: insert table with 10000 rows, three partition columns, 100 partitions
     Stopped after 2 iterations, 14925 ms
     Running case: insert table with 10000 rows, three partition columns, 200 partitions
     Stopped after 2 iterations, 28378 ms
     Running case: insert table with 10000 rows, three partition columns, 500 partitions
     Stopped after 2 iterations, 59941 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark:                                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------------
   insert table with 10000 rows, one partition column, 10 partitions                715            785          98          0.0       71450.9       1.0X
   insert table with 10000 rows, one partition column, 50 partitions               3132           3164          46          0.0      313201.2       0.2X
   insert table with 10000 rows, one partition column, 100 partitions              4375           4471         136          0.0      437546.1       0.2X
   insert table with 10000 rows, one partition column, 200 partitions              7968           8701        1035          0.0      796846.9       0.1X
   insert table with 10000 rows, one partition column, 500 partitions             19208          21505         NaN          0.0     1920778.4       0.0X
   insert table with 10000 rows, two partition columns, 10 partitions               600            675          66          0.0       60027.5       1.2X
   insert table with 10000 rows, two partition columns, 50 partitions              2372           2431          83          0.0      237244.2       0.3X
   insert table with 10000 rows, two partition columns, 100 partitions             5385           5615         325          0.0      538471.6       0.1X
   insert table with 10000 rows, two partition columns, 200 partitions             8496           9122         885          0.0      849591.8       0.1X
   insert table with 10000 rows, two partition columns, 500 partitions            25747          27461        2424          0.0     2574722.7       0.0X
   insert table with 10000 rows, three partition columns, 10 partitions             687            725          35          0.0       68748.8       1.0X
   insert table with 10000 rows, three partition columns, 50 partitions            2757           2830         104          0.0      275692.0       0.3X
   insert table with 10000 rows, three partition columns, 100 partitions           6336           7463        1594          0.0      633568.3       0.1X
   insert table with 10000 rows, three partition columns, 200 partitions          14046          14189         202          0.0     1404645.4       0.1X
   insert table with 10000 rows, three partition columns, 500 partitions          26749          29971        1520          0.0     2674929.0       0.0X
   
   ```
   
   
   cc @cloud-fan: there seems to be no essential difference.
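
One way to read the two tables side by side is to compute the relative change in best time for matching rows. The sketch below does that for the three 500-partition cases only, using the "Best Time(ms)" values copied from the tables above. `BenchmarkDeltaSketch` and `relativeChangePercent` are names introduced here for illustration. Each case ran only two iterations, so small differences are likely within noise.

```scala
object BenchmarkDeltaSketch {
  // (case label, best ms with the PR, best ms without the PR), copied from the tables above.
  val bestTimesMs: Seq[(String, Long, Long)] = Seq(
    ("one partition column, 500 partitions", 20220L, 19208L),
    ("two partition columns, 500 partitions", 23806L, 25747L),
    ("three partition columns, 500 partitions", 26978L, 26749L)
  )

  // Positive means slower with the PR; negative means faster.
  def relativeChangePercent(withPr: Long, withoutPr: Long): Double =
    (withPr - withoutPr).toDouble / withoutPr * 100.0

  def main(args: Array[String]): Unit =
    bestTimesMs.foreach { case (label, withPr, withoutPr) =>
      println(f"$label%-45s ${relativeChangePercent(withPr, withoutPr)}%+.1f%%")
    }
}
```

For these three rows the sign of the change is mixed (two slower, one faster), which is consistent with there being no systematic regression at this scale.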



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711480889


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34581/
   



[GitHub] [spark] cloud-fan commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709868109


   > return size is partition num * shuffle num always can be millions level
   
   I thought about it. If a table has 10k partitions, it's unlikely that each write task touches all the 10k partitions. So the total size is not that large.
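
The difference between summing per-task counts and merging the reported partition values on the driver can be sketched as follows. This is a minimal illustration, not the PR's implementation: `TaskStats` and `PartitionMetricSketch` are hypothetical names, and partition values are modelled as plain strings.

```scala
// Hypothetical type for illustration; not Spark's actual class.
final case class TaskStats(partitions: Set[String])

object PartitionMetricSketch {
  // Summing per-task counts over-counts partitions touched by several tasks.
  def sumOfCounts(stats: Seq[TaskStats]): Int =
    stats.map(_.partitions.size).sum

  // Merging the reported values on the driver counts each partition once.
  def distinctCount(stats: Seq[TaskStats]): Int =
    stats.flatMap(_.partitions).toSet.size
}
```

With four tasks that each write to the same 50 partitions, `sumOfCounts` gives 200 while `distinctCount` gives 50. The data sent to the driver grows with the number of partitions each task actually touches, which is the quantity the comment above argues stays moderate.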



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714335764


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34744/
   



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709449095


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34449/
   



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506295439



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+
+    data.createOrReplaceTempView("tmpTable")
+
+    spark.sql("create table " +
+      "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet")
+
+    spark.sql("insert overwrite table sourceTable " +
+      s"select id, " +
+      s"part1, " +
+      s"part2, " +
+      s"part3 " +
+      s"from tmpTable")
+
+    spark.catalog.dropTempView("tmpTable")
+    data.count()
+  }
+
+  def prepareTable(): Unit = {
+    spark.sql("create table " +
+      "tableOnePartitionColumn(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    spark.sql("create table " +
+      "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    spark.sql("create table " +
+      "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet partitioned by (part1, part2, part3)")
+  }
+
+  def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableOnePartitionColumn partition(part) " +
+        s"select id, " +
+        s"part1 as part " +
+        s"from sourceTable")
+    }
+  }
+
+  def writeTwoPartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableTwoPartitionColumn partition(part1, part2) " +
+        s"select id, " +
+        s"part1, " +
+        s"part2 " +
+        s"from sourceTable")
+    }
+  }
+
+  def writeThreePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableThreePartitionColumn partition(part1, part2, part3) " +
+        s"select id, " +
+        s"part1, " +
+        s"part2, " +
+        s"part3 " +
+        s"from sourceTable")
+    }
+  }
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    val sourceTable = "sourceTable"
+    val tableOnePartitionColumn = "tableOnePartitionColumn"

Review comment:
       This can be simpler: `val onePartTable = ...`





[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707619254


   cc @wangyum, the current result should be correct, but it will increase memory and network pressure.



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507352925



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+  test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+    try {
+      val partitions = "50"
+      spark = SparkSession.builder().master("local[4]").getOrCreate()
+      val statusStore = spark.sharedState.statusStore
+      val oldExecutionsSize = statusStore.executionsList().size
+
+      spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+        "using parquet partitioned by (part)").collect()
+      spark.sql("insert overwrite table dynamic_partition partition(part) " +
+        s"select id, id % $partitions as part from range(10000)").collect()
+
+      // Wait for listener to finish computing the metrics for the executions.
+      while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+        statusStore.executionsList().last.metricValues == null) {
+        Thread.sleep(100)
+      }
+
+      // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,

Review comment:
       Addressed in 15c7519, which fixes this.





[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711488360







[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509898355



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"

Review comment:
       We need to run it and commit the benchmark result to the codebase.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709932719







[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506491449



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+  test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+    try {
+      val partitions = "50"
+      spark = SparkSession.builder().master("local[4]").getOrCreate()
+      val statusStore = spark.sharedState.statusStore
+      val oldExecutionsSize = statusStore.executionsList().size
+
+      spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+        "using parquet partitioned by (part)").collect()
+      spark.sql("insert overwrite table dynamic_partition partition(part) " +
+        s"select id, id % $partitions as part from range(10000)").collect()
+
+      // Wait for listener to finish computing the metrics for the executions.
+      while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+        statusStore.executionsList().last.metricValues == null) {
+        Thread.sleep(100)
+      }
+
+      // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,

Review comment:
       Why are there 4? Is it because of `collect`?
   
   BTW, can we call `val oldExecutionsSize = statusStore.executionsList().size` after creating the table? Then we would only need to wait for one `SQLExecutionUIData`.
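
The suggestion above can be sketched without Spark. This is only an illustration, not the test from the PR: `executions` is a hypothetical stand-in for `statusStore.executionsList()`, and `waitUntil` is a made-up helper. The baseline size is captured after the setup statement, so the loop waits for exactly one new entry, and a timeout keeps the wait from hanging forever if the entry never arrives.

```scala
import scala.collection.mutable.ArrayBuffer

object WaitForOneExecution {
  // Hypothetical stand-in for statusStore.executionsList(); not Spark's API.
  val executions = ArrayBuffer.empty[String]

  /** Polls `cond` every `intervalMs` until it holds or `timeoutMs` elapses. */
  def waitUntil(timeoutMs: Long, intervalMs: Long = 10)(cond: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var ok = cond
    while (!ok && System.nanoTime() < deadline) {
      Thread.sleep(intervalMs)
      ok = cond
    }
    ok
  }

  def main(args: Array[String]): Unit = {
    executions += "create table"      // setup finishes first
    val baseline = executions.size    // baseline is taken AFTER setup

    // Simulate the listener recording the insert a little later.
    val listener = new Thread(() => {
      Thread.sleep(50)
      executions.synchronized { executions += "insert overwrite" }
    })
    listener.start()

    // Only one new entry is expected relative to the baseline.
    val seen = waitUntil(timeoutMs = 5000) {
      executions.synchronized { executions.size - baseline >= 1 }
    }
    listener.join()
    println(s"new execution observed: $seen")
  }
}
```

With the baseline taken after setup, the "3rd of 4 items" bookkeeping goes away: the entry of interest is simply the first one past the baseline.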





[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710088277







[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509899679



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()

Review comment:
       Do we have to override it? The default single-threaded Spark session makes the benchmark results easier to reason about.





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709438264


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34449/
   



[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714530492


   Thanks, @cloud-fan @wangyum @AngersZhuuuu



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709555923







[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710011549







[GitHub] [spark] cloud-fan closed pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #30026:
URL: https://github.com/apache/spark/pull/30026


   



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711713292


   **[Test build #129973 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129973/testReport)** for PR 30026 at commit [`15c7519`](https://github.com/apache/spark/commit/15c751961e43ea744f01f9f3264e487cb0254c36).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710303567


   Merged build finished. Test FAILed.



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506293143



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+
+    data.createOrReplaceTempView("tmpTable")

Review comment:
       BTW, is this the same as `data.write.saveAsTable("sourceTable")`?
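
As background for this question: a temp view only registers a name for the query, which is evaluated again on each read unless it is cached, whereas `saveAsTable` runs the query once and stores the rows. The Spark-free sketch below is only an analogy under that assumption. `query`, `evaluationsViaView`, and `evaluationsViaTable` are hypothetical names, and a plain counter stands in for the cost of the join.

```scala
object ViewVsTable {
  // Counts how many times the "query" below is evaluated.
  private var evaluations = 0

  // Hypothetical stand-in for the joined DataFrame `data`.
  private def query(): Vector[Int] = {
    evaluations += 1
    (0 until 5).map(_ * 2).toVector
  }

  /** Like a temp view: only a name for the query, so every read re-runs it. */
  def evaluationsViaView(reads: Int): Int = {
    evaluations = 0
    val tmpView: () => Vector[Int] = () => query()
    (1 to reads).foreach(_ => tmpView())
    evaluations
  }

  /** Like a saved table: the query runs once and the stored rows are reused. */
  def evaluationsViaTable(reads: Int): Int = {
    evaluations = 0
    val sourceTable: Vector[Int] = query()
    (1 to reads).foreach(_ => sourceTable.size)
    evaluations
  }

  def main(args: Array[String]): Unit = {
    println(s"view:  ${evaluationsViaView(3)} evaluations for 3 reads")
    println(s"table: ${evaluationsViaTable(3)} evaluations for 3 reads")
  }
}
```

For a benchmark, the distinction matters because reading from a view would fold the cost of producing the source rows into every measured insert.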





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714451930


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34759/
   



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506110093



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
 import org.apache.spark.util.SerializableConfiguration
 
 
+

Review comment:
       Unnecessary change.





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714409433


   **[Test build #130152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130152/testReport)** for PR 30026 at commit [`4f47e20`](https://github.com/apache/spark/commit/4f47e205f2d130cf432b1a6c42589f919f2852e5).



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709049996


   **[Test build #129828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129828/testReport)** for PR 30026 at commit [`724eee6`](https://github.com/apache/spark/commit/724eee6acfd3754ab14d83d132df492700988cfc).



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509899465



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {

Review comment:
       It seems we don't use the utility methods in `DataSourceWriteBenchmark`. I think we can extend `SqlBasedBenchmark` directly.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714577119







[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506292321



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+
+    data.createOrReplaceTempView("tmpTable")

Review comment:
       `tmpTable` -> `tmpView`





[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509930936



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {

Review comment:
       `withTable` from `DataSourceWriteBenchmark` is used (line 91) to clean up the created tables after the benchmark runs.
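
The cleanup role described here follows the usual loan pattern: run a body, then drop the named tables in a `finally` block so they are removed even when the body throws. The sketch below shows only the general shape and is not the helper from `DataSourceWriteBenchmark`. The in-memory `catalog` is a hypothetical stand-in for the session catalog, and removing a name from it stands in for dropping a table.

```scala
import scala.collection.mutable

object WithTableSketch {
  // Hypothetical in-memory "catalog"; stands in for the session catalog.
  val catalog = mutable.Set.empty[String]

  /** Runs `body`, then drops the named tables even if `body` throws. */
  def withTable(tableNames: String*)(body: => Unit): Unit = {
    try body
    finally tableNames.foreach(name => catalog.remove(name))
  }

  def main(args: Array[String]): Unit = {
    withTable("sourceTable", "targetTable") {
      catalog += "sourceTable"
      catalog += "targetTable"
      println(s"inside: ${catalog.size} tables")
    }
    println(s"after: ${catalog.size} tables")
  }
}
```

Because the body is a by-name parameter, the benchmark code reads like an ordinary block while the helper keeps ownership of the cleanup.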





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709899037


   **[Test build #129885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129885/testReport)** for PR 30026 at commit [`5769222`](https://github.com/apache/spark/commit/57692222be0ec45243c9fc574dd6ff06c87f9024).



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709237402







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709234528


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34434/
   



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711472769


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34579/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711488360







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714318647


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34744/
   



[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707621997


   **[Test build #129741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129741/testReport)** for PR 30026 at commit [`0769ac1`](https://github.com/apache/spark/commit/0769ac1fbe3b379fc6482e88095e57a895b149f2).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709363125


   Merged build finished. Test FAILed.



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506448136



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+
+    data.createOrReplaceTempView("tmpTable")
+
+    spark.sql("create table " +
+      "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet")
+
+    spark.sql("insert overwrite table sourceTable " +
+      s"select id, " +
+      s"part1, " +
+      s"part2, " +
+      s"part3 " +
+      s"from tmpTable")
+
+    spark.catalog.dropTempView("tmpTable")
+    data.count()
+  }
+
+  def prepareTable(): Unit = {
+    spark.sql("create table " +
+      "tableOnePartitionColumn(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    spark.sql("create table " +
+      "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    spark.sql("create table " +
+      "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet partitioned by (part1, part2, part3)")
+  }
+
+  def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableOnePartitionColumn partition(part) " +

Review comment:
       done~

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+
+    data.createOrReplaceTempView("tmpTable")
+
+    spark.sql("create table " +
+      "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet")
+
+    spark.sql("insert overwrite table sourceTable " +
+      s"select id, " +
+      s"part1, " +
+      s"part2, " +
+      s"part3 " +
+      s"from tmpTable")
+
+    spark.catalog.dropTempView("tmpTable")
+    data.count()
+  }
+
+  def prepareTable(): Unit = {
+    spark.sql("create table " +
+      "tableOnePartitionColumn(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    spark.sql("create table " +
+      "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    spark.sql("create table " +
+      "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet partitioned by (part1, part2, part3)")
+  }
+
+  def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableOnePartitionColumn partition(part) " +
+        s"select id, " +
+        s"part1 as part " +
+        s"from sourceTable")
+    }
+  }
+
+  def writeTwoPartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableTwoPartitionColumn partition(part1, part2) " +
+        s"select id, " +
+        s"part1, " +
+        s"part2 " +
+        s"from sourceTable")
+    }
+  }
+
+  def writeThreePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableThreePartitionColumn partition(part1, part2, part3) " +
+        s"select id, " +
+        s"part1, " +
+        s"part2, " +
+        s"part3 " +
+        s"from sourceTable")
+    }
+  }
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    val sourceTable = "sourceTable"
+    val tableOnePartitionColumn = "tableOnePartitionColumn"

Review comment:
       done ~





[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506295439



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+
+    data.createOrReplaceTempView("tmpTable")
+
+    spark.sql("create table " +
+      "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet")
+
+    spark.sql("insert overwrite table sourceTable " +
+      s"select id, " +
+      s"part1, " +
+      s"part2, " +
+      s"part3 " +
+      s"from tmpTable")
+
+    spark.catalog.dropTempView("tmpTable")
+    data.count()
+  }
+
+  def prepareTable(): Unit = {
+    spark.sql("create table " +
+      "tableOnePartitionColumn(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    spark.sql("create table " +
+      "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    spark.sql("create table " +
+      "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet partitioned by (part1, part2, part3)")
+  }
+
+  def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableOnePartitionColumn partition(part) " +
+        s"select id, " +
+        s"part1 as part " +
+        s"from sourceTable")
+    }
+  }
+
+  def writeTwoPartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableTwoPartitionColumn partition(part1, part2) " +
+        s"select id, " +
+        s"part1, " +
+        s"part2 " +
+        s"from sourceTable")
+    }
+  }
+
+  def writeThreePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableThreePartitionColumn partition(part1, part2, part3) " +
+        s"select id, " +
+        s"part1, " +
+        s"part2, " +
+        s"part3 " +
+        s"from sourceTable")
+    }
+  }
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    val sourceTable = "sourceTable"
+    val tableOnePartitionColumn = "tableOnePartitionColumn"

Review comment:
       This can be simpler: `val onePartColTable = ...`





[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710088277







[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277


   Commit 724eee6 adds a simple microbenchmark.
   
   EDIT: Commit 9edf8ad refactors the microbenchmark to test larger numbers of dynamic partitions.
   
   **With this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark, totalRows = 200000
   Running case: one partition column, 100 partitions
   Stopped after 2 iterations, 10421 ms
   Running case: two partition columns, 500 partitions
   Stopped after 2 iterations, 49308 ms
   Running case: three partition columns, 2000 partitions
   Stopped after 2 iterations, 173533 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark, totalRows = 200000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------
   one partition column, 100 partitions                         4946           5211         374          0.0       24731.0       1.0X
   two partition columns, 500 partitions                       22929          24654        2440          0.0      114645.4       0.2X
   three partition columns, 2000 partitions                    82092          86767        2609          0.0      410461.3       0.1X
   
   
   ```
   
   **Without this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark, totalRows = 200000
     Running case: one partition column, 100 partitions
     Stopped after 2 iterations, 10252 ms
     Running case: two partition columns, 500 partitions
     Stopped after 2 iterations, 45089 ms
     Running case: three partition columns, 2000 partitions
     Stopped after 2 iterations, 198925 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark, totalRows = 200000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------
   one partition column, 100 partitions                         4840           5126         404          0.0       24201.4       1.0X
   two partition columns, 500 partitions                       20978          22545        2215          0.0      104892.0       0.2X
   three partition columns, 2000 partitions                    86858          99463        2043          0.0      434288.8       0.1X
   
   ```
   
   cc @cloud-fan: there seems to be no essential difference.
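
The comparison can be checked with a little arithmetic on the Best Time(ms) columns of the two tables posted in this comment. The sketch below is illustrative only: the `Case` class and `percentChange` helper are made-up names, and the numbers are copied by hand from the tables.

```scala
object BenchmarkDelta {
  // Best Time (ms) values copied from the two result tables in this comment.
  final case class Case(name: String, withPrMs: Long, withoutPrMs: Long)

  val cases: Seq[Case] = Seq(
    Case("one partition column, 100 partitions", 4946L, 4840L),
    Case("two partition columns, 500 partitions", 22929L, 20978L),
    Case("three partition columns, 2000 partitions", 82092L, 86858L)
  )

  // Relative change of the "with PR" time against the "without PR" time.
  // A positive value means the run with the PR was slower.
  def percentChange(c: Case): Double =
    (c.withPrMs - c.withoutPrMs).toDouble / c.withoutPrMs * 100.0

  def main(args: Array[String]): Unit =
    cases.foreach { c =>
      println(f"${c.name}%-42s ${percentChange(c)}%+.1f%%")
    }
}
```

The three changes work out to roughly +2.2%, +9.3%, and -5.5%. The sign flips between cases, and each case ran for only two iterations with reported standard deviations between 374 ms and 2609 ms, which is consistent with reading the differences as noise.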



[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277


   Commit 724eee6 adds a simple microbenchmark.
   
   **With this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark
     Running case: one partition column, 10 partitions
     Stopped after 2 iterations, 2134 ms
     Running case: one partition column, 50 partitions
     Stopped after 2 iterations, 7206 ms
     Running case: one partition column, 100 partitions
     Stopped after 2 iterations, 9105 ms
     Running case: one partition column, 200 partitions
     Stopped after 2 iterations, 14778 ms
     Running case: one partition column, 500 partitions
     Stopped after 2 iterations, 42992 ms
     Running case: two partition columns, 10 partitions
     Stopped after 2 iterations, 2331 ms
     Running case: two partition columns, 50 partitions
     Stopped after 2 iterations, 6768 ms
     Running case: two partition columns, 100 partitions
     Stopped after 2 iterations, 9274 ms
     Running case: two partition columns, 200 partitions
     Stopped after 2 iterations, 17487 ms
     Running case: two partition columns, 500 partitions
     Stopped after 2 iterations, 54044 ms
     Running case: three partition columns, 10 partitions
     Stopped after 2 iterations, 2368 ms
     Running case: three partition columns, 50 partitions
     Stopped after 2 iterations, 5538 ms
     Running case: three partition columns, 100 partitions
     Stopped after 2 iterations, 11687 ms
     Running case: three partition columns, 200 partitions
     Stopped after 2 iterations, 22371 ms
     Running case: three partition columns, 500 partitions
     Stopped after 2 iterations, 55828 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark:                                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------------
   one partition column, 10 partitions                922           1067         206          0.0       92182.2       1.0X
   one partition column, 50 partitions               3265           3603         478          0.0      326535.4       0.3X
   one partition column, 100 partitions              4390           4553         230          0.0      438988.3       0.2X
   one partition column, 200 partitions              6585           7389        1137          0.0      658477.7       0.1X
   one partition column, 500 partitions             20220          21496        1805          0.0     2022011.5       0.0X
   two partition columns, 10 partitions              1114           1166          72          0.0      111432.2       0.8X
   two partition columns, 50 partitions              2467           3384        1297          0.0      246670.3       0.4X
   two partition columns, 100 partitions             4559           4637         110          0.0      455904.3       0.2X
   two partition columns, 200 partitions             8631           8744         159          0.0      863130.8       0.1X
   two partition columns, 500 partitions            23806          27022        1498          0.0     2380574.6       0.0X
   three partition columns, 10 partitions            1096           1184         125          0.0      109639.4       0.8X
   three partition columns, 50 partitions            2694           2769         107          0.0      269364.4       0.3X
   three partition columns, 100 partitions           5701           5844         202          0.0      570137.3       0.2X
   three partition columns, 200 partitions          11105          11186         115          0.0     1110452.3       0.1X
   three partition columns, 500 partitions          26978          27914        1324          0.0     2697786.6       0.0X
   ```
   
   **Without this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark
     Running case: one partition column, 10 partitions
     Stopped after 3 iterations, 2356 ms
     Running case: one partition column, 50 partitions
     Stopped after 2 iterations, 6328 ms
     Running case: one partition column, 100 partitions
     Stopped after 2 iterations, 8942 ms
     Running case: one partition column, 200 partitions
     Stopped after 2 iterations, 17401 ms
     Running case: one partition column, 500 partitions
     Stopped after 2 iterations, 43009 ms
     Running case: two partition columns, 10 partitions
     Stopped after 3 iterations, 2024 ms
     Running case: two partition columns, 50 partitions
     Stopped after 2 iterations, 4862 ms
     Running case: two partition columns, 100 partitions
     Stopped after 2 iterations, 11229 ms
     Running case: two partition columns, 200 partitions
     Stopped after 2 iterations, 18244 ms
     Running case: two partition columns, 500 partitions
     Stopped after 2 iterations, 54922 ms
     Running case: three partition columns, 10 partitions
     Stopped after 3 iterations, 2173 ms
     Running case: three partition columns, 50 partitions
     Stopped after 2 iterations, 5660 ms
     Running case: three partition columns, 100 partitions
     Stopped after 2 iterations, 14925 ms
     Running case: three partition columns, 200 partitions
     Stopped after 2 iterations, 28378 ms
     Running case: three partition columns, 500 partitions
     Stopped after 2 iterations, 59941 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark:                                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------------
   one partition column, 10 partitions                715            785          98          0.0       71450.9       1.0X
   one partition column, 50 partitions               3132           3164          46          0.0      313201.2       0.2X
   one partition column, 100 partitions              4375           4471         136          0.0      437546.1       0.2X
   one partition column, 200 partitions              7968           8701        1035          0.0      796846.9       0.1X
   one partition column, 500 partitions             19208          21505         NaN          0.0     1920778.4       0.0X
   two partition columns, 10 partitions               600            675          66          0.0       60027.5       1.2X
   two partition columns, 50 partitions              2372           2431          83          0.0      237244.2       0.3X
   two partition columns, 100 partitions             5385           5615         325          0.0      538471.6       0.1X
   two partition columns, 200 partitions             8496           9122         885          0.0      849591.8       0.1X
   two partition columns, 500 partitions            25747          27461        2424          0.0     2574722.7       0.0X
   three partition columns, 10 partitions             687            725          35          0.0       68748.8       1.0X
   three partition columns, 50 partitions            2757           2830         104          0.0      275692.0       0.3X
   three partition columns, 100 partitions           6336           7463        1594          0.0      633568.3       0.1X
   three partition columns, 200 partitions          14046          14189         202          0.0     1404645.4       0.1X
   three partition columns, 500 partitions          26749          29971        1520          0.0     2674929.0       0.0X
   
   ```
   
   cc @cloud-fan: there seems to be no essential difference between the two runs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r510065629



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"

Review comment:
       Added 4f47e20 to upload the benchmark result file.





[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714409433


   **[Test build #130152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130152/testReport)** for PR 30026 at commit [`4f47e20`](https://github.com/apache/spark/commit/4f47e205f2d130cf432b1a6c42589f919f2852e5).



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506488151



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+    data.write.saveAsTable(sourceTable)
+    data.count()
+  }
+
+  def writeOnePartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part) " +
+        "select id, part1 as part from sourceTable")
+    }
+  }
+
+  def writeTwoPartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part1, part2) " +
+        "select id, part1, part2 from sourceTable")
+    }
+  }
+
+  def writeThreePartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet partitioned by (part1, part2, part3)")
+    benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part1, part2, part3) " +

Review comment:
       ditto





[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714335785







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709554675


   **[Test build #129843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129843/testReport)** for PR 30026 at commit [`9edf8ad`](https://github.com/apache/spark/commit/9edf8ad99b32ea4b198d13ab197330ff875ba70c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711469815


   **[Test build #129973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129973/testReport)** for PR 30026 at commit [`15c7519`](https://github.com/apache/spark/commit/15c751961e43ea744f01f9f3264e487cb0254c36).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709237402







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711469815


   **[Test build #129973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129973/testReport)** for PR 30026 at commit [`15c7519`](https://github.com/apache/spark/commit/15c751961e43ea744f01f9f3264e487cb0254c36).



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709078492


   **[Test build #129831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129831/testReport)** for PR 30026 at commit [`6127fa5`](https://github.com/apache/spark/commit/6127fa56abf72ba5112092b9000200934e199b59).



[GitHub] [spark] AngersZhuuuu commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707640761


   I thought about doing it this way too, but if the number of partitions is very large, returning the partition values to the driver will take up too much memory.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707776740


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129741/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506112213



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -139,20 +142,22 @@ class BasicWriteJobStatsTracker(
 
   override def processStats(stats: Seq[WriteTaskStats]): Unit = {
     val sparkContext = SparkContext.getActive.get
-    var numPartitions: Long = 0L
+    var partitionsSet: mutable.Set[InternalRow] = mutable.HashSet.empty
     var numFiles: Long = 0L
     var totalNumBytes: Long = 0L
     var totalNumOutput: Long = 0L
 
     val basicStats = stats.map(_.asInstanceOf[BasicWriteTaskStats])
 
     basicStats.foreach { summary =>
-      numPartitions += summary.numPartitions
+      partitionsSet ++= summary.partitions

Review comment:
       ditto, `partitionsSet.addAll(summary.partitions)`
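
The driver-side change in this hunk can be illustrated with a minimal sketch. It assumes plain `String` values as a stand-in for `InternalRow`, and `DedupSketch` and `TaskStats` are hypothetical names that are not part of the PR. It uses `++=`, which is available on mutable sets in both Scala 2.12 and 2.13.

```scala
import scala.collection.mutable

object DedupSketch {
  // Hypothetical stand-in for the per-task stats reported to the driver.
  final case class TaskStats(partitions: Seq[String])

  // Old approach: summing per-task counts double-counts a partition
  // that was written by more than one task.
  def sumOfCounts(stats: Seq[TaskStats]): Long =
    stats.map(_.partitions.size.toLong).sum

  // New approach: collect the partition values into a set on the driver,
  // so each distinct partition is counted once.
  def distinctCount(stats: Seq[TaskStats]): Long = {
    val partitionsSet = mutable.HashSet.empty[String]
    stats.foreach(s => partitionsSet ++= s.partitions)
    partitionsSet.size.toLong
  }

  def main(args: Array[String]): Unit = {
    val stats = Seq(
      TaskStats(Seq("part=0", "part=1")),
      TaskStats(Seq("part=1", "part=2")))
    println(s"sum of counts: ${sumOfCounts(stats)}, distinct: ${distinctCount(stats)}")
  }
}
```

With two tasks that both write `part=1`, the summed count reports one partition more than actually exists, which is the kind of overcount the PR description attributes to SPARK-32978.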





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710096679


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34504/
   



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710088262


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34503/
   



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506111562



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
 
 
   override def newPartition(partitionValues: InternalRow): Unit = {
-    numPartitions += 1
+    partitions = partitions :+ partitionValues

Review comment:
       This looks like appending to an immutable collection. Can we be more explicit, e.g. `partitions.append(partitionValues)`?
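
The distinction raised in this comment can be sketched as follows. This is an illustration only: it assumes plain `String` values as a stand-in for `InternalRow`, and `AppendSketch` and its methods are hypothetical names, not code from the PR.

```scala
import scala.collection.mutable

object AppendSketch {
  // Immutable style: `:+` builds a new Vector on every call, so the
  // variable must be a `var` that is reassigned each time.
  def collectImmutable(values: Seq[String]): Vector[String] = {
    var partitions = Vector.empty[String]
    values.foreach(v => partitions = partitions :+ v)
    partitions
  }

  // Mutable style: the buffer is updated in place, and `append`
  // makes the in-place mutation explicit to the reader.
  def collectMutable(values: Seq[String]): List[String] = {
    val partitions = mutable.ArrayBuffer.empty[String]
    values.foreach(v => partitions.append(v))
    partitions.toList
  }

  def main(args: Array[String]): Unit = {
    val values = Seq("part=0", "part=1")
    println(collectImmutable(values))
    println(collectMutable(values))
  }
}
```

Both variants produce the same elements in the same order. The difference is readability: `partitions = partitions :+ x` looks like a reassignment of an immutable value, whereas `append` signals that a mutable collection is being modified.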





[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707621997


   **[Test build #129741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129741/testReport)** for PR 30026 at commit [`0769ac1`](https://github.com/apache/spark/commit/0769ac1fbe3b379fc6482e88095e57a895b149f2).



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710303567







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711604228


   **[Test build #129971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129971/testReport)** for PR 30026 at commit [`d630653`](https://github.com/apache/spark/commit/d6306530ff512678bdf5e683417f70250c07fca5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709393180







[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709393180







[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710096702







[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507352925



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+  test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+    try {
+      val partitions = "50"
+      spark = SparkSession.builder().master("local[4]").getOrCreate()
+      val statusStore = spark.sharedState.statusStore
+      val oldExecutionsSize = statusStore.executionsList().size
+
+      spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+        "using parquet partitioned by (part)").collect()
+      spark.sql("insert overwrite table dynamic_partition partition(part) " +
+        s"select id, id % $partitions as part from range(10000)").collect()
+
+      // Wait for listener to finish computing the metrics for the executions.
+      while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+        statusStore.executionsList().last.metricValues == null) {
+        Thread.sleep(100)
+      }
+
+      // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,

Review comment:
       > why there are 4? is it because of collect?
   
   Yes. Without `.collect` there would be 2.
   
   > BTW can we call val oldExecutionsSize = statusStore.executionsList().size after create table? then we just need to wait for one SQLExecutionUIData.
   
   @cloud-fan 15c7519 addresses this.
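
The wait loop discussed in this comment polls until the listener has recorded the new execution. A Spark-free sketch of that polling pattern, with an added timeout so a missing event cannot hang the test, might look like the following. `WaitSketch`, `waitUntil` and the parameter names are hypothetical, and the timeout is an assumption that is not in the PR's test.

```scala
object WaitSketch {
  // Re-evaluates `condition` every `intervalMs` milliseconds until it holds
  // or `timeoutMs` has elapsed. Returns the last observed value of `condition`.
  def waitUntil(timeoutMs: Long, intervalMs: Long = 100L)(condition: => Boolean): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var ok = condition
    while (!ok && System.currentTimeMillis() < deadline) {
      Thread.sleep(intervalMs)
      ok = condition
    }
    ok
  }

  def main(args: Array[String]): Unit = {
    var polls = 0
    // The condition becomes true on the third evaluation.
    val reached = waitUntil(timeoutMs = 1000L, intervalMs = 1L) { polls += 1; polls >= 3 }
    println(s"reached=$reached after $polls polls")
  }
}
```

In the test under review, the condition would be the check that exactly one new entry has appeared in `statusStore.executionsList()` since the baseline size was taken after `create table`.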





[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714290148


   **[Test build #130137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130137/testReport)** for PR 30026 at commit [`fa01951`](https://github.com/apache/spark/commit/fa019515b88c0571394a4207226f33631aa04855).



[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277


   Addressed in 724eee6: added a simple microbenchmark.
   
   **With this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark
     Running case: one partition column, 10 partitions
     Stopped after 2 iterations, 2134 ms
     Running case: one partition column, 50 partitions
     Stopped after 2 iterations, 7206 ms
     Running case: one partition column, 100 partitions
     Stopped after 2 iterations, 9105 ms
     Running case: one partition column, 200 partitions
     Stopped after 2 iterations, 14778 ms
     Running case: one partition column, 500 partitions
     Stopped after 2 iterations, 42992 ms
     Running case: two partition columns, 10 partitions
     Stopped after 2 iterations, 2331 ms
     Running case: two partition columns, 50 partitions
     Stopped after 2 iterations, 6768 ms
     Running case: two partition columns, 100 partitions
     Stopped after 2 iterations, 9274 ms
     Running case: two partition columns, 200 partitions
     Stopped after 2 iterations, 17487 ms
     Running case: two partition columns, 500 partitions
     Stopped after 2 iterations, 54044 ms
     Running case: three partition columns, 10 partitions
     Stopped after 2 iterations, 2368 ms
     Running case: three partition columns, 50 partitions
     Stopped after 2 iterations, 5538 ms
     Running case: three partition columns, 100 partitions
     Stopped after 2 iterations, 11687 ms
     Running case: three partition columns, 200 partitions
     Stopped after 2 iterations, 22371 ms
     Running case: three partition columns, 500 partitions
     Stopped after 2 iterations, 55828 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   one partition column, 10 partitions                 922           1067         206          0.0       92182.2       1.0X
   one partition column, 50 partitions                3265           3603         478          0.0      326535.4       0.3X
   one partition column, 100 partitions               4390           4553         230          0.0      438988.3       0.2X
   one partition column, 200 partitions               6585           7389        1137          0.0      658477.7       0.1X
   one partition column, 500 partitions              20220          21496        1805          0.0     2022011.5       0.0X
   two partition columns, 10 partitions               1114           1166          72          0.0      111432.2       0.8X
   two partition columns, 50 partitions               2467           3384        1297          0.0      246670.3       0.4X
   two partition columns, 100 partitions              4559           4637         110          0.0      455904.3       0.2X
   two partition columns, 200 partitions              8631           8744         159          0.0      863130.8       0.1X
   two partition columns, 500 partitions             23806          27022        1498          0.0     2380574.6       0.0X
   three partition columns, 10 partitions             1096           1184         125          0.0      109639.4       0.8X
   three partition columns, 50 partitions             2694           2769         107          0.0      269364.4       0.3X
   three partition columns, 100 partitions            5701           5844         202          0.0      570137.3       0.2X
   three partition columns, 200 partitions           11105          11186         115          0.0     1110452.3       0.1X
   three partition columns, 500 partitions           26978          27914        1324          0.0     2697786.6       0.0X
   ```
   
   **Without this PR**, the result is:
   
   ```
   Running benchmark: dynamic insert table benchmark
     Running case: one partition column, 10 partitions
     Stopped after 3 iterations, 2610 ms
     Running case: one partition column, 50 partitions
     Stopped after 2 iterations, 5651 ms
     Running case: one partition column, 100 partitions
     Stopped after 2 iterations, 8813 ms
     Running case: one partition column, 200 partitions
     Stopped after 2 iterations, 16323 ms
     Running case: one partition column, 500 partitions
     Stopped after 2 iterations, 38269 ms
     Running case: two partition columns, 10 partitions
     Stopped after 3 iterations, 2730 ms
     Running case: two partition columns, 50 partitions
     Stopped after 2 iterations, 5574 ms
     Running case: two partition columns, 100 partitions
     Stopped after 2 iterations, 15787 ms
     Running case: two partition columns, 200 partitions
     Stopped after 2 iterations, 18852 ms
     Running case: two partition columns, 500 partitions
     Stopped after 2 iterations, 52470 ms
     Running case: three partition columns, 10 partitions
     Stopped after 3 iterations, 2366 ms
     Running case: three partition columns, 50 partitions
     Stopped after 2 iterations, 8141 ms
     Running case: three partition columns, 100 partitions
     Stopped after 2 iterations, 12490 ms
     Running case: three partition columns, 200 partitions
     Stopped after 2 iterations, 26581 ms
     Running case: three partition columns, 500 partitions
     Stopped after 2 iterations, 64463 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   dynamic insert table benchmark:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   one partition column, 10 partitions                 789            870          72          0.0       78864.1       1.0X
   one partition column, 50 partitions                2697           2826         182          0.0      269734.5       0.3X
   one partition column, 100 partitions               4254           4407         216          0.0      425382.7       0.2X
   one partition column, 200 partitions               8057           8162         148          0.0      805674.5       0.1X
   one partition column, 500 partitions              18896          19135         338          0.0     1889591.7       0.0X
   two partition columns, 10 partitions                754            910         241          0.0       75358.7       1.0X
   two partition columns, 50 partitions               2701           2787         122          0.0      270120.7       0.3X
   two partition columns, 100 partitions              7341           7894         782          0.0      734065.0       0.1X
   two partition columns, 200 partitions              9404           9426          32          0.0      940371.7       0.1X
   two partition columns, 500 partitions             23720          26235         NaN          0.0     2371963.0       0.0X
   three partition columns, 10 partitions              751            789          38          0.0       75076.4       1.1X
   three partition columns, 50 partitions             3802           4071         380          0.0      380180.7       0.2X
   three partition columns, 100 partitions            6072           6245         245          0.0      607224.0       0.1X
   three partition columns, 200 partitions           12874          13291         590          0.0     1287360.6       0.1X
   three partition columns, 500 partitions           31451          32232        1104          0.0     3145143.9       0.0X
   ```
   
   cc @cloud-fan There seems to be no essential difference.
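   
   To make the two tables easier to compare, the average times can be put side by side as ratios. The sketch below copies the `Avg Time(ms)` column from both tables; `BenchmarkComparison` and its member names are hypothetical. With only 2 to 3 iterations per case and standard deviations that exceed a second in some cases, the ratios are indicative only.
   
   ```scala
   object BenchmarkComparison {
     // (case name, avg ms with this PR, avg ms without this PR), copied from the tables above.
     val avgTimes: Seq[(String, Double, Double)] = Seq(
       ("one partition column, 10 partitions", 1067, 870),
       ("one partition column, 50 partitions", 3603, 2826),
       ("one partition column, 100 partitions", 4553, 4407),
       ("one partition column, 200 partitions", 7389, 8162),
       ("one partition column, 500 partitions", 21496, 19135),
       ("two partition columns, 10 partitions", 1166, 910),
       ("two partition columns, 50 partitions", 3384, 2787),
       ("two partition columns, 100 partitions", 4637, 7894),
       ("two partition columns, 200 partitions", 8744, 9426),
       ("two partition columns, 500 partitions", 27022, 26235),
       ("three partition columns, 10 partitions", 1184, 789),
       ("three partition columns, 50 partitions", 2769, 4071),
       ("three partition columns, 100 partitions", 5844, 6245),
       ("three partition columns, 200 partitions", 11186, 13291),
       ("three partition columns, 500 partitions", 27914, 32232)
     )

     // A ratio above 1.0 means the run with the PR was slower on average.
     def ratios: Seq[(String, Double)] =
       avgTimes.map { case (name, withPr, withoutPr) => (name, withPr / withoutPr) }

     def main(args: Array[String]): Unit =
       ratios.foreach { case (name, r) => println(f"$name%-40s $r%.2f") }
   }
   ```
   
   On these numbers, 8 of the 15 cases are slower on average with the PR and 7 are faster, with no consistent direction as the partition count grows.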



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507361926



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+  test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+    try {
+      val partitions = "50"
+      spark = SparkSession.builder().master("local[4]").getOrCreate()
+      val statusStore = spark.sharedState.statusStore
+      val oldExecutionsSize = statusStore.executionsList().size
+
+      spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+        "using parquet partitioned by (part)").collect()
+      spark.sql("insert overwrite table dynamic_partition partition(part) " +
+        s"select id, id % $partitions as part from range(10000)").collect()

Review comment:
       No other problems. Addressed in 15c7519.





[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507304846



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+    data.write.saveAsTable(sourceTable)
+    data.count()
+  }
+
+  def writeOnePartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part) " +

Review comment:
       done

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+    data.write.saveAsTable(sourceTable)
+    data.count()
+  }
+
+  def writeOnePartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part) " +
+        "select id, part1 as part from sourceTable")
+    }
+  }
+
+  def writeTwoPartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part1, part2) " +

Review comment:
       done

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+    data.write.saveAsTable(sourceTable)
+    data.count()
+  }
+
+  def writeOnePartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part) " +
+        "select id, part1 as part from sourceTable")
+    }
+  }
+
+  def writeTwoPartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part1, part2) " +
+        "select id, part1, part2 from sourceTable")
+    }
+  }
+
+  def writeThreePartitionColumnTable(tableName: String,
+      partitionNumber: Long, benchmark: Benchmark): Unit = {
+    spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet partitioned by (part1, part2, part3)")
+    benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        s"$tableName partition(part1, part2, part3) " +

Review comment:
       done





[GitHub] [spark] wangyum commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

wangyum commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707620097


   cc @cloud-fan 



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714335785







[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r503813333



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+  test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+    try {
+      val partitions = "50"
+      spark = SparkSession.builder().master("local[4]").getOrCreate()
+      val statusStore = spark.sharedState.statusStore
+      val oldExecutionsSize = statusStore.executionsList().size
+
+      spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+        "using parquet partitioned by (part)").collect()
+      spark.sql("insert overwrite table dynamic_partition partition(part) " +
+        s"select id, id % $partitions as part from range(10000)").collect()
+
+      // Wait for listener to finish computing the metrics for the executions.
+      while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+        statusStore.executionsList().last.metricValues == null) {
+        Thread.sleep(100)
+      }
+
+      // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is the one we need,
+      // but the executionId is nondeterministic in the Maven test,
+      // so the `statusStore.execution(executionId)` API is not used.
+      assert(statusStore.executionsCount() == 4)
+      val executionData = statusStore.executionsList()(2)
+      val accumulatorIdOpt =
+        executionData.metrics.find(_.name == "number of dynamic part").map(_.accumulatorId)
+      assert(accumulatorIdOpt.isDefined)
+      val numPartsOpt = executionData.metricValues.get(accumulatorIdOpt.get)
+      assert(numPartsOpt.isDefined && numPartsOpt.get == partitions)

Review comment:
       Without the change to `BasicWriteStatsTracker.scala`, numParts would be 200.
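   
   The figure of 200 is consistent with the test setup, assuming `range(10000)` is split into 4 tasks under `local[4]` and each task therefore sees all 50 values of `id % 50`: summing the per-task counts gives 4 × 50 = 200, while taking the union of the partition values on the driver gives 50. A minimal sketch of that arithmetic in plain Scala, with no Spark involved (`PartitionCountSketch` and its members are hypothetical names, and the even 4-way split is an assumption):
   
   ```scala
   object PartitionCountSketch {
     val numPartitions = 50
     val numTasks = 4

     // Assumed split: 10000 ids divided into 4 contiguous slices of 2500.
     val taskSlices: Seq[Range] =
       (0 until numTasks).map(t => (t * 2500) until ((t + 1) * 2500))

     // Partition values each task writes: id % 50 for every id in its slice.
     val perTaskPartitions: Seq[Set[Int]] =
       taskSlices.map(slice => slice.map(id => id % numPartitions).toSet)

     // Before the change, as described in the PR: tasks report counts, the driver sums them.
     val summedCounts: Int = perTaskPartitions.map(_.size).sum

     // After the change, as described in the PR: tasks report values, the driver deduplicates.
     val distinctCount: Int = perTaskPartitions.reduce(_ union _).size
   }
   ```
   
   Here `summedCounts` is 200 and `distinctCount` is 50, which matches the value the test asserts.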





[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709234603


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34434/
   Test FAILed.



[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-712648787


   cc @cloud-fan Are there any other problems that need to be fixed?



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709449144







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709225205


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34436/
   



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709932719







[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709899037


   **[Test build #129885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129885/testReport)** for PR 30026 at commit [`5769222`](https://github.com/apache/spark/commit/57692222be0ec45243c9fc574dd6ff06c87f9024).



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709363125







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710086101


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34504/
   



[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707652650


   @AngersZhuuuu That's what I'm worried about :( 



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714577119







[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707663518







[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711715975







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709923031


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34490/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714451946







[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707776725


   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709408879


   **[Test build #129843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129843/testReport)** for PR 30026 at commit [`9edf8ad`](https://github.com/apache/spark/commit/9edf8ad99b32ea4b198d13ab197330ff875ba70c).



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707776740


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129741/
   Test PASSed.



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506448618



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+
+    data.createOrReplaceTempView("tmpTable")

Review comment:
       Done.





[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710060401


   **[Test build #129898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129898/testReport)** for PR 30026 at commit [`6d80788`](https://github.com/apache/spark/commit/6d8078814886e94bd54d0058174a47e5ff607723).



[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709049996


   **[Test build #129828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129828/testReport)** for PR 30026 at commit [`724eee6`](https://github.com/apache/spark/commit/724eee6acfd3754ab14d83d132df492700988cfc).



[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709078492


   **[Test build #129831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129831/testReport)** for PR 30026 at commit [`6127fa5`](https://github.com/apache/spark/commit/6127fa56abf72ba5112092b9000200934e199b59).



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506147287



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
 
 
   override def newPartition(partitionValues: InternalRow): Unit = {
-    numPartitions += 1
+    partitions = partitions :+ partitionValues

Review comment:
       `partitions.appended(partitionValues)` requires Scala 2.13.
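
The idea behind this change can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `PartitionValues`, `TaskPartitionCollector`, and `DynamicPartitionCount` are hypothetical names, and a plain case class stands in for Spark's `InternalRow`. Each task collects the partition values it sees using `:+`, which is available on `Seq` in both Scala 2.12 and 2.13, and the driver counts distinct values across all task reports.

```scala
// Hypothetical stand-in for Spark's InternalRow: a partition is identified by its values.
final case class PartitionValues(values: Seq[Any])

// Task side: collect the partition values instead of incrementing a counter.
final class TaskPartitionCollector {
  private var partitions: Seq[PartitionValues] = Seq.empty

  def newPartition(p: PartitionValues): Unit = {
    // `:+` works on Scala 2.12 and 2.13; `appended` exists only from 2.13 on.
    partitions = partitions :+ p
  }

  // What the task would report back to the driver.
  def reported: Seq[PartitionValues] = partitions
}

object DynamicPartitionCount {
  // Driver side: merge the per-task reports and count each partition once.
  def distinctPartitions(reports: Seq[Seq[PartitionValues]]): Int =
    reports.flatten.distinct.size
}
```

With this shape, two tasks that both write to the same partition contribute one partition to the metric rather than two.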





[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710056295


   Addressed in 73e2ea6: reorganized the benchmark file.



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711488346


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34581/
   



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714492579







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709361686


   **[Test build #129828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129828/testReport)** for PR 30026 at commit [`724eee6`](https://github.com/apache/spark/commit/724eee6acfd3754ab14d83d132df492700988cfc).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714451946







[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711472789


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34579/
   Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709555923







[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707776725


   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710300555


   **[Test build #129898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129898/testReport)** for PR 30026 at commit [`6d80788`](https://github.com/apache/spark/commit/6d8078814886e94bd54d0058174a47e5ff607723).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509932012



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()

Review comment:
       Addressed in fa01951, which fixes this.





[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506294887



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ *   2. build/sbt "sql/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *      Results will be written to
+ *      "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+  override def getSparkSession: SparkSession = {
+    SparkSession.builder().master("local[4]").getOrCreate()
+  }
+
+  def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+      part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+    val dataFrame = spark.range(0, numberRows, 1, 4)
+    val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+    val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+    val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+    val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+      .toDF("id", "part1", "part2", "part3")
+
+    data.createOrReplaceTempView("tmpTable")
+
+    spark.sql("create table " +
+      "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet")
+
+    spark.sql("insert overwrite table sourceTable " +
+      s"select id, " +
+      s"part1, " +
+      s"part2, " +
+      s"part3 " +
+      s"from tmpTable")
+
+    spark.catalog.dropTempView("tmpTable")
+    data.count()
+  }
+
+  def prepareTable(): Unit = {
+    spark.sql("create table " +
+      "tableOnePartitionColumn(i bigint, part bigint) " +
+      "using parquet partitioned by (part)")
+    spark.sql("create table " +
+      "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+      "using parquet partitioned by (part1, part2)")
+    spark.sql("create table " +
+      "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+      "using parquet partitioned by (part1, part2, part3)")
+  }
+
+  def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+    benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+      spark.sql("insert overwrite table " +
+        "tableOnePartitionColumn partition(part) " +

Review comment:
       Can we pass the table name as a parameter? That is more robust, and we don't need to worry about a table name mismatch.
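
A minimal sketch of what passing the table name as a parameter could look like. The helper `insertOverwriteSql`, its parameter names, and the select list used below are hypothetical (the select clause of the original statement is not visible in this hunk). The helper only builds the SQL text, which a benchmark case would then hand to `spark.sql`.

```scala
// Hypothetical helper, not part of the PR: builds the INSERT statement for
// any target table so the table name is written in exactly one place.
object InsertSqlSketch {
  def insertOverwriteSql(
      tableName: String,
      partitionColumns: Seq[String],
      selectExpr: String,
      sourceTable: String = "sourceTable"): String = {
    // Dynamic partition insert: partition columns are listed without values.
    s"insert overwrite table $tableName " +
      s"partition(${partitionColumns.mkString(", ")}) " +
      s"select $selectExpr from $sourceTable"
  }
}
```

With such a helper, the one-, two-, and three-column cases would differ only in their arguments, so a table name typed in one case cannot silently drift from the name used in `prepareTable()`.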





[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r505137273



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
 import org.apache.spark.util.SerializableConfiguration
 
 
+
 /**
  * Simple metrics collected during an instance of [[FileFormatDataWriter]].
  * These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
  */
 case class BasicWriteTaskStats(
-    numPartitions: Int,
+    partitions: Seq[InternalRow],

Review comment:
       OK. I'm busy with my own work at the moment and will give feedback later.





[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711456094


   **[Test build #129971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129971/testReport)** for PR 30026 at commit [`d630653`](https://github.com/apache/spark/commit/d6306530ff512678bdf5e683417f70250c07fca5).



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707775497


   **[Test build #129741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129741/testReport)** for PR 30026 at commit [`0769ac1`](https://github.com/apache/spark/commit/0769ac1fbe3b379fc6482e88095e57a895b149f2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711465962


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34579/
   



[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-712648787


   @cloud-fan Are there any other problems that need to be fixed?



[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r503945310



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
 import org.apache.spark.util.SerializableConfiguration
 
 
+
 /**
  * Simple metrics collected during an instance of [[FileFormatDataWriter]].
  * These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
  */
 case class BasicWriteTaskStats(
-    numPartitions: Int,
+    partitions: Seq[InternalRow],

Review comment:
       This increases the data size we need to transfer between executors and the driver. Do we have a microbenchmark to verify the impact?
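
To make the trade-off concrete, here is a minimal sketch that runs without Spark. `PartitionValues` is a stand-in for `InternalRow`, and `TaskStats`, `sumOfCounts`, and `distinctCount` are hypothetical simplifications of the stats classes in this diff, not the actual implementation. It shows why summing per-task counts over-reports when two tasks write to the same partition, and why counting distinct values requires shipping the values themselves to the driver.

```scala
// Simplified model of the metric, assuming partition values compare by value.
object PartitionMetricSketch {
  // Stand-in for Spark's InternalRow: one value per partition column.
  type PartitionValues = Seq[Any]

  // Hypothetical simplified task stats that carry the partition values.
  final case class TaskStats(partitions: Seq[PartitionValues])

  // Count-based reporting: each task contributes a number and the driver
  // adds them up, so a partition written by two tasks is counted twice.
  def sumOfCounts(stats: Seq[TaskStats]): Long =
    stats.map(_.partitions.size.toLong).sum

  // Value-based reporting: the driver unions the values and counts the
  // distinct ones, at the cost of transferring every value.
  def distinctCount(stats: Seq[TaskStats]): Long =
    stats.flatMap(_.partitions).toSet.size.toLong
}
```

In this model the extra data sent per task grows with the number of partitions that task touched, which is the cost a microbenchmark would need to quantify.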





[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r504061074



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
 import org.apache.spark.util.SerializableConfiguration
 
 
+
 /**
  * Simple metrics collected during an instance of [[FileFormatDataWriter]].
  * These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
  */
 case class BasicWriteTaskStats(
-    numPartitions: Int,
+    partitions: Seq[InternalRow],

Review comment:
       The end-to-end performance of an INSERT query into a partitioned table with 1, 2, or 3 partition columns.





[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710096702







[GitHub] [spark] AngersZhuuuu commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709071052


   The size of the returned data is partition num * shuffle num, which can easily reach the millions.
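
The arithmetic behind this estimate can be sketched as follows. This assumes "shuffle num" means the number of write tasks and that, in the worst case, every task writes to every dynamic partition. The function name and the sample figures are hypothetical, not measurements from this PR.

```scala
// Worst-case estimate of how many partition-value rows reach the driver.
object ReturnSizeSketch {
  // Assumption: every write task touches every dynamic partition, so each
  // task reports one row of partition values per partition.
  def worstCaseReportedRows(numPartitions: Long, numTasks: Long): Long =
    numPartitions * numTasks
}
```

For example, 1,000 dynamic partitions written by 2,000 tasks would mean 2,000,000 reported rows under these assumptions.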



[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711606360







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709237332


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34436/
   



[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506148242



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -139,20 +142,22 @@ class BasicWriteJobStatsTracker(
 
   override def processStats(stats: Seq[WriteTaskStats]): Unit = {
     val sparkContext = SparkContext.getActive.get
-    var numPartitions: Long = 0L
+    var partitionsSet: mutable.Set[InternalRow] = mutable.HashSet.empty
     var numFiles: Long = 0L
     var totalNumBytes: Long = 0L
     var totalNumOutput: Long = 0L
 
     val basicStats = stats.map(_.asInstanceOf[BasicWriteTaskStats])
 
     basicStats.foreach { summary =>
-      numPartitions += summary.numPartitions
+      partitionsSet ++= summary.partitions

Review comment:
       Ditto: `partitionsSet.addAll(summary.partitions)` is also available only in Scala 2.13.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
 
 
   override def newPartition(partitionValues: InternalRow): Unit = {
-    numPartitions += 1
+    partitions = partitions :+ partitionValues

Review comment:
       `partitions.appended(partitionValues)` is available only in Scala 2.13.
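
A small sketch of the cross-version forms discussed in these two comments. It uses `String` in place of `InternalRow` so that it runs without Spark, and `collect` is a hypothetical function name. The operators `:+` and `++=` compile on both Scala 2.12 and 2.13, whereas their named equivalents `appended` and `addAll` exist only in the 2.13 collections.

```scala
import scala.collection.mutable

object CrossVersionCollectionsSketch {
  // Gathers values task by task using only operators that exist in both
  // Scala 2.12 and Scala 2.13.
  def collect(perTask: Seq[Seq[String]]): (Seq[String], mutable.Set[String]) = {
    var partitions: Seq[String] = Seq.empty
    val partitionsSet: mutable.Set[String] = mutable.HashSet.empty[String]
    perTask.foreach { values =>
      // `:+` instead of the 2.13-only `appended`.
      values.foreach(v => partitions = partitions :+ v)
      // `++=` instead of the 2.13-only `addAll`.
      partitionsSet ++= values
    }
    (partitions, partitionsSet)
  }
}
```

The sequence keeps duplicates in arrival order, while the set keeps each value once, which mirrors the task-side and driver-side roles in the diff.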





[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707663518







[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711606360







[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709391509


   **[Test build #129831 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129831/testReport)** for PR 30026 at commit [`6127fa5`](https://github.com/apache/spark/commit/6127fa56abf72ba5112092b9000200934e199b59).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

