Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/13 09:32:21 UTC
[GitHub] [spark] LuciferYang opened a new pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
LuciferYang opened a new pull request #30026:
URL: https://github.com/apache/spark/pull/30026
### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32978.
The root cause of the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` directly reports the number of newly added partitions for each task, which makes it impossible to remove duplicates on the driver side.
The main change in this PR is to report `partitionValues` to the driver and remove duplicates on the driver side, so that the number of dynamic part metric is correct.
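As a rough illustration of the problem and the fix (a minimal sketch only, not the actual Spark code): `TaskStats`, `sumOfTaskCounts` and `distinctPartitions` are hypothetical names, and partition values are modelled as plain strings rather than `InternalRow`. When two tasks write to the same partition, summing per-task counts counts that partition twice, while collecting the values and de-duplicating on the driver does not.

```scala
// Hypothetical stand-in for the per-task statistics; not Spark's real class.
// Each task reports the partition values it wrote, modelled here as strings.
final case class TaskStats(partitions: Seq[String])

object PartitionMetricSketch {
  // Behaviour described as the bug: every task reports only a count,
  // so the driver can do nothing but add the counts together.
  def sumOfTaskCounts(tasks: Seq[TaskStats]): Int =
    tasks.map(_.partitions.size).sum

  // Behaviour described as the fix: tasks report the partition values,
  // and the driver removes duplicates before counting.
  def distinctPartitions(tasks: Seq[TaskStats]): Int =
    tasks.flatMap(_.partitions).distinct.size

  def main(args: Array[String]): Unit = {
    // Two tasks that both wrote to partition "part=1".
    val tasks = Seq(
      TaskStats(Seq("part=0", "part=1")),
      TaskStats(Seq("part=1", "part=2")))
    println(sumOfTaskCounts(tasks))    // prints 4 (part=1 counted twice)
    println(distinctPartitions(tasks)) // prints 3
  }
}
```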
### Why are the changes needed?
The number of dynamic part metric displayed on the UI should be correct.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test case based on the scenario described in SPARK-32978.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709234551
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714492579
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711456094
**[Test build #129971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129971/testReport)** for PR 30026 at commit [`d630653`](https://github.com/apache/spark/commit/d6306530ff512678bdf5e683417f70250c07fca5).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709234551
Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709408879
**[Test build #129843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129843/testReport)** for PR 30026 at commit [`9edf8ad`](https://github.com/apache/spark/commit/9edf8ad99b32ea4b198d13ab197330ff875ba70c).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709363138
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129828/
Test FAILed.
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506446756
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
override def newPartition(partitionValues: InternalRow): Unit = {
- numPartitions += 1
+ partitions = partitions :+ partitionValues
Review comment:
Addressed in 6d80788: changed to use an `ArrayBuffer` to store `partitions`, and this line is now `partitions.appended(partitionValues)`.
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507352925
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+ test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+ try {
+ val partitions = "50"
+ spark = SparkSession.builder().master("local[4]").getOrCreate()
+ val statusStore = spark.sharedState.statusStore
+ val oldExecutionsSize = statusStore.executionsList().size
+
+ spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+ "using parquet partitioned by (part)").collect()
+ spark.sql("insert overwrite table dynamic_partition partition(part) " +
+ s"select id, id % $partitions as part from range(10000)").collect()
+
+ // Wait for listener to finish computing the metrics for the executions.
+ while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+ statusStore.executionsList().last.metricValues == null) {
+ Thread.sleep(100)
+ }
+
+ // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,
Review comment:
> why there are 4? is it because of collect?
Yes. Without `.collect` there would be 2.
> BTW can we call val oldExecutionsSize = statusStore.executionsList().size after create table? then we just need to wait for one SQLExecutionUIData.
Addressed in 15c7519.
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506291180
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
override def newPartition(partitionValues: InternalRow): Unit = {
- numPartitions += 1
+ partitions = partitions :+ partitionValues
Review comment:
Or `add`? Doesn't Scala 2.12 have a Java-like method to add elements?
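For reference, a small sketch of the difference under discussion, using strings as a stand-in for `InternalRow` (the object and method names are illustrative only). On a `scala.collection.mutable.ArrayBuffer`, `+=` and `append` add an element in place, whereas `:+` builds a new collection and leaves the original untouched, which is why the `partitions = partitions :+ partitionValues` form needs a reassignment. To the best of my knowledge `ArrayBuffer` has no Java-style `add` method.

```scala
import scala.collection.mutable.ArrayBuffer

object AppendSketch {
  // Returns (size of the buffer mutated in place, size of the copy made by :+).
  def sizesAfterAppends(): (Int, Int) = {
    val partitions = ArrayBuffer.empty[String]

    // In-place appends: the buffer itself grows, no reassignment needed.
    partitions += "part=0"
    partitions.append("part=1")

    // `:+` returns a new collection; `partitions` is left unchanged.
    val copy = partitions :+ "part=2"

    (partitions.size, copy.size)
  }

  def main(args: Array[String]): Unit = {
    println(sizesAfterAppends()) // prints (2,3)
  }
}
```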
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714575551
**[Test build #130152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130152/testReport)** for PR 30026 at commit [`4f47e20`](https://github.com/apache/spark/commit/4f47e205f2d130cf432b1a6c42589f919f2852e5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710010756
**[Test build #129885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129885/testReport)** for PR 30026 at commit [`5769222`](https://github.com/apache/spark/commit/57692222be0ec45243c9fc574dd6ff06c87f9024).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506487800
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+ data.write.saveAsTable(sourceTable)
+ data.count()
+ }
+
+ def writeOnePartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part) " +
Review comment:
nit: this can be merged into the previous line
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+ data.write.saveAsTable(sourceTable)
+ data.count()
+ }
+
+ def writeOnePartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part) " +
+ "select id, part1 as part from sourceTable")
+ }
+ }
+
+ def writeTwoPartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part1, part2) " +
Review comment:
ditto
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509930936
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
Review comment:
The `withTable` method in `DataSourceWriteBenchmark` is used (line 91) to clean up the created tables after the benchmark.
[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277
Addressed in 724eee6: added a simple microbenchmark.
EDIT: Addressed in 9edf8ad: refactored the microbenchmark to test more dynamic partition counts, run with JVM options `-Xmx4g -Xms4g`:
**With this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark, totalRows = 200000
Running case: one partition column, 100 partitions
Stopped after 2 iterations, 10421 ms
Running case: two partition columns, 500 partitions
Stopped after 2 iterations, 49308 ms
Running case: three partition columns, 2000 partitions
Stopped after 2 iterations, 173533 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark, totalRows = 200000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
one partition column, 100 partitions                          4946           5211         374         0.0       24731.0       1.0X
two partition columns, 500 partitions                        22929          24654        2440         0.0      114645.4       0.2X
three partition columns, 2000 partitions                     82092          86767        2609         0.0      410461.3       0.1X
```
**Without this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark, totalRows = 200000
Running case: one partition column, 100 partitions
Stopped after 2 iterations, 10252 ms
Running case: two partition columns, 500 partitions
Stopped after 2 iterations, 45089 ms
Running case: three partition columns, 2000 partitions
Stopped after 2 iterations, 198925 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark, totalRows = 200000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
one partition column, 100 partitions                          4840           5126         404         0.0       24201.4       1.0X
two partition columns, 500 partitions                        20978          22545        2215         0.0      104892.0       0.2X
three partition columns, 2000 partitions                     86858          99463        2043         0.0      434288.8       0.1X
```
cc @cloud-fan There seems to be no essential difference; it looks better than expected.
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710060401
**[Test build #129898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129898/testReport)** for PR 30026 at commit [`6d80788`](https://github.com/apache/spark/commit/6d8078814886e94bd54d0058174a47e5ff607723).
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714437334
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34759/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711472782
Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709932689
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34490/
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710011549
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r503951004
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.util.SerializableConfiguration
+
/**
* Simple metrics collected during an instance of [[FileFormatDataWriter]].
* These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
*/
case class BasicWriteTaskStats(
- numPartitions: Int,
+ partitions: Seq[InternalRow],
Review comment:
A benchmark for latency or memory usage? I can try it ~
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709449144
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710303576
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129898/
Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707655719
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34347/
[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707652650
@AngersZhuuuu That's what I'm worried about :( but I haven't thought of a better way for the moment.
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506489291
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+ test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+ try {
+ val partitions = "50"
+ spark = SparkSession.builder().master("local[4]").getOrCreate()
+ val statusStore = spark.sharedState.statusStore
+ val oldExecutionsSize = statusStore.executionsList().size
+
+ spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+ "using parquet partitioned by (part)").collect()
+ spark.sql("insert overwrite table dynamic_partition partition(part) " +
+ s"select id, id % $partitions as part from range(10000)").collect()
Review comment:
nit: we don't need to call `.collect` to trigger DDL/DML commands. `sql(...)` already does the job.
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714290148
**[Test build #130137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130137/testReport)** for PR 30026 at commit [`fa01951`](https://github.com/apache/spark/commit/fa019515b88c0571394a4207226f33631aa04855).
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709217970
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34434/
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506151340
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -139,20 +142,22 @@ class BasicWriteJobStatsTracker(
override def processStats(stats: Seq[WriteTaskStats]): Unit = {
val sparkContext = SparkContext.getActive.get
- var numPartitions: Long = 0L
+ var partitionsSet: mutable.Set[InternalRow] = mutable.HashSet.empty
var numFiles: Long = 0L
var totalNumBytes: Long = 0L
var totalNumOutput: Long = 0L
val basicStats = stats.map(_.asInstanceOf[BasicWriteTaskStats])
basicStats.foreach { summary =>
- numPartitions += summary.numPartitions
+ partitionsSet ++= summary.partitions
numFiles += summary.numFiles
totalNumBytes += summary.numBytes
totalNumOutput += summary.numRows
}
+ val numPartitions: Long = partitionsSet.size
Review comment:
5769222 addresses this.
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506112571
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -139,20 +142,22 @@ class BasicWriteJobStatsTracker(
override def processStats(stats: Seq[WriteTaskStats]): Unit = {
val sparkContext = SparkContext.getActive.get
- var numPartitions: Long = 0L
+ var partitionsSet: mutable.Set[InternalRow] = mutable.HashSet.empty
var numFiles: Long = 0L
var totalNumBytes: Long = 0L
var totalNumOutput: Long = 0L
val basicStats = stats.map(_.asInstanceOf[BasicWriteTaskStats])
basicStats.foreach { summary =>
- numPartitions += summary.numPartitions
+ partitionsSet ++= summary.partitions
numFiles += summary.numFiles
totalNumBytes += summary.numBytes
totalNumOutput += summary.numRows
}
+ val numPartitions: Long = partitionsSet.size
Review comment:
nit: it's only used once, we can inline it.
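To make the effect of the diff above concrete, here is a minimal sketch of the two counting strategies. It assumes partition values can be modelled as plain strings (the diff uses `Seq[InternalRow]`), and `TaskStats`, `summedCount` and `distinctCount` are hypothetical names used only for illustration, not Spark APIs.

```scala
// Sketch only: partition values are modelled as plain strings here,
// whereas the diff above uses Seq[InternalRow].
object PartitionMetricSketch {
  // Hypothetical stand-in for BasicWriteTaskStats, reduced to the partition field.
  final case class TaskStats(partitions: Seq[String])

  // Old behaviour: add up per-task counts, so a partition written by
  // several tasks is counted once per task.
  def summedCount(stats: Seq[TaskStats]): Long =
    stats.map(_.partitions.size.toLong).sum

  // New behaviour: union the partition values on the driver, then count.
  def distinctCount(stats: Seq[TaskStats]): Long =
    stats.flatMap(_.partitions).toSet.size.toLong

  def main(args: Array[String]): Unit = {
    // Two tasks that both write part=1.
    val stats = Seq(
      TaskStats(Seq("part=0", "part=1")),
      TaskStats(Seq("part=1", "part=2")))
    println(s"summed=${summedCount(stats)} distinct=${distinctCount(stats)}")
  }
}
```

With the two tasks above, the summed count is 4 while the distinct count is 3, which is the kind of over-count that SPARK-32978 describes.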
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711715975
[GitHub] [spark] cloud-fan commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714515351
thanks, merging to master!
[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277
724eee6 adds a simple microbenchmark.
**With this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark
Running case: insert table with 10000 rows, one partition column, 10 partitions
Stopped after 2 iterations, 2134 ms
Running case: insert table with 10000 rows, one partition column, 50 partitions
Stopped after 2 iterations, 7206 ms
Running case: insert table with 10000 rows, one partition column, 100 partitions
Stopped after 2 iterations, 9105 ms
Running case: insert table with 10000 rows, one partition column, 200 partitions
Stopped after 2 iterations, 14778 ms
Running case: insert table with 10000 rows, one partition column, 500 partitions
Stopped after 2 iterations, 42992 ms
Running case: insert table with 10000 rows, two partition columns, 10 partitions
Stopped after 2 iterations, 2331 ms
Running case: insert table with 10000 rows, two partition columns, 50 partitions
Stopped after 2 iterations, 6768 ms
Running case: insert table with 10000 rows, two partition columns, 100 partitions
Stopped after 2 iterations, 9274 ms
Running case: insert table with 10000 rows, two partition columns, 200 partitions
Stopped after 2 iterations, 17487 ms
Running case: insert table with 10000 rows, two partition columns, 500 partitions
Stopped after 2 iterations, 54044 ms
Running case: insert table with 10000 rows, three partition columns, 10 partitions
Stopped after 2 iterations, 2368 ms
Running case: insert table with 10000 rows, three partition columns, 50 partitions
Stopped after 2 iterations, 5538 ms
Running case: insert table with 10000 rows, three partition columns, 100 partitions
Stopped after 2 iterations, 11687 ms
Running case: insert table with 10000 rows, three partition columns, 200 partitions
Stopped after 2 iterations, 22371 ms
Running case: insert table with 10000 rows, three partition columns, 500 partitions
Stopped after 2 iterations, 55828 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark:                                        Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------------------
insert table with 10000 rows, one partition column, 10 partitions                922           1067         206         0.0       92182.2       1.0X
insert table with 10000 rows, one partition column, 50 partitions               3265           3603         478         0.0      326535.4       0.3X
insert table with 10000 rows, one partition column, 100 partitions              4390           4553         230         0.0      438988.3       0.2X
insert table with 10000 rows, one partition column, 200 partitions              6585           7389        1137         0.0      658477.7       0.1X
insert table with 10000 rows, one partition column, 500 partitions             20220          21496        1805         0.0     2022011.5       0.0X
insert table with 10000 rows, two partition columns, 10 partitions              1114           1166          72         0.0      111432.2       0.8X
insert table with 10000 rows, two partition columns, 50 partitions              2467           3384        1297         0.0      246670.3       0.4X
insert table with 10000 rows, two partition columns, 100 partitions             4559           4637         110         0.0      455904.3       0.2X
insert table with 10000 rows, two partition columns, 200 partitions             8631           8744         159         0.0      863130.8       0.1X
insert table with 10000 rows, two partition columns, 500 partitions            23806          27022        1498         0.0     2380574.6       0.0X
insert table with 10000 rows, three partition columns, 10 partitions            1096           1184         125         0.0      109639.4       0.8X
insert table with 10000 rows, three partition columns, 50 partitions            2694           2769         107         0.0      269364.4       0.3X
insert table with 10000 rows, three partition columns, 100 partitions           5701           5844         202         0.0      570137.3       0.2X
insert table with 10000 rows, three partition columns, 200 partitions          11105          11186         115         0.0     1110452.3       0.1X
insert table with 10000 rows, three partition columns, 500 partitions          26978          27914        1324         0.0     2697786.6       0.0X
```
**Without this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark
Running case: insert table with 10000 rows, one partition column, 10 partitions
Stopped after 3 iterations, 2356 ms
Running case: insert table with 10000 rows, one partition column, 50 partitions
Stopped after 2 iterations, 6328 ms
Running case: insert table with 10000 rows, one partition column, 100 partitions
Stopped after 2 iterations, 8942 ms
Running case: insert table with 10000 rows, one partition column, 200 partitions
Stopped after 2 iterations, 17401 ms
Running case: insert table with 10000 rows, one partition column, 500 partitions
Stopped after 2 iterations, 43009 ms
Running case: insert table with 10000 rows, two partition columns, 10 partitions
Stopped after 3 iterations, 2024 ms
Running case: insert table with 10000 rows, two partition columns, 50 partitions
Stopped after 2 iterations, 4862 ms
Running case: insert table with 10000 rows, two partition columns, 100 partitions
Stopped after 2 iterations, 11229 ms
Running case: insert table with 10000 rows, two partition columns, 200 partitions
Stopped after 2 iterations, 18244 ms
Running case: insert table with 10000 rows, two partition columns, 500 partitions
Stopped after 2 iterations, 54922 ms
Running case: insert table with 10000 rows, three partition columns, 10 partitions
Stopped after 3 iterations, 2173 ms
Running case: insert table with 10000 rows, three partition columns, 50 partitions
Stopped after 2 iterations, 5660 ms
Running case: insert table with 10000 rows, three partition columns, 100 partitions
Stopped after 2 iterations, 14925 ms
Running case: insert table with 10000 rows, three partition columns, 200 partitions
Stopped after 2 iterations, 28378 ms
Running case: insert table with 10000 rows, three partition columns, 500 partitions
Stopped after 2 iterations, 59941 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark:                                        Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------------------
insert table with 10000 rows, one partition column, 10 partitions                715            785          98         0.0       71450.9       1.0X
insert table with 10000 rows, one partition column, 50 partitions               3132           3164          46         0.0      313201.2       0.2X
insert table with 10000 rows, one partition column, 100 partitions              4375           4471         136         0.0      437546.1       0.2X
insert table with 10000 rows, one partition column, 200 partitions              7968           8701        1035         0.0      796846.9       0.1X
insert table with 10000 rows, one partition column, 500 partitions             19208          21505         NaN         0.0     1920778.4       0.0X
insert table with 10000 rows, two partition columns, 10 partitions               600            675          66         0.0       60027.5       1.2X
insert table with 10000 rows, two partition columns, 50 partitions              2372           2431          83         0.0      237244.2       0.3X
insert table with 10000 rows, two partition columns, 100 partitions             5385           5615         325         0.0      538471.6       0.1X
insert table with 10000 rows, two partition columns, 200 partitions             8496           9122         885         0.0      849591.8       0.1X
insert table with 10000 rows, two partition columns, 500 partitions            25747          27461        2424         0.0     2574722.7       0.0X
insert table with 10000 rows, three partition columns, 10 partitions             687            725          35         0.0       68748.8       1.0X
insert table with 10000 rows, three partition columns, 50 partitions            2757           2830         104         0.0      275692.0       0.3X
insert table with 10000 rows, three partition columns, 100 partitions           6336           7463        1594         0.0      633568.3       0.1X
insert table with 10000 rows, three partition columns, 200 partitions          14046          14189         202         0.0     1404645.4       0.1X
insert table with 10000 rows, three partition columns, 500 partitions          26749          29971        1520         0.0     2674929.0       0.0X
```
cc @cloud-fan: there seems to be no essential difference.
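As a rough way to read the two tables side by side, the sketch below divides a few "Best Time(ms)" values from the run with the PR by the matching values from the run without it. The object and method names are hypothetical, only four of the fifteen cases are included, and each run stopped after 2 or 3 iterations, so the ratios are noisy and should not be read as a precise measurement.

```scala
object BenchmarkRatioSketch {
  // (case label, best time with the PR in ms, best time without the PR in ms),
  // copied from the "Best Time(ms)" columns above.
  val bestTimes: Seq[(String, Double, Double)] = Seq(
    ("one column, 10 partitions", 922.0, 715.0),
    ("one column, 500 partitions", 20220.0, 19208.0),
    ("three columns, 200 partitions", 11105.0, 14046.0),
    ("three columns, 500 partitions", 26978.0, 26749.0))

  // A ratio above 1 means the run with the PR was slower for that case.
  def ratio(withPr: Double, withoutPr: Double): Double = withPr / withoutPr

  def main(args: Array[String]): Unit =
    bestTimes.foreach { case (label, withPr, withoutPr) =>
      println(f"$label%-32s ${ratio(withPr, withoutPr)}%.2f")
    }
}
```

The ratios fall on both sides of 1 (for example, the three-column, 200-partition case is faster with the PR), which is consistent with run-to-run noise rather than a systematic slowdown.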
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710079144
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34503/
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507311420
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+ test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+ try {
+ val partitions = "50"
+ spark = SparkSession.builder().master("local[4]").getOrCreate()
+ val statusStore = spark.sharedState.statusStore
+ val oldExecutionsSize = statusStore.executionsList().size
+
+ spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+ "using parquet partitioned by (part)").collect()
+ spark.sql("insert overwrite table dynamic_partition partition(part) " +
+ s"select id, id % $partitions as part from range(10000)").collect()
Review comment:
There are some other problems without calling `.collect`; let me re-check this.
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711472782
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509930936
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
Review comment:
`withTable` in `DataSourceWriteBenchmark` is used to clean up the created table resources after the benchmark.
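For readers unfamiliar with the pattern, a cleanup helper of this kind can be sketched in plain Scala as below. This is not the Spark implementation: `dropTable` is a hypothetical stand-in that only records the table name, where a real benchmark would issue a drop statement through its `SparkSession`.

```scala
object WithTableSketch {
  // Hypothetical stand-in for dropping a table; it only records the name.
  val dropped = scala.collection.mutable.ArrayBuffer.empty[String]
  def dropTable(name: String): Unit = dropped += name

  // Run `body`, then clean up the named tables even if `body` throws.
  def withTable(tableNames: String*)(body: => Unit): Unit =
    try body finally tableNames.foreach(dropTable)

  def main(args: Array[String]): Unit = {
    withTable("dynamic_partition") {
      println("running benchmark case")
    }
    println(s"dropped: ${dropped.mkString(", ")}")
  }
}
```

Putting the cleanup in `finally` means a failing benchmark case does not leave its table behind for the next case.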
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707663493
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34347/
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714491110
**[Test build #130137 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130137/testReport)** for PR 30026 at commit [`fa01951`](https://github.com/apache/spark/commit/fa019515b88c0571394a4207226f33631aa04855).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277
Addressed in 724eee6: added a simple microbenchmark.
**With this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark
Running case: insert table with 10000 rows, one partition column, 10 partitions
Stopped after 2 iterations, 2134 ms
Running case: insert table with 10000 rows, one partition column, 50 partitions
Stopped after 2 iterations, 7206 ms
Running case: insert table with 10000 rows, one partition column, 100 partitions
Stopped after 2 iterations, 9105 ms
Running case: insert table with 10000 rows, one partition column, 200 partitions
Stopped after 2 iterations, 14778 ms
Running case: insert table with 10000 rows, one partition column, 500 partitions
Stopped after 2 iterations, 42992 ms
Running case: insert table with 10000 rows, two partition columns, 10 partitions
Stopped after 2 iterations, 2331 ms
Running case: insert table with 10000 rows, two partition columns, 50 partitions
Stopped after 2 iterations, 6768 ms
Running case: insert table with 10000 rows, two partition columns, 100 partitions
Stopped after 2 iterations, 9274 ms
Running case: insert table with 10000 rows, two partition columns, 200 partitions
Stopped after 2 iterations, 17487 ms
Running case: insert table with 10000 rows, two partition columns, 500 partitions
Stopped after 2 iterations, 54044 ms
Running case: insert table with 10000 rows, three partition columns, 10 partitions
Stopped after 2 iterations, 2368 ms
Running case: insert table with 10000 rows, three partition columns, 50 partitions
Stopped after 2 iterations, 5538 ms
Running case: insert table with 10000 rows, three partition columns, 100 partitions
Stopped after 2 iterations, 11687 ms
Running case: insert table with 10000 rows, three partition columns, 200 partitions
Stopped after 2 iterations, 22371 ms
Running case: insert table with 10000 rows, three partition columns, 500 partitions
Stopped after 2 iterations, 55828 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark:                                        Best Time(ms)  Avg Time(ms)   Stdev(ms)   Rate(M/s)  Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------------------
insert table with 10000 rows, one partition column, 10 partitions                922          1067         206         0.0      92182.2       1.0X
insert table with 10000 rows, one partition column, 50 partitions               3265          3603         478         0.0     326535.4       0.3X
insert table with 10000 rows, one partition column, 100 partitions              4390          4553         230         0.0     438988.3       0.2X
insert table with 10000 rows, one partition column, 200 partitions              6585          7389        1137         0.0     658477.7       0.1X
insert table with 10000 rows, one partition column, 500 partitions             20220         21496        1805         0.0    2022011.5       0.0X
insert table with 10000 rows, two partition columns, 10 partitions              1114          1166          72         0.0     111432.2       0.8X
insert table with 10000 rows, two partition columns, 50 partitions              2467          3384        1297         0.0     246670.3       0.4X
insert table with 10000 rows, two partition columns, 100 partitions             4559          4637         110         0.0     455904.3       0.2X
insert table with 10000 rows, two partition columns, 200 partitions             8631          8744         159         0.0     863130.8       0.1X
insert table with 10000 rows, two partition columns, 500 partitions            23806         27022        1498         0.0    2380574.6       0.0X
insert table with 10000 rows, three partition columns, 10 partitions            1096          1184         125         0.0     109639.4       0.8X
insert table with 10000 rows, three partition columns, 50 partitions            2694          2769         107         0.0     269364.4       0.3X
insert table with 10000 rows, three partition columns, 100 partitions           5701          5844         202         0.0     570137.3       0.2X
insert table with 10000 rows, three partition columns, 200 partitions          11105         11186         115         0.0    1110452.3       0.1X
insert table with 10000 rows, three partition columns, 500 partitions          26978         27914        1324         0.0    2697786.6       0.0X
```
**Without this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark
Running case: insert table with 10000 rows, one partition column, 10 partitions
Stopped after 3 iterations, 2356 ms
Running case: insert table with 10000 rows, one partition column, 50 partitions
Stopped after 2 iterations, 6328 ms
Running case: insert table with 10000 rows, one partition column, 100 partitions
Stopped after 2 iterations, 8942 ms
Running case: insert table with 10000 rows, one partition column, 200 partitions
Stopped after 2 iterations, 17401 ms
Running case: insert table with 10000 rows, one partition column, 500 partitions
Stopped after 2 iterations, 43009 ms
Running case: insert table with 10000 rows, two partition columns, 10 partitions
Stopped after 3 iterations, 2024 ms
Running case: insert table with 10000 rows, two partition columns, 50 partitions
Stopped after 2 iterations, 4862 ms
Running case: insert table with 10000 rows, two partition columns, 100 partitions
Stopped after 2 iterations, 11229 ms
Running case: insert table with 10000 rows, two partition columns, 200 partitions
Stopped after 2 iterations, 18244 ms
Running case: insert table with 10000 rows, two partition columns, 500 partitions
Stopped after 2 iterations, 54922 ms
Running case: insert table with 10000 rows, three partition columns, 10 partitions
Stopped after 3 iterations, 2173 ms
Running case: insert table with 10000 rows, three partition columns, 50 partitions
Stopped after 2 iterations, 5660 ms
Running case: insert table with 10000 rows, three partition columns, 100 partitions
Stopped after 2 iterations, 14925 ms
Running case: insert table with 10000 rows, three partition columns, 200 partitions
Stopped after 2 iterations, 28378 ms
Running case: insert table with 10000 rows, three partition columns, 500 partitions
Stopped after 2 iterations, 59941 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark:                                        Best Time(ms)  Avg Time(ms)   Stdev(ms)   Rate(M/s)  Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------------------
insert table with 10000 rows, one partition column, 10 partitions                715           785          98         0.0      71450.9       1.0X
insert table with 10000 rows, one partition column, 50 partitions               3132          3164          46         0.0     313201.2       0.2X
insert table with 10000 rows, one partition column, 100 partitions              4375          4471         136         0.0     437546.1       0.2X
insert table with 10000 rows, one partition column, 200 partitions              7968          8701        1035         0.0     796846.9       0.1X
insert table with 10000 rows, one partition column, 500 partitions             19208         21505         NaN         0.0    1920778.4       0.0X
insert table with 10000 rows, two partition columns, 10 partitions               600           675          66         0.0      60027.5       1.2X
insert table with 10000 rows, two partition columns, 50 partitions              2372          2431          83         0.0     237244.2       0.3X
insert table with 10000 rows, two partition columns, 100 partitions             5385          5615         325         0.0     538471.6       0.1X
insert table with 10000 rows, two partition columns, 200 partitions             8496          9122         885         0.0     849591.8       0.1X
insert table with 10000 rows, two partition columns, 500 partitions            25747         27461        2424         0.0    2574722.7       0.0X
insert table with 10000 rows, three partition columns, 10 partitions             687           725          35         0.0      68748.8       1.0X
insert table with 10000 rows, three partition columns, 50 partitions            2757          2830         104         0.0     275692.0       0.3X
insert table with 10000 rows, three partition columns, 100 partitions           6336          7463        1594         0.0     633568.3       0.1X
insert table with 10000 rows, three partition columns, 200 partitions          14046         14189         202         0.0    1404645.4       0.1X
insert table with 10000 rows, three partition columns, 500 partitions          26749         29971        1520         0.0    2674929.0       0.0X
```
cc @cloud-fan: there seems to be no essential difference between the two runs.
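One way to compare the two runs is to look at the ratio of best times per case. The sketch below does this for the "one partition column" rows only, using the Best Time(ms) values from the tables above. The object and value names are hypothetical, and with only two or three iterations per case the ratios are rough indications, not a statistical result.

```scala
object BenchmarkRatioSketch {
  // Partition counts for the "one partition column" cases.
  val partitions: Seq[Int] = Seq(10, 50, 100, 200, 500)

  // Best Time(ms) copied from the tables above, in the same order.
  val withPr: Seq[Double] = Seq(922.0, 3265.0, 4390.0, 6585.0, 20220.0)
  val withoutPr: Seq[Double] = Seq(715.0, 3132.0, 4375.0, 7968.0, 19208.0)

  // Element-wise ratio a / b; values near 1.0 mean the runs are comparable.
  def ratios(a: Seq[Double], b: Seq[Double]): Seq[Double] =
    a.zip(b).map { case (x, y) => x / y }

  def main(args: Array[String]): Unit =
    partitions.zip(ratios(withPr, withoutPr)).foreach { case (p, r) =>
      println(f"$p%4d partitions: with/without = $r%.2f")
    }
}
```

The ratios fall on both sides of 1.0 depending on the case, which is in line with the reading that the change has no clear effect on write time.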
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711480889
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34581/
[GitHub] [spark] cloud-fan commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709868109
> return size is partition num * shuffle num always can be millions level
I thought about it. If a table has 10k partitions, it's unlikely that each write task touches all the 10k partitions. So the total size is not that large.
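To make the trade-off concrete, here is a minimal sketch of why the driver needs the partition values themselves and not per-task counts. It is plain Scala with hypothetical names (`PartitionMetricSketch`, `summedPerTaskCounts`, `distinctPartitions`), not the actual tracker code: each inner set stands for the partition values one write task touched.

```scala
object PartitionMetricSketch {
  // Old behaviour in spirit: every task reports only how many partitions it
  // wrote, so a partition written by several tasks is counted several times.
  def summedPerTaskCounts(tasks: Seq[Set[String]]): Int =
    tasks.map(_.size).sum

  // Proposed behaviour in spirit: tasks report the partition values and the
  // driver takes the union, so each partition is counted once.
  def distinctPartitions(tasks: Seq[Set[String]]): Int =
    tasks.foldLeft(Set.empty[String])(_ ++ _).size
}
```

The data sent to the driver is bounded by the sum of the per-task set sizes, which stays small as long as each task touches only a fraction of the table's partitions.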
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714335764
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34744/
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709449095
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34449/
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506295439
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
+
+ spark.sql("create table " +
+ "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet")
+
+ spark.sql("insert overwrite table sourceTable " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from tmpTable")
+
+ spark.catalog.dropTempView("tmpTable")
+ data.count()
+ }
+
+ def prepareTable(): Unit = {
+ spark.sql("create table " +
+ "tableOnePartitionColumn(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ spark.sql("create table " +
+ "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ spark.sql("create table " +
+ "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet partitioned by (part1, part2, part3)")
+ }
+
+ def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableOnePartitionColumn partition(part) " +
+ s"select id, " +
+ s"part1 as part " +
+ s"from sourceTable")
+ }
+ }
+
+ def writeTwoPartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableTwoPartitionColumn partition(part1, part2) " +
+ s"select id, " +
+ s"part1, " +
+ s"part2 " +
+ s"from sourceTable")
+ }
+ }
+
+ def writeThreePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableThreePartitionColumn partition(part1, part2, part3) " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from sourceTable")
+ }
+ }
+
+ override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+ val sourceTable = "sourceTable"
+ val tableOnePartitionColumn = "tableOnePartitionColumn"
Review comment:
The name can be simpler: `val onePartTable = ...`
[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707619254
cc @wangyum: the current result should be correct, but this approach will increase memory and network pressure.
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507352925
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+ test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+ try {
+ val partitions = "50"
+ spark = SparkSession.builder().master("local[4]").getOrCreate()
+ val statusStore = spark.sharedState.statusStore
+ val oldExecutionsSize = statusStore.executionsList().size
+
+ spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+ "using parquet partitioned by (part)").collect()
+ spark.sql("insert overwrite table dynamic_partition partition(part) " +
+ s"select id, id % $partitions as part from range(10000)").collect()
+
+ // Wait for listener to finish computing the metrics for the executions.
+ while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+ statusStore.executionsList().last.metricValues == null) {
+ Thread.sleep(100)
+ }
+
+ // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,
Review comment:
Addressed in 15c7519.
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711488360
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509898355
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
Review comment:
We need to run it and commit the benchmark result to the codebase.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709932719
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506491449
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+ test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+ try {
+ val partitions = "50"
+ spark = SparkSession.builder().master("local[4]").getOrCreate()
+ val statusStore = spark.sharedState.statusStore
+ val oldExecutionsSize = statusStore.executionsList().size
+
+ spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+ "using parquet partitioned by (part)").collect()
+ spark.sql("insert overwrite table dynamic_partition partition(part) " +
+ s"select id, id % $partitions as part from range(10000)").collect()
+
+ // Wait for listener to finish computing the metrics for the executions.
+ while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+ statusStore.executionsList().last.metricValues == null) {
+ Thread.sleep(100)
+ }
+
+ // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,
Review comment:
Why are there 4? Is it because of `collect`?
BTW, can we call `val oldExecutionsSize = statusStore.executionsList().size` after creating the table? Then we only need to wait for one `SQLExecutionUIData`.
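The unbounded `while` loop in the test could also be given a timeout so a missing execution fails the test instead of hanging it. Below is a minimal, Spark-free sketch of such a polling helper; `WaitSketch` and `waitUntil` are hypothetical names, and in Spark's own suites ScalaTest's `eventually` is the more usual choice.

```scala
object WaitSketch {
  // Polls `condition` every `intervalMs` until it holds or `timeoutMs` elapses.
  // Returns true if the condition became true in time, false otherwise.
  def waitUntil(timeoutMs: Long, intervalMs: Long = 100L)(condition: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var ok = condition
    while (!ok && System.nanoTime() < deadline) {
      Thread.sleep(intervalMs)
      ok = condition
    }
    ok
  }
}
```

With `oldExecutionsSize` captured after the table is created, the condition would check that exactly one new execution has appeared and that its `metricValues` is non-null.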
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710088277
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509899679
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
Review comment:
Do we have to override it? The default single-threaded Spark session makes the benchmark results easier to reason about.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709438264
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34449/
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714530492
Thanks, @cloud-fan @wangyum @AngersZhuuuu
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709555923
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710011549
----------------------------------------------------------------
[GitHub] [spark] cloud-fan closed pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #30026:
URL: https://github.com/apache/spark/pull/30026
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711713292
**[Test build #129973 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129973/testReport)** for PR 30026 at commit [`15c7519`](https://github.com/apache/spark/commit/15c751961e43ea744f01f9f3264e487cb0254c36).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710303567
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506293143
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
Review comment:
BTW is it the same as `data.write.saveAsTable("sourceTable")`?
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714451930
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34759/
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506110093
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.util.SerializableConfiguration
+
Review comment:
Unnecessary change.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714409433
**[Test build #130152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130152/testReport)** for PR 30026 at commit [`4f47e20`](https://github.com/apache/spark/commit/4f47e205f2d130cf432b1a6c42589f919f2852e5).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709049996
**[Test build #129828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129828/testReport)** for PR 30026 at commit [`724eee6`](https://github.com/apache/spark/commit/724eee6acfd3754ab14d83d132df492700988cfc).
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509899465
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
Review comment:
It seems we don't use the utility methods in `DataSourceWriteBenchmark`. I think we can extend `SqlBasedBenchmark` directly.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714577119
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506292321
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
Review comment:
`tmpTable` -> `tmpView`
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509930936
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
Review comment:
`withTable` from `DataSourceWriteBenchmark` is used (line 91) to clean up the created tables after the benchmark.
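For readers unfamiliar with the idiom, a `withTable`-style helper is an instance of the loan pattern: it runs a block of code and then drops the named tables, whether or not the block threw. The following is only a minimal sketch in plain Scala, not Spark's implementation. `LoanPatternSketch`, `catalog`, and `withTables` are hypothetical names, and an in-memory set stands in for the table catalog.

```scala
import scala.collection.mutable

// Hypothetical illustration only; none of these names come from Spark.
object LoanPatternSketch {
  // In-memory stand-in for a table catalog.
  val catalog: mutable.Set[String] = mutable.Set.empty[String]

  // Runs `body`, then removes the named tables even if `body` throws.
  def withTables[T](names: String*)(body: => T): T = {
    try {
      body
    } finally {
      names.foreach(name => catalog.remove(name))
    }
  }

  def main(args: Array[String]): Unit = {
    val seenInside = withTables("sourceTable", "targetTable") {
      catalog += "sourceTable"
      catalog += "targetTable"
      catalog.size
    }
    println(s"tables visible inside the block: $seenInside")
    println(s"tables left after the block: ${catalog.size}")
  }
}
```

Because the cleanup sits in `finally`, a benchmark case that fails midway still leaves no tables behind for the next case.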
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709899037
**[Test build #129885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129885/testReport)** for PR 30026 at commit [`5769222`](https://github.com/apache/spark/commit/57692222be0ec45243c9fc574dd6ff06c87f9024).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709237402
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709234528
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34434/
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711472769
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34579/
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711488360
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714318647
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34744/
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707621997
**[Test build #129741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129741/testReport)** for PR 30026 at commit [`0769ac1`](https://github.com/apache/spark/commit/0769ac1fbe3b379fc6482e88095e57a895b149f2).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709363125
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506448136
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
+
+ spark.sql("create table " +
+ "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet")
+
+ spark.sql("insert overwrite table sourceTable " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from tmpTable")
+
+ spark.catalog.dropTempView("tmpTable")
+ data.count()
+ }
+
+ def prepareTable(): Unit = {
+ spark.sql("create table " +
+ "tableOnePartitionColumn(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ spark.sql("create table " +
+ "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ spark.sql("create table " +
+ "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet partitioned by (part1, part2, part3)")
+ }
+
+ def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableOnePartitionColumn partition(part) " +
Review comment:
done~
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
+
+ spark.sql("create table " +
+ "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet")
+
+ spark.sql("insert overwrite table sourceTable " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from tmpTable")
+
+ spark.catalog.dropTempView("tmpTable")
+ data.count()
+ }
+
+ def prepareTable(): Unit = {
+ spark.sql("create table " +
+ "tableOnePartitionColumn(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ spark.sql("create table " +
+ "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ spark.sql("create table " +
+ "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet partitioned by (part1, part2, part3)")
+ }
+
+ def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableOnePartitionColumn partition(part) " +
+ s"select id, " +
+ s"part1 as part " +
+ s"from sourceTable")
+ }
+ }
+
+ def writeTwoPartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableTwoPartitionColumn partition(part1, part2) " +
+ s"select id, " +
+ s"part1, " +
+ s"part2 " +
+ s"from sourceTable")
+ }
+ }
+
+ def writeThreePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableThreePartitionColumn partition(part1, part2, part3) " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from sourceTable")
+ }
+ }
+
+ override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+ val sourceTable = "sourceTable"
+ val tableOnePartitionColumn = "tableOnePartitionColumn"
Review comment:
done ~
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506295439
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
+
+ spark.sql("create table " +
+ "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet")
+
+ spark.sql("insert overwrite table sourceTable " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from tmpTable")
+
+ spark.catalog.dropTempView("tmpTable")
+ data.count()
+ }
+
+ def prepareTable(): Unit = {
+ spark.sql("create table " +
+ "tableOnePartitionColumn(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ spark.sql("create table " +
+ "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ spark.sql("create table " +
+ "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet partitioned by (part1, part2, part3)")
+ }
+
+ def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableOnePartitionColumn partition(part) " +
+ s"select id, " +
+ s"part1 as part " +
+ s"from sourceTable")
+ }
+ }
+
+ def writeTwoPartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableTwoPartitionColumn partition(part1, part2) " +
+ s"select id, " +
+ s"part1, " +
+ s"part2 " +
+ s"from sourceTable")
+ }
+ }
+
+ def writeThreePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableThreePartitionColumn partition(part1, part2, part3) " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from sourceTable")
+ }
+ }
+
+ override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+ val sourceTable = "sourceTable"
+ val tableOnePartitionColumn = "tableOnePartitionColumn"
Review comment:
It can be simpler: `val onePartColTable = ...`
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710088277
[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277
Addressed in 724eee6: added a simple microbenchmark.
EDIT: Addressed in 9edf8ad: refactored the microbenchmark to test larger numbers of dynamic partitions.
**With this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark, totalRows = 200000
Running case: one partition column, 100 partitions
Stopped after 2 iterations, 10421 ms
Running case: two partition columns, 500 partitions
Stopped after 2 iterations, 49308 ms
Running case: three partition columns, 2000 partitions
Stopped after 2 iterations, 173533 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark, totalRows = 200000:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------------------------------------
one partition column, 100 partitions                          4946          5211        374        0.0      24731.0      1.0X
two partition columns, 500 partitions                        22929         24654       2440        0.0     114645.4      0.2X
three partition columns, 2000 partitions                     82092         86767       2609        0.0     410461.3      0.1X
```
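For readers comparing the tables: the derived columns appear to follow from `Best Time(ms)` and `totalRows`. Below is a minimal sketch, assuming the usual convention of Spark's `Benchmark` utility (per-row time is the best time divided by the row count, rate is its reciprocal in millions of rows per second, and `Relative` is the first case's best time divided by this case's). The object and method names are hypothetical and not part of this PR.

```scala
// Hypothetical helper, not part of the PR: recomputes the derived columns
// of the result table from Best Time(ms) and the total row count.
object BenchmarkColumns {

  // Per Row(ns): best time converted from ms to ns, divided by the row count.
  def perRowNs(bestMs: Double, totalRows: Long): Double =
    bestMs * 1e6 / totalRows

  // Rate(M/s): rows per microsecond, i.e. millions of rows per second.
  def rateMPerSec(bestMs: Double, totalRows: Long): Double =
    totalRows / (bestMs * 1e3)

  // Relative: best time of the first (baseline) case over this case's best time.
  def relative(baselineBestMs: Double, bestMs: Double): Double =
    baselineBestMs / bestMs

  def main(args: Array[String]): Unit = {
    val totalRows = 200000L
    // Best Time(ms) values copied from the table above.
    val bestTimes = Seq(4946.0, 22929.0, 82092.0)
    bestTimes.foreach { best =>
      val perRow = perRowNs(best, totalRows)
      val rate = rateMPerSec(best, totalRows)
      val rel = relative(bestTimes.head, best)
      println(f"best=$best%.0f ms  perRow=$perRow%.1f ns  rate=$rate%.1f M/s  relative=$rel%.1fX")
    }
  }
}
```

With `totalRows = 200000`, a best time of 4946 ms works out to about 24730 ns per row, which matches the reported 24731.0 up to rounding of the best time. The rate is about 0.04 M rows/s, which is why the `Rate(M/s)` column prints as 0.0.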
**Without this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark, totalRows = 200000
Running case: one partition column, 100 partitions
Stopped after 2 iterations, 10252 ms
Running case: two partition columns, 500 partitions
Stopped after 2 iterations, 45089 ms
Running case: three partition columns, 2000 partitions
Stopped after 2 iterations, 198925 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark, totalRows = 200000:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------------------------------------
one partition column, 100 partitions                          4840          5126        404        0.0      24201.4      1.0X
two partition columns, 500 partitions                        20978         22545       2215        0.0     104892.0      0.2X
three partition columns, 2000 partitions                     86858         99463       2043        0.0     434288.8      0.1X
```
cc @cloud-fan: there seems to be no essential difference.
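To put a number on that comparison, the relative change in `Best Time(ms)` between the two runs can be computed per case. This is a minimal sketch using only the values from the two result tables in this comment; the object, case class, and method names are hypothetical and not part of this PR.

```scala
// Hypothetical helper, not part of the PR: compares Best Time(ms) of the
// run with the PR against the run without it.
object CompareBestTimes {
  final case class Case(name: String, withPrMs: Double, withoutPrMs: Double)

  // Best Time(ms) per case, copied from the two result tables in this comment.
  val cases: Seq[Case] = Seq(
    Case("one partition column, 100 partitions", 4946, 4840),
    Case("two partition columns, 500 partitions", 22929, 20978),
    Case("three partition columns, 2000 partitions", 82092, 86858)
  )

  // Signed change of the "with PR" run relative to the "without PR" run, in percent.
  // Positive means the run with the PR was slower.
  def relativeChangePct(c: Case): Double =
    (c.withPrMs - c.withoutPrMs) / c.withoutPrMs * 100.0

  def main(args: Array[String]): Unit =
    cases.foreach { c =>
      println(f"${c.name}%-45s ${relativeChangePct(c)}%+.1f%%")
    }
}
```

The changes come out at roughly +2%, +9%, and -5%. The sign is not consistent across cases and each case stopped after only 2 iterations, so the numbers do not point to a systematic slowdown.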
[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277
Addressed in 724eee6: added a simple microbenchmark.
**With this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark
Running case: one partition column, 10 partitions
Stopped after 2 iterations, 2134 ms
Running case: one partition column, 50 partitions
Stopped after 2 iterations, 7206 ms
Running case: one partition column, 100 partitions
Stopped after 2 iterations, 9105 ms
Running case: one partition column, 200 partitions
Stopped after 2 iterations, 14778 ms
Running case: one partition column, 500 partitions
Stopped after 2 iterations, 42992 ms
Running case: two partition columns, 10 partitions
Stopped after 2 iterations, 2331 ms
Running case: two partition columns, 50 partitions
Stopped after 2 iterations, 6768 ms
Running case: two partition columns, 100 partitions
Stopped after 2 iterations, 9274 ms
Running case: two partition columns, 200 partitions
Stopped after 2 iterations, 17487 ms
Running case: two partition columns, 500 partitions
Stopped after 2 iterations, 54044 ms
Running case: three partition columns, 10 partitions
Stopped after 2 iterations, 2368 ms
Running case: three partition columns, 50 partitions
Stopped after 2 iterations, 5538 ms
Running case: three partition columns, 100 partitions
Stopped after 2 iterations, 11687 ms
Running case: three partition columns, 200 partitions
Stopped after 2 iterations, 22371 ms
Running case: three partition columns, 500 partitions
Stopped after 2 iterations, 55828 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark:          Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------------------------
one partition column, 10 partitions                922          1067        206        0.0      92182.2      1.0X
one partition column, 50 partitions               3265          3603        478        0.0     326535.4      0.3X
one partition column, 100 partitions              4390          4553        230        0.0     438988.3      0.2X
one partition column, 200 partitions              6585          7389       1137        0.0     658477.7      0.1X
one partition column, 500 partitions             20220         21496       1805        0.0    2022011.5      0.0X
two partition columns, 10 partitions              1114          1166         72        0.0     111432.2      0.8X
two partition columns, 50 partitions              2467          3384       1297        0.0     246670.3      0.4X
two partition columns, 100 partitions             4559          4637        110        0.0     455904.3      0.2X
two partition columns, 200 partitions             8631          8744        159        0.0     863130.8      0.1X
two partition columns, 500 partitions            23806         27022       1498        0.0    2380574.6      0.0X
three partition columns, 10 partitions            1096          1184        125        0.0     109639.4      0.8X
three partition columns, 50 partitions            2694          2769        107        0.0     269364.4      0.3X
three partition columns, 100 partitions           5701          5844        202        0.0     570137.3      0.2X
three partition columns, 200 partitions          11105         11186        115        0.0    1110452.3      0.1X
three partition columns, 500 partitions          26978         27914       1324        0.0    2697786.6      0.0X
```
**Without this PR**, the result is:
```
Running benchmark: dynamic insert table benchmark
Running case: one partition column, 10 partitions
Stopped after 3 iterations, 2356 ms
Running case: one partition column, 50 partitions
Stopped after 2 iterations, 6328 ms
Running case: one partition column, 100 partitions
Stopped after 2 iterations, 8942 ms
Running case: one partition column, 200 partitions
Stopped after 2 iterations, 17401 ms
Running case: one partition column, 500 partitions
Stopped after 2 iterations, 43009 ms
Running case: two partition columns, 10 partitions
Stopped after 3 iterations, 2024 ms
Running case: two partition columns, 50 partitions
Stopped after 2 iterations, 4862 ms
Running case: two partition columns, 100 partitions
Stopped after 2 iterations, 11229 ms
Running case: two partition columns, 200 partitions
Stopped after 2 iterations, 18244 ms
Running case: two partition columns, 500 partitions
Stopped after 2 iterations, 54922 ms
Running case: three partition columns, 10 partitions
Stopped after 3 iterations, 2173 ms
Running case: three partition columns, 50 partitions
Stopped after 2 iterations, 5660 ms
Running case: three partition columns, 100 partitions
Stopped after 2 iterations, 14925 ms
Running case: three partition columns, 200 partitions
Stopped after 2 iterations, 28378 ms
Running case: three partition columns, 500 partitions
Stopped after 2 iterations, 59941 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark:          Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------------------------
one partition column, 10 partitions                715           785         98        0.0      71450.9      1.0X
one partition column, 50 partitions               3132          3164         46        0.0     313201.2      0.2X
one partition column, 100 partitions              4375          4471        136        0.0     437546.1      0.2X
one partition column, 200 partitions              7968          8701       1035        0.0     796846.9      0.1X
one partition column, 500 partitions             19208         21505        NaN        0.0    1920778.4      0.0X
two partition columns, 10 partitions               600           675         66        0.0      60027.5      1.2X
two partition columns, 50 partitions              2372          2431         83        0.0     237244.2      0.3X
two partition columns, 100 partitions             5385          5615        325        0.0     538471.6      0.1X
two partition columns, 200 partitions             8496          9122        885        0.0     849591.8      0.1X
two partition columns, 500 partitions            25747         27461       2424        0.0    2574722.7      0.0X
three partition columns, 10 partitions             687           725         35        0.0      68748.8      1.0X
three partition columns, 50 partitions            2757          2830        104        0.0     275692.0      0.3X
three partition columns, 100 partitions           6336          7463       1594        0.0     633568.3      0.1X
three partition columns, 200 partitions          14046         14189        202        0.0    1404645.4      0.1X
three partition columns, 500 partitions          26749         29971       1520        0.0    2674929.0      0.0X
```
cc @cloud-fan: there seems to be no essential difference.
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r510065629
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
Review comment:
Added 4f47e20 to upload the benchmark result file.
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714409433
**[Test build #130152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130152/testReport)** for PR 30026 at commit [`4f47e20`](https://github.com/apache/spark/commit/4f47e205f2d130cf432b1a6c42589f919f2852e5).
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506488151
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+ data.write.saveAsTable(sourceTable)
+ data.count()
+ }
+
+ def writeOnePartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part) " +
+ "select id, part1 as part from sourceTable")
+ }
+ }
+
+ def writeTwoPartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part1, part2) " +
+ "select id, part1, part2 from sourceTable")
+ }
+ }
+
+ def writeThreePartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet partitioned by (part1, part2, part3)")
+ benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part1, part2, part3) " +
Review comment:
ditto
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714335785
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709554675
**[Test build #129843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129843/testReport)** for PR 30026 at commit [`9edf8ad`](https://github.com/apache/spark/commit/9edf8ad99b32ea4b198d13ab197330ff875ba70c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711469815
**[Test build #129973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129973/testReport)** for PR 30026 at commit [`15c7519`](https://github.com/apache/spark/commit/15c751961e43ea744f01f9f3264e487cb0254c36).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709237402
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711469815
**[Test build #129973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129973/testReport)** for PR 30026 at commit [`15c7519`](https://github.com/apache/spark/commit/15c751961e43ea744f01f9f3264e487cb0254c36).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709078492
**[Test build #129831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129831/testReport)** for PR 30026 at commit [`6127fa5`](https://github.com/apache/spark/commit/6127fa56abf72ba5112092b9000200934e199b59).
----------------------------------------------------------------
[GitHub] [spark] AngersZhuuuu commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707640761
I considered this approach too, but if the number of partitions is very large, returning the partition values to the driver will take up too much memory.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707776740
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129741/
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506112213
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -139,20 +142,22 @@ class BasicWriteJobStatsTracker(
override def processStats(stats: Seq[WriteTaskStats]): Unit = {
val sparkContext = SparkContext.getActive.get
- var numPartitions: Long = 0L
+ var partitionsSet: mutable.Set[InternalRow] = mutable.HashSet.empty
var numFiles: Long = 0L
var totalNumBytes: Long = 0L
var totalNumOutput: Long = 0L
val basicStats = stats.map(_.asInstanceOf[BasicWriteTaskStats])
basicStats.foreach { summary =>
- numPartitions += summary.numPartitions
+ partitionsSet ++= summary.partitions
Review comment:
ditto, `partitionsSet.addAll(summary.partitions)`
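For readers following the thread, here is a minimal sketch of why the driver has to union partition values rather than sum per-task counts. It is not the PR's code: `TaskStatsSketch` and `PartitionMetricSketch` are hypothetical names, and `Seq[Any]` stands in for `InternalRow` so the example runs without Spark.

```scala
import scala.collection.mutable

// Hypothetical stand-in for BasicWriteTaskStats: only the partition values matter here.
// The real stats carry InternalRow values; Seq[Any] is an assumption made for this sketch.
final case class TaskStatsSketch(partitions: Seq[Seq[Any]])

object PartitionMetricSketch {
  // Old approach: every task reports a count and the driver adds the counts up.
  // A partition written by two tasks is counted twice.
  def sumOfPerTaskCounts(stats: Seq[TaskStatsSketch]): Long =
    stats.map(_.partitions.size.toLong).sum

  // Approach discussed in this PR: every task reports its partition values
  // and the driver keeps them in a set, so duplicates collapse.
  def distinctPartitions(stats: Seq[TaskStatsSketch]): Long = {
    val partitionsSet = mutable.HashSet.empty[Seq[Any]]
    stats.foreach(summary => partitionsSet ++= summary.partitions)
    partitionsSet.size.toLong
  }
}
```

With two tasks that write partitions (1, 2) and (1, 3), the summed count is 4 while the set holds 3 distinct partitions, which is the discrepancy SPARK-32978 describes.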
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710096679
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34504/
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710088262
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34503/
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506111562
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
override def newPartition(partitionValues: InternalRow): Unit = {
- numPartitions += 1
+ partitions = partitions :+ partitionValues
Review comment:
This looks like appending to an immutable collection. Can we be more explicit? e.g. `partitions.append(partitionValues)`
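The two styles contrasted in this comment can be sketched as follows. This is only an illustration with `Int` elements and a hypothetical `AppendSketch` object; it makes no claim about the actual type of `partitions` in the PR.

```scala
import scala.collection.mutable

object AppendSketch {
  // Style that reads like the diff: a var that is reassigned to a new
  // sequence on every append, which is what `:+` does on an immutable Seq.
  def collectImmutable(values: Seq[Int]): Seq[Int] = {
    var partitions: Seq[Int] = Seq.empty
    values.foreach(v => partitions = partitions :+ v)
    partitions
  }

  // Style suggested in the review: a val holding a mutable buffer that is
  // appended to in place, so the mutation is explicit at the call site.
  def collectMutable(values: Seq[Int]): Seq[Int] = {
    val partitions = mutable.ArrayBuffer.empty[Int]
    values.foreach(v => partitions.append(v))
    partitions.toSeq
  }
}
```

Both functions return the same elements in the same order; the difference is that the second form makes it obvious that a mutable collection is being updated.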
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707621997
**[Test build #129741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129741/testReport)** for PR 30026 at commit [`0769ac1`](https://github.com/apache/spark/commit/0769ac1fbe3b379fc6482e88095e57a895b149f2).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710303567
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711604228
**[Test build #129971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129971/testReport)** for PR 30026 at commit [`d630653`](https://github.com/apache/spark/commit/d6306530ff512678bdf5e683417f70250c07fca5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709393180
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709393180
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710096702
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507352925
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+ test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+ try {
+ val partitions = "50"
+ spark = SparkSession.builder().master("local[4]").getOrCreate()
+ val statusStore = spark.sharedState.statusStore
+ val oldExecutionsSize = statusStore.executionsList().size
+
+ spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+ "using parquet partitioned by (part)").collect()
+ spark.sql("insert overwrite table dynamic_partition partition(part) " +
+ s"select id, id % $partitions as part from range(10000)").collect()
+
+ // Wait for listener to finish computing the metrics for the executions.
+ while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+ statusStore.executionsList().last.metricValues == null) {
+ Thread.sleep(100)
+ }
+
+ // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,
Review comment:
> why there are 4? is it because of collect?
Yes. Without `.collect` there would be 2.
> BTW can we call val oldExecutionsSize = statusStore.executionsList().size after create table? then we just need to wait for one SQLExecutionUIData.
@cloud-fan Addressed: 15c7519 fixes this.
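As a side note, the wait loop quoted above polls without an upper bound. A bounded variant might look like the sketch below; `WaitSketch.waitUntil` is a hypothetical helper written for this illustration, not something from the PR or from Spark.

```scala
object WaitSketch {
  // Polls `condition` every `intervalMs` milliseconds until it holds or
  // `timeoutMs` milliseconds have elapsed. Returns true if the condition
  // held before the deadline, false otherwise.
  def waitUntil(timeoutMs: Long, intervalMs: Long = 100L)(condition: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var satisfied = condition
    while (!satisfied && System.nanoTime() - deadline < 0) {
      Thread.sleep(intervalMs)
      satisfied = condition
    }
    satisfied
  }
}
```

Following the suggestion quoted above, the baseline size would be read after the `create table` statement, and the condition passed to the helper would check that exactly one more execution has appeared in the status store and that its metric values are populated.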
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714290148
**[Test build #130137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130137/testReport)** for PR 30026 at commit [`fa01951`](https://github.com/apache/spark/commit/fa019515b88c0571394a4207226f33631aa04855).
----------------------------------------------------------------
[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709061277
Commit 724eee6 adds a simple microbenchmark.
**With this PR** the result is:
```
Running benchmark: dynamic insert table benchmark
Running case: one partition column, 10 partitions
Stopped after 2 iterations, 2134 ms
Running case: one partition column, 50 partitions
Stopped after 2 iterations, 7206 ms
Running case: one partition column, 100 partitions
Stopped after 2 iterations, 9105 ms
Running case: one partition column, 200 partitions
Stopped after 2 iterations, 14778 ms
Running case: one partition column, 500 partitions
Stopped after 2 iterations, 42992 ms
Running case: two partition columns, 10 partitions
Stopped after 2 iterations, 2331 ms
Running case: two partition columns, 50 partitions
Stopped after 2 iterations, 6768 ms
Running case: two partition columns, 100 partitions
Stopped after 2 iterations, 9274 ms
Running case: two partition columns, 200 partitions
Stopped after 2 iterations, 17487 ms
Running case: two partition columns, 500 partitions
Stopped after 2 iterations, 54044 ms
Running case: three partition columns, 10 partitions
Stopped after 2 iterations, 2368 ms
Running case: three partition columns, 50 partitions
Stopped after 2 iterations, 5538 ms
Running case: three partition columns, 100 partitions
Stopped after 2 iterations, 11687 ms
Running case: three partition columns, 200 partitions
Stopped after 2 iterations, 22371 ms
Running case: three partition columns, 500 partitions
Stopped after 2 iterations, 55828 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark:           Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------
one partition column, 10 partitions                 922           1067         206         0.0       92182.2       1.0X
one partition column, 50 partitions                3265           3603         478         0.0      326535.4       0.3X
one partition column, 100 partitions               4390           4553         230         0.0      438988.3       0.2X
one partition column, 200 partitions               6585           7389        1137         0.0      658477.7       0.1X
one partition column, 500 partitions              20220          21496        1805         0.0     2022011.5       0.0X
two partition columns, 10 partitions               1114           1166          72         0.0      111432.2       0.8X
two partition columns, 50 partitions               2467           3384        1297         0.0      246670.3       0.4X
two partition columns, 100 partitions              4559           4637         110         0.0      455904.3       0.2X
two partition columns, 200 partitions              8631           8744         159         0.0      863130.8       0.1X
two partition columns, 500 partitions             23806          27022        1498         0.0     2380574.6       0.0X
three partition columns, 10 partitions             1096           1184         125         0.0      109639.4       0.8X
three partition columns, 50 partitions             2694           2769         107         0.0      269364.4       0.3X
three partition columns, 100 partitions            5701           5844         202         0.0      570137.3       0.2X
three partition columns, 200 partitions           11105          11186         115         0.0     1110452.3       0.1X
three partition columns, 500 partitions           26978          27914        1324         0.0     2697786.6       0.0X
```
**Without this PR** the result is:
```
Running benchmark: dynamic insert table benchmark
Running case: one partition column, 10 partitions
Stopped after 3 iterations, 2610 ms
Running case: one partition column, 50 partitions
Stopped after 2 iterations, 5651 ms
Running case: one partition column, 100 partitions
Stopped after 2 iterations, 8813 ms
Running case: one partition column, 200 partitions
Stopped after 2 iterations, 16323 ms
Running case: one partition column, 500 partitions
Stopped after 2 iterations, 38269 ms
Running case: two partition columns, 10 partitions
Stopped after 3 iterations, 2730 ms
Running case: two partition columns, 50 partitions
Stopped after 2 iterations, 5574 ms
Running case: two partition columns, 100 partitions
Stopped after 2 iterations, 15787 ms
Running case: two partition columns, 200 partitions
Stopped after 2 iterations, 18852 ms
Running case: two partition columns, 500 partitions
Stopped after 2 iterations, 52470 ms
Running case: three partition columns, 10 partitions
Stopped after 3 iterations, 2366 ms
Running case: three partition columns, 50 partitions
Stopped after 2 iterations, 8141 ms
Running case: three partition columns, 100 partitions
Stopped after 2 iterations, 12490 ms
Running case: three partition columns, 200 partitions
Stopped after 2 iterations, 26581 ms
Running case: three partition columns, 500 partitions
Stopped after 2 iterations, 64463 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
dynamic insert table benchmark:           Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------
one partition column, 10 partitions                 789            870          72         0.0       78864.1       1.0X
one partition column, 50 partitions                2697           2826         182         0.0      269734.5       0.3X
one partition column, 100 partitions               4254           4407         216         0.0      425382.7       0.2X
one partition column, 200 partitions               8057           8162         148         0.0      805674.5       0.1X
one partition column, 500 partitions              18896          19135         338         0.0     1889591.7       0.0X
two partition columns, 10 partitions                754            910         241         0.0       75358.7       1.0X
two partition columns, 50 partitions               2701           2787         122         0.0      270120.7       0.3X
two partition columns, 100 partitions              7341           7894         782         0.0      734065.0       0.1X
two partition columns, 200 partitions              9404           9426          32         0.0      940371.7       0.1X
two partition columns, 500 partitions             23720          26235         NaN         0.0     2371963.0       0.0X
three partition columns, 10 partitions              751            789          38         0.0       75076.4       1.1X
three partition columns, 50 partitions             3802           4071         380         0.0      380180.7       0.2X
three partition columns, 100 partitions            6072           6245         245         0.0      607224.0       0.1X
three partition columns, 200 partitions           12874          13291         590         0.0     1287360.6       0.1X
three partition columns, 500 partitions           31451          32232        1104         0.0     3145143.9       0.0X
```
cc @cloud-fan: there seems to be no essential difference.
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507361926
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+ test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+ try {
+ val partitions = "50"
+ spark = SparkSession.builder().master("local[4]").getOrCreate()
+ val statusStore = spark.sharedState.statusStore
+ val oldExecutionsSize = statusStore.executionsList().size
+
+ spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+ "using parquet partitioned by (part)").collect()
+ spark.sql("insert overwrite table dynamic_partition partition(part) " +
+ s"select id, id % $partitions as part from range(10000)").collect()
Review comment:
No other problems; 15c7519 fixes this.
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r507304846
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+ data.write.saveAsTable(sourceTable)
+ data.count()
+ }
+
+ def writeOnePartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part) " +
Review comment:
done
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+ data.write.saveAsTable(sourceTable)
+ data.count()
+ }
+
+ def writeOnePartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part) " +
+ "select id, part1 as part from sourceTable")
+ }
+ }
+
+ def writeTwoPartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part1, part2) " +
Review comment:
done
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRows(numberRows: Long, sourceTable: String,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+ data.write.saveAsTable(sourceTable)
+ data.count()
+ }
+
+ def writeOnePartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part) " +
+ "select id, part1 as part from sourceTable")
+ }
+ }
+
+ def writeTwoPartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ benchmark.addCase(s"two partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part1, part2) " +
+ "select id, part1, part2 from sourceTable")
+ }
+ }
+
+ def writeThreePartitionColumnTable(tableName: String,
+ partitionNumber: Long, benchmark: Benchmark): Unit = {
+ spark.sql(s"create table $tableName(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet partitioned by (part1, part2, part3)")
+ benchmark.addCase(s"three partition columns, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ s"$tableName partition(part1, part2, part3) " +
Review comment:
done
[GitHub] [spark] wangyum commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707620097
cc @cloud-fan
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714335785
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r503813333
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{LocalSparkSession, SparkSession}
+
+class BasicWriteJobStatsTrackerMetricSuite extends SparkFunSuite with LocalSparkSession {
+
+ test("SPARK-32978: make sure the number of dynamic part metric is correct") {
+ try {
+ val partitions = "50"
+ spark = SparkSession.builder().master("local[4]").getOrCreate()
+ val statusStore = spark.sharedState.statusStore
+ val oldExecutionsSize = statusStore.executionsList().size
+
+ spark.sql("create table dynamic_partition(i bigint, part bigint) " +
+ "using parquet partitioned by (part)").collect()
+ spark.sql("insert overwrite table dynamic_partition partition(part) " +
+ s"select id, id % $partitions as part from range(10000)").collect()
+
+ // Wait for listener to finish computing the metrics for the executions.
+ while (statusStore.executionsList().size - oldExecutionsSize < 4 ||
+ statusStore.executionsList().last.metricValues == null) {
+ Thread.sleep(100)
+ }
+
+ // There should be 4 SQLExecutionUIData in executionsList and the 3rd item is we need,
+ // but the executionId is indeterminate in maven test,
+ // so the `statusStore.execution(executionId)` API is not used.
+ assert(statusStore.executionsCount() == 4)
+ val executionData = statusStore.executionsList()(2)
+ val accumulatorIdOpt =
+ executionData.metrics.find(_.name == "number of dynamic part").map(_.accumulatorId)
+ assert(accumulatorIdOpt.isDefined)
+ val numPartsOpt = executionData.metricValues.get(accumulatorIdOpt.get)
+ assert(numPartsOpt.isDefined && numPartsOpt.get == partitions)
Review comment:
Without the change to `BasicWriteStatsTracker.scala`, numParts would be 200.
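The figure of 200 is consistent with the test setup: with `local[4]` and `range(10000)`, each of the 4 write tasks plausibly sees all 50 values of `id % 50`, so summing per-task counts gives 4 * 50. The sketch below only illustrates that arithmetic. The object and value names are hypothetical and are not Spark APIs, and it assumes every task sees every partition value.

```scala
// Hypothetical illustration of the metric arithmetic; not Spark code.
object DynamicPartMetricSketch {
  // Assumption: 4 write tasks (as with local[4]), each of which
  // encounters all 50 partition values produced by `id % 50`.
  val numTasks = 4
  val numPartitions = 50

  // Partition values observed by each task.
  val perTaskPartitions: Seq[Set[String]] =
    Seq.fill(numTasks)((0 until numPartitions).map(p => s"part=$p").toSet)

  // Reporting a count per task and summing on the driver
  // counts every partition once per task that wrote to it.
  val summedCounts: Int = perTaskPartitions.map(_.size).sum

  // Reporting the partition values and de-duplicating on the driver
  // counts every partition exactly once.
  val distinctCount: Int = perTaskPartitions.flatten.toSet.size

  def main(args: Array[String]): Unit = {
    println(s"sum of per-task counts: $summedCounts")
    println(s"distinct partitions:    $distinctCount")
  }
}
```

Under these assumptions `summedCounts` is 200 and `distinctCount` is 50, which matches the value the test asserts.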
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709234603
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34434/
[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-712648787
cc @cloud-fan Are there any other problems that need to be fixed?
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709449144
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709225205
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34436/
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709932719
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709899037
**[Test build #129885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129885/testReport)** for PR 30026 at commit [`5769222`](https://github.com/apache/spark/commit/57692222be0ec45243c9fc574dd6ff06c87f9024).
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709363125
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710086101
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34504/
[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707652650
@AngersZhuuuu That's what I'm worried about :(
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714577119
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707663518
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711715975
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709923031
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34490/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714451946
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707776725
Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709408879
**[Test build #129843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129843/testReport)** for PR 30026 at commit [`9edf8ad`](https://github.com/apache/spark/commit/9edf8ad99b32ea4b198d13ab197330ff875ba70c).
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707776740
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129741/
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506448618
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
Review comment:
Done.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710060401
**[Test build #129898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129898/testReport)** for PR 30026 at commit [`6d80788`](https://github.com/apache/spark/commit/6d8078814886e94bd54d0058174a47e5ff607723).
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709049996
**[Test build #129828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129828/testReport)** for PR 30026 at commit [`724eee6`](https://github.com/apache/spark/commit/724eee6acfd3754ab14d83d132df492700988cfc).
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709078492
**[Test build #129831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129831/testReport)** for PR 30026 at commit [`6127fa5`](https://github.com/apache/spark/commit/6127fa56abf72ba5112092b9000200934e199b59).
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506147287
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
override def newPartition(partitionValues: InternalRow): Unit = {
- numPartitions += 1
+ partitions = partitions :+ partitionValues
Review comment:
`partitions.appended(partitionValues)` requires Scala 2.13, so `:+` is used here instead.
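A minimal sketch of the compatibility point, assuming plain `String` values as a stand-in for `InternalRow` (the object and method names below are hypothetical and only keep the example self-contained): `:+` compiles on both Scala 2.12 and 2.13, whereas the `appended` alias is only available from 2.13.

```scala
object AppendCompat {
  // Hypothetical stand-in for the tracker's state: strings replace InternalRow.
  def append(partitions: Seq[String], partitionValues: String): Seq[String] =
    partitions :+ partitionValues // available on Scala 2.12 and 2.13

  def main(args: Array[String]): Unit = {
    val updated = append(Seq("part=1"), "part=2")
    println(updated.mkString(", "))
  }
}
```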
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710056295
Addressed in 73e2ea6: reorganized the benchmark file.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711488346
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34581/
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714492579
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709361686
**[Test build #129828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129828/testReport)** for PR 30026 at commit [`724eee6`](https://github.com/apache/spark/commit/724eee6acfd3754ab14d83d132df492700988cfc).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-714451946
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711472789
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34579/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709555923
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707776725
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710300555
**[Test build #129898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129898/testReport)** for PR 30026 at commit [`6d80788`](https://github.com/apache/spark/commit/6d8078814886e94bd54d0058174a47e5ff607723).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r509932012
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
Review comment:
Addressed in fa01951, which fixes this.
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506294887
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
+
+ spark.sql("create table " +
+ "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet")
+
+ spark.sql("insert overwrite table sourceTable " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from tmpTable")
+
+ spark.catalog.dropTempView("tmpTable")
+ data.count()
+ }
+
+ def prepareTable(): Unit = {
+ spark.sql("create table " +
+ "tableOnePartitionColumn(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ spark.sql("create table " +
+ "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ spark.sql("create table " +
+ "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet partitioned by (part1, part2, part3)")
+ }
+
+ def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableOnePartitionColumn partition(part) " +
Review comment:
Can we pass the table name as a parameter? That is more robust, and we don't need to worry about table name mismatches.
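One way this could look, as a rough sketch only: `insertOverwriteSql` and its parameters are hypothetical names, and the select list is an assumption because the quoted diff is cut off before the rest of the statement. The helper only builds the SQL string, so it runs without a `SparkSession`.

```scala
object InsertSqlSketch {
  // Hypothetical helper: the target table is a parameter, so the statement
  // cannot drift from the table name used when the table was created.
  def insertOverwriteSql(
      tableName: String,
      partitionColumns: Seq[String],
      selectExprs: Seq[String],
      sourceTable: String): String = {
    s"insert overwrite table $tableName " +
      s"partition(${partitionColumns.mkString(", ")}) " +
      s"select ${selectExprs.mkString(", ")} from $sourceTable"
  }

  def main(args: Array[String]): Unit = {
    // The select list here is assumed for illustration.
    println(insertOverwriteSql(
      "tableOnePartitionColumn", Seq("part"), Seq("id", "part1"), "sourceTable"))
  }
}
```

The benchmark case would then pass the resulting string to `spark.sql(...)`, with the same `tableName` value reused for the `create table` statement.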
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506294887
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
##########
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Benchmark to measure insert into table with dynamic partition columns.
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to
+ * "benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt".
+ * }}}
+ */
+object InsertTableWithDynamicPartitionsBenchmark extends DataSourceWriteBenchmark {
+
+ override def getSparkSession: SparkSession = {
+ SparkSession.builder().master("local[4]").getOrCreate()
+ }
+
+ def prepareSourceTableAndGetTotalRowsCount(numberRows: Long,
+ part1Step: Int, part2Step: Int, part3Step: Int): Long = {
+ val dataFrame = spark.range(0, numberRows, 1, 4)
+ val dataFrame1 = spark.range(0, numberRows, part1Step, 4)
+ val dataFrame2 = spark.range(0, numberRows, part2Step, 4)
+ val dataFrame3 = spark.range(0, numberRows, part3Step, 4)
+
+ val data = dataFrame.join(dataFrame1).join(dataFrame2).join(dataFrame3)
+ .toDF("id", "part1", "part2", "part3")
+
+ data.createOrReplaceTempView("tmpTable")
+
+ spark.sql("create table " +
+ "sourceTable(id bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet")
+
+ spark.sql("insert overwrite table sourceTable " +
+ s"select id, " +
+ s"part1, " +
+ s"part2, " +
+ s"part3 " +
+ s"from tmpTable")
+
+ spark.catalog.dropTempView("tmpTable")
+ data.count()
+ }
+
+ def prepareTable(): Unit = {
+ spark.sql("create table " +
+ "tableOnePartitionColumn(i bigint, part bigint) " +
+ "using parquet partitioned by (part)")
+ spark.sql("create table " +
+ "tableTwoPartitionColumn(i bigint, part1 bigint, part2 bigint) " +
+ "using parquet partitioned by (part1, part2)")
+ spark.sql("create table " +
+ "tableThreePartitionColumn(i bigint, part1 bigint, part2 bigint, part3 bigint) " +
+ "using parquet partitioned by (part1, part2, part3)")
+ }
+
+ def writeOnePartitionColumnTable(partitionNumber: Long, benchmark: Benchmark): Unit = {
+ benchmark.addCase(s"one partition column, $partitionNumber partitions") { _ =>
+ spark.sql("insert overwrite table " +
+ "tableOnePartitionColumn partition(part) " +
Review comment:
Can we pass the table name as a parameter? That is more robust.
----------------------------------------------------------------
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r505137273
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.util.SerializableConfiguration
+
/**
* Simple metrics collected during an instance of [[FileFormatDataWriter]].
* These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
*/
case class BasicWriteTaskStats(
- numPartitions: Int,
+ partitions: Seq[InternalRow],
Review comment:
OK. I'm busy with my own work right now and will give feedback later.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711456094
**[Test build #129971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129971/testReport)** for PR 30026 at commit [`d630653`](https://github.com/apache/spark/commit/d6306530ff512678bdf5e683417f70250c07fca5).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707775497
**[Test build #129741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129741/testReport)** for PR 30026 at commit [`0769ac1`](https://github.com/apache/spark/commit/0769ac1fbe3b379fc6482e88095e57a895b149f2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711465962
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34579/
----------------------------------------------------------------
[GitHub] [spark] LuciferYang edited a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-712648787
@cloud-fan Are there any other problems that need to be fixed?
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r503945310
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.util.SerializableConfiguration
+
/**
* Simple metrics collected during an instance of [[FileFormatDataWriter]].
* These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
*/
case class BasicWriteTaskStats(
- numPartitions: Int,
+ partitions: Seq[InternalRow],
Review comment:
This increases the data size we need to transfer between executors and the driver. Do we have a microbenchmark to verify the impact?
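To make the trade-off concrete, here is a rough sketch of the two reporting styles. It assumes `Seq[Long]` as a stand-in for `InternalRow`, and all names in it are hypothetical; it is not the PR's implementation. Reporting counts keeps the payload tiny but lets the driver only sum them, while reporting values lets the driver remove duplicates at the cost of shipping every value.

```scala
object PartitionDedupSketch {
  // Hypothetical stand-in: one partition is the Seq of its partition-column values.
  type PartitionValues = Seq[Long]

  // Count-based reporting: each task sends one number, the driver can only add them up.
  def sumOfCounts(perTask: Seq[Seq[PartitionValues]]): Int =
    perTask.map(_.size).sum

  // Value-based reporting: each task sends its partition values, the driver de-duplicates.
  def distinctCount(perTask: Seq[Seq[PartitionValues]]): Int =
    perTask.flatten.toSet.size

  def main(args: Array[String]): Unit = {
    val task1 = Seq(Seq(1L), Seq(2L))
    val task2 = Seq(Seq(2L), Seq(3L)) // partition 2 is also written by task1
    println(sumOfCounts(Seq(task1, task2)))   // prints 4 (partition 2 counted twice)
    println(distinctCount(Seq(task1, task2))) // prints 3
  }
}
```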
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r504061074
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -30,12 +32,13 @@ import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.util.SerializableConfiguration
+
/**
* Simple metrics collected during an instance of [[FileFormatDataWriter]].
* These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
*/
case class BasicWriteTaskStats(
- numPartitions: Int,
+ partitions: Seq[InternalRow],
Review comment:
Please measure the end-to-end performance of an INSERT query into a partitioned table with 1, 2, or 3 partition columns.
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-710096702
[GitHub] [spark] AngersZhuuuu commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709071052
The size of the data returned to the driver is (number of partitions) * (number of shuffle partitions), which can easily reach the millions.
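To make the concern concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case in which every write task sees every dynamic partition; `ReturnSizeEstimate`, `maxRowsReturned`, and the numbers used are hypothetical and are not taken from the PR or from a real workload.

```scala
// Hypothetical helper, not part of Spark: upper bound on the number of
// partition-value rows that tasks would send back to the driver.
object ReturnSizeEstimate {
  // Worst case: each write task reports every dynamic partition once.
  def maxRowsReturned(numDynamicPartitions: Long, numWriteTasks: Long): Long =
    numDynamicPartitions * numWriteTasks
}

// Illustrative numbers only: 5,000 partitions written by 2,000 tasks.
val rows = ReturnSizeEstimate.maxRowsReturned(5000L, 2000L)
println(rows) // 10000000
```

With the previous `numPartitions: Int` field the same job would return one integer per task, so the difference scales with the number of partitions each task touches.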
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711606360
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709237332
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34436/
[GitHub] [spark] LuciferYang commented on a change in pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #30026:
URL: https://github.com/apache/spark/pull/30026#discussion_r506148242
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -139,20 +142,22 @@ class BasicWriteJobStatsTracker(
override def processStats(stats: Seq[WriteTaskStats]): Unit = {
val sparkContext = SparkContext.getActive.get
- var numPartitions: Long = 0L
+ var partitionsSet: mutable.Set[InternalRow] = mutable.HashSet.empty
var numFiles: Long = 0L
var totalNumBytes: Long = 0L
var totalNumOutput: Long = 0L
val basicStats = stats.map(_.asInstanceOf[BasicWriteTaskStats])
basicStats.foreach { summary =>
- numPartitions += summary.numPartitions
+ partitionsSet ++= summary.partitions
Review comment:
Ditto: `partitionsSet.addAll(summary.partitions)` is also only available in Scala 2.13.
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -76,7 +79,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
override def newPartition(partitionValues: InternalRow): Unit = {
- numPartitions += 1
+ partitions = partitions :+ partitionValues
Review comment:
`partitions.appended(partitionValues)` is only available in Scala 2.13.
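For reference, a minimal self-contained sketch of the two operators used in the diff, which compile on both Scala 2.12 and 2.13. `PartitionValues` is a hypothetical stand-in for `InternalRow` so the example runs without Spark; it relies on case-class structural equality, which may not hold for every `InternalRow` implementation. `CrossVersionCollections` and its methods are likewise illustrative names, not code from the PR.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's InternalRow (assumption: structural equality).
final case class PartitionValues(values: Seq[Any])

object CrossVersionCollections {
  // Task side: `:+` exists on Seq in both 2.12 and 2.13,
  // whereas `appended` was only added in 2.13.
  def recordPartition(seen: Seq[PartitionValues], p: PartitionValues): Seq[PartitionValues] =
    seen :+ p

  // Driver side: `++=` exists on mutable.Set in both 2.12 and 2.13,
  // whereas `addAll` was only added in 2.13.
  def distinctPartitions(perTask: Seq[Seq[PartitionValues]]): Int = {
    val merged = mutable.HashSet.empty[PartitionValues]
    perTask.foreach { partitions => merged ++= partitions }
    merged.size
  }
}

// Two tasks that both write to the partition "2020-10-14".
val task1 = CrossVersionCollections.recordPartition(
  Seq(PartitionValues(Seq("2020-10-13"))), PartitionValues(Seq("2020-10-14")))
val task2 = Seq(PartitionValues(Seq("2020-10-14")))

println(task1.size + task2.size)                                     // 3 (per-task counts summed)
println(CrossVersionCollections.distinctPartitions(Seq(task1, task2))) // 2 (deduplicated)
```

Summing per-task counts double-counts the shared partition, while merging the reported values into a set on the driver does not.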
[GitHub] [spark] AmplabJenkins commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-707663518
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-711606360
[GitHub] [spark] SparkQA commented on pull request #30026: [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30026:
URL: https://github.com/apache/spark/pull/30026#issuecomment-709391509
**[Test build #129831 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129831/testReport)** for PR 30026 at commit [`6127fa5`](https://github.com/apache/spark/commit/6127fa56abf72ba5112092b9000200934e199b59).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.