Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/01/14 08:42:54 UTC
[GitHub] [spark] AngersZhuuuu opened a new pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
AngersZhuuuu opened a new pull request #31179:
URL: https://github.com/apache/spark/pull/31179
### What changes were proposed in this pull request?
Currently, when we insert into a Hive table, we only update `sizeInBytes`; `rowCount` and the partition-level statistics are not updated.
This PR collects per-partition statistics and total statistics while writing data, then uses them to update the statistics in the metadata.
### Why are the changes needed?
Keeping table and partition statistics up to date after a write is useful.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811754099
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136790/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811659269
**[Test build #136795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136795/testReport)** for PR 31179 at commit [`70e7425`](https://github.com/apache/spark/commit/70e74254acc17ca02ba90c70a7b097b39308ee65).
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811660012
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41376/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760091789
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38632/
[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773151680
> how big is the overhead? I had an impression that auto stats update is very expensive and not many people are using it...
With the original approach:
1. We only update `sizeInBytes`, not `rowCount`, so to get `rowCount` we need to re-run the ANALYZE command, which adds overhead.
2. When updating `sizeInBytes`, the original logic needs to fetch the status of every file under the target directory. Since Spark often has a small-files problem, this is slow, especially when the user's HDFS is slow.
With the current approach:
1. We collect row count info while writing data and return it to the driver through `BasicWriteJobStatsTracker`. As discussed before in https://github.com/apache/spark/pull/30026#issuecomment-709868109, returning partition info through `BasicWriteJobStatsTracker` is not a concern, so carrying a `PartitionsStats` in this PR shouldn't be a concern either.
2. We just use the metric data from `BasicWriteJobStatsTracker` to update the statistics metadata. No other behavior changes. This should be faster than the original approach, since we don't need to fetch the status of every file.
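To make the comparison above concrete, here is a minimal sketch of the aggregation idea: write tasks report per-partition byte and row counts, and the driver merges them instead of listing files. `WriteStats`, `WriteStatsSketch`, `mergeByPartition`, and `total` are hypothetical names used only for illustration; they are not the classes or methods in this PR or in Spark.

```scala
// Hypothetical stand-in for the per-partition metrics a write task reports.
// This is NOT Spark's BasicWriteJobStatsTracker or the PR's PartitionsStats.
final case class WriteStats(numBytes: Long, numRows: Long)

object WriteStatsSketch {
  // Merge the per-partition stats reported by each task into one map,
  // summing bytes and rows for partitions written by several tasks.
  def mergeByPartition(
      taskReports: Seq[Map[String, WriteStats]]): Map[String, WriteStats] =
    taskReports.flatten.groupBy(_._1).map { case (partition, entries) =>
      val stats = entries.map(_._2)
      partition -> WriteStats(stats.map(_.numBytes).sum, stats.map(_.numRows).sum)
    }

  // Table-level totals derived from the merged per-partition stats,
  // with no file listing involved.
  def total(perPartition: Map[String, WriteStats]): WriteStats =
    WriteStats(
      perPartition.values.map(_.numBytes).sum,
      perPartition.values.map(_.numRows).sum)
}
```

Under these assumptions, the cost of producing both `sizeInBytes` and `rowCount` depends on the number of partitions written rather than on the number of files under the target directory.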
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772230432
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39394/
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811635011
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41373/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811659972
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811685509
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41378/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808871039
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41194/
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115053
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139856/
[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811634186
**[Test build #136791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136791/testReport)** for PR 31179 at commit [`be37e31`](https://github.com/apache/spark/commit/be37e3153fbf07e81f0536a83b9214063fa9704e).
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862229797
**[Test build #139871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139871/testReport)** for PR 31179 at commit [`7fdf7d0`](https://github.com/apache/spark/commit/7fdf7d0ef79d78bb015eb92cc78bc0f7df607208).
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772270200
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39394/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808897476
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136612/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r569154926
##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##########
@@ -2581,6 +2581,59 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
}
}
}
+
+ test("xxx") {
Review comment:
Looks like we should have a proper test name btw
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r655843657
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
Review comment:
> I think the question was why we should separate them.
Because when a single partition is written, it returns the same statistics data as an insert into a non-partitioned table.
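For reference, the two predicates from the diff above can be lifted into a standalone sketch to show which partition specs fall into which branch. The wrapper object `PartitionSpecSketch` and the sample column names are assumptions for illustration; only the boolean expressions are taken from the diff.

```scala
object PartitionSpecSketch {
  // Every partition column has a static value, e.g. PARTITION (dt='a', hr='b').
  def isSinglePartition(spec: Map[String, Option[String]]): Boolean =
    spec.nonEmpty && spec.values.forall(_.nonEmpty)

  // A mix of static and dynamic partition columns, e.g. PARTITION (dt='a', hr).
  def isPartialPartitions(spec: Map[String, Option[String]]): Boolean =
    spec.nonEmpty && spec.values.exists(_.isEmpty) && spec.values.exists(_.nonEmpty)
}
```

With these definitions, `Map("dt" -> Some("a"), "hr" -> Some("b"))` is a single partition, `Map("dt" -> Some("a"), "hr" -> None)` is a partial spec, and both an empty spec (non-partitioned table) and a fully dynamic spec such as `Map("dt" -> None)` satisfy neither predicate.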
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862226856
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44385/
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811791013
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136791/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811660012
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41376/
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808871039
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41194/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771734983
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39367/
[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-809020307
Any more suggestions?
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862214361
Kubernetes integration test unable to build dist.
exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44385/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811619794
**[Test build #136790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136790/testReport)** for PR 31179 at commit [`31821ff`](https://github.com/apache/spark/commit/31821ffe46ab1b95a536a1a65727448a9cd47941).
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811633429
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41373/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862108939
**[Test build #139856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)** for PR 31179 at commit [`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772345392
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39402/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760030667
**[Test build #134046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134046/testReport)** for PR 31179 at commit [`175619c`](https://github.com/apache/spark/commit/175619c1c6193082aca4c3db0521dbda3cec8358).
[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773113604
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811685523
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41378/
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r655843657
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+  def updateTableAndPartitionStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      overwrite: Boolean,
+      partitionSpec: Map[String, Option[String]],
+      statsTracker: BasicWriteJobStatsTracker): Unit = {
+    val catalog = sparkSession.sessionState.catalog
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+      val isPartialPartitions = partitionSpec.nonEmpty &&
+        partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+      if (overwrite) {
+        // Only update one partition, statsTracker.partitionsStats is empty
+        if (isSinglePartition) {
Review comment:
> I think the question was why we should separate them.
Because when writing a single partition, the returned statistics are the same as for an insert into a non-partitioned table.
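The distinction drawn in the diff above can be illustrated with a minimal, self-contained sketch. It assumes the same `Map[String, Option[String]]` shape for `partitionSpec` as in the diff, with `Some(v)` read as a static partition value and `None` as a dynamic one. `PartitionSpecKinds`, `SpecKind` and `classify` are hypothetical names introduced only for illustration; they are not part of the PR.

```scala
// Hypothetical helper, not part of the PR: classifies a partition spec using
// the same two predicates that appear in the diff above.
object PartitionSpecKinds {
  sealed trait SpecKind
  case object NonPartitioned extends SpecKind    // empty spec
  case object SinglePartition extends SpecKind   // every column has a static value
  case object PartialPartitions extends SpecKind // mix of static and dynamic columns
  case object AllDynamic extends SpecKind        // no column has a static value

  def classify(partitionSpec: Map[String, Option[String]]): SpecKind = {
    val isSinglePartition =
      partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
    val isPartialPartitions = partitionSpec.nonEmpty &&
      partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)

    if (partitionSpec.isEmpty) NonPartitioned
    else if (isSinglePartition) SinglePartition
    else if (isPartialPartitions) PartialPartitions
    else AllDynamic
  }
}
```

For example, `PartitionSpecKinds.classify(Map("dt" -> Some("2021-01-14"), "hr" -> None))` falls into the `PartialPartitions` case, while a spec whose values are all `Some(...)` is treated as a single partition.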
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772281834
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39394/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811681712
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41378/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772454882
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134814/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772326648
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39402/
[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808865653
**[Test build #136612 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136612/testReport)** for PR 31179 at commit [`d0e27d0`](https://github.com/apache/spark/commit/d0e27d0031a095bd66ebe13ae16ba29440818191).
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811634996
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41373/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811634186
**[Test build #136791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136791/testReport)** for PR 31179 at commit [`be37e31`](https://github.com/apache/spark/commit/be37e3153fbf07e81f0536a83b9214063fa9704e).
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r605371258
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
##########
@@ -251,7 +252,18 @@ class DynamicPartitionDataWriter(
       // See a new partition or bucket - write to a new partition dir (or a new bucket file).
       if (isPartitioned && currentPartitionValues != nextPartitionValues) {
         currentPartitionValues = Some(nextPartitionValues.get.copy())
-        statsTrackers.foreach(_.newPartition(currentPartitionValues.get))
+        val partitionSpec: Map[String, String] = description.partitionColumns.map(attr => {
+          val proj = UnsafeProjection.create(Seq(attr), description.partitionColumns)
Review comment:
> These `UnsafeProjection` objects can be created and reused?
It seems we don't need this; I have removed it now.
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811822932
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136795/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771660086
**[Test build #134785 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134785/testReport)** for PR 31179 at commit [`fc82c38`](https://github.com/apache/spark/commit/fc82c3885f32a12ca33c96f7c98e330b73558dec).
[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773113604
Gentle ping @cloud-fan @viirya
[GitHub] [spark] cloud-fan commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773131159
How big is the overhead? I was under the impression that auto stats update is very expensive and that not many people use it...
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811822932
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136795/
[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811619794
**[Test build #136790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136790/testReport)** for PR 31179 at commit [`31821ff`](https://github.com/apache/spark/commit/31821ffe46ab1b95a536a1a65727448a9cd47941).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811635011
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41373/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772210554
**[Test build #134806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134806/testReport)** for PR 31179 at commit [`fc82c38`](https://github.com/apache/spark/commit/fc82c3885f32a12ca33c96f7c98e330b73558dec).
[GitHub] [spark] github-actions[bot] commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-931793103
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772281834
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39394/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772433966
**[Test build #134814 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134814/testReport)** for PR 31179 at commit [`68e0256`](https://github.com/apache/spark/commit/68e025600fcea3a5ee65206c4f62c71effd5acdd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808865653
**[Test build #136612 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136612/testReport)** for PR 31179 at commit [`d0e27d0`](https://github.com/apache/spark/commit/d0e27d0031a095bd66ebe13ae16ba29440818191).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811791013
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136791/
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771690647
[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-818634558
Any more suggestions?
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772284221
**[Test build #134814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134814/testReport)** for PR 31179 at commit [`68e0256`](https://github.com/apache/spark/commit/68e025600fcea3a5ee65206c4f62c71effd5acdd).
[GitHub] [spark] cloud-fan commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-810305078
The idea looks good. @viirya what do you think?
[GitHub] [spark] viirya commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r604562571
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
##########
@@ -251,7 +252,18 @@ class DynamicPartitionDataWriter(
       // See a new partition or bucket - write to a new partition dir (or a new bucket file).
       if (isPartitioned && currentPartitionValues != nextPartitionValues) {
         currentPartitionValues = Some(nextPartitionValues.get.copy())
-        statsTrackers.foreach(_.newPartition(currentPartitionValues.get))
+        val partitionSpec: Map[String, String] = description.partitionColumns.map(attr => {
+          val proj = UnsafeProjection.create(Seq(attr), description.partitionColumns)
Review comment:
These `UnsafeProjection` objects can be created and reused?
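The general idea behind this suggestion can be sketched without Spark on the classpath: build one extractor per partition column once, when the writer is constructed, and reuse those extractors every time the partition changes. In the sketch below, `Column`, `ColumnExtractor` and `PartitionSpecBuilder` are hypothetical stand-ins introduced for illustration; they are not Spark classes, and real code would hold `UnsafeProjection` instances instead.

```scala
// Hypothetical stand-ins, not Spark APIs: a column with a fixed position in the row.
final case class Column(name: String, ordinal: Int)

// Stands in for an object that is comparatively expensive to construct.
final class ColumnExtractor(col: Column) {
  def apply(row: IndexedSeq[Any]): String = row(col.ordinal).toString
}

final class PartitionSpecBuilder(partitionColumns: Seq[Column]) {
  // Built once per writer, instead of once per partition change.
  private val extractors: Seq[(String, ColumnExtractor)] =
    partitionColumns.map(c => c.name -> new ColumnExtractor(c))

  // Called on every partition change; only reuses the prebuilt extractors.
  def specFor(row: IndexedSeq[Any]): Map[String, String] =
    extractors.map { case (name, extract) => name -> extract(row) }.toMap
}
```

A writer would then keep a single `PartitionSpecBuilder` as a field and call `specFor` with the current partition values whenever it sees a new partition.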
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760111168
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772454882
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134814/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760096307
**[Test build #134046 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134046/testReport)** for PR 31179 at commit [`175619c`](https://github.com/apache/spark/commit/175619c1c6193082aca4c3db0521dbda3cec8358).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `case class PartitionStats(var numFiles: Int = 0, var numBytes: Long = 0, var numRows: Long = 0) `
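
As a rough illustration of how a class of this shape could be used to collect both per-partition and total statistics during a write, here is a minimal sketch. Only the `PartitionStats` signature is taken from the build report above; `StatsAccumulator`, `recordFile`, `partitionStats` and `total` are hypothetical names, and this is not the PR's implementation.

```scala
import scala.collection.mutable

// Same signature as the class listed in the build report above.
case class PartitionStats(var numFiles: Int = 0, var numBytes: Long = 0, var numRows: Long = 0)

// Hypothetical accumulator keyed by partition spec; an assumption for illustration only.
final class StatsAccumulator {
  private val byPartition = mutable.Map.empty[Map[String, String], PartitionStats]

  // Record one finished file that belongs to the given partition.
  def recordFile(spec: Map[String, String], bytes: Long, rows: Long): Unit = {
    val stats = byPartition.getOrElseUpdate(spec, PartitionStats())
    stats.numFiles += 1
    stats.numBytes += bytes
    stats.numRows += rows
  }

  // Snapshot of the per-partition statistics collected so far.
  def partitionStats: Map[Map[String, String], PartitionStats] = byPartition.toMap

  // Table-level totals obtained by summing over all partitions.
  def total: PartitionStats =
    byPartition.values.foldLeft(PartitionStats()) { (acc, s) =>
      PartitionStats(acc.numFiles + s.numFiles, acc.numBytes + s.numBytes, acc.numRows + s.numRows)
    }
}
```

The per-partition map would feed partition-level metadata updates, and `total` would feed the table-level `sizeInBytes` and `rowCount`.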
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772350624
**[Test build #134806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134806/testReport)** for PR 31179 at commit [`fc82c38`](https://github.com/apache/spark/commit/fc82c3885f32a12ca33c96f7c98e330b73558dec).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r569161195
##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##########
@@ -2581,6 +2581,59 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
}
}
}
+
+ test("xxx") {
Review comment:
> Looks like we should have a proper test name btw
Hmm, updated.
[GitHub] [spark] github-actions[bot] closed pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #31179:
URL: https://github.com/apache/spark/pull/31179
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862226856
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44385/
[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862108939
**[Test build #139856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)** for PR 31179 at commit [`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771734983
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39367/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772345392
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39402/
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r604554605
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
##########
@@ -251,7 +252,18 @@ class DynamicPartitionDataWriter(
// See a new partition or bucket - write to a new partition dir (or a new bucket file).
if (isPartitioned && currentPartitionValues != nextPartitionValues) {
currentPartitionValues = Some(nextPartitionValues.get.copy())
- statsTrackers.foreach(_.newPartition(currentPartitionValues.get))
+ val partitionSpec: Map[String, String] = description.partitionColumns.map(attr => {
+ val proj = UnsafeProjection.create(Seq(attr), description.partitionColumns)
+ val attrRow = proj(currentPartitionValues.get)
+ val value = if (attrRow.isNullAt(0)) {
+ null
+ } else {
+ Cast(Literal(attrRow.get(0, attr.dataType), attr.dataType),
+ StringType, Some(SQLConf.get.sessionLocalTimeZone)).eval().toString
+ }
+ attr.name -> value
Review comment:
> This additional projection does not look good for performance.
So should we skip the projection here and do it on the driver side when updating the metrics?
[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772204354
retest this please
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808893689
**[Test build #136612 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136612/testReport)** for PR 31179 at commit [`d0e27d0`](https://github.com/apache/spark/commit/d0e27d0031a095bd66ebe13ae16ba29440818191).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772352785
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134806/
[GitHub] [spark] viirya commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r604274317
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
##########
@@ -251,7 +252,18 @@ class DynamicPartitionDataWriter(
// See a new partition or bucket - write to a new partition dir (or a new bucket file).
if (isPartitioned && currentPartitionValues != nextPartitionValues) {
currentPartitionValues = Some(nextPartitionValues.get.copy())
- statsTrackers.foreach(_.newPartition(currentPartitionValues.get))
+ val partitionSpec: Map[String, String] = description.partitionColumns.map(attr => {
+ val proj = UnsafeProjection.create(Seq(attr), description.partitionColumns)
+ val attrRow = proj(currentPartitionValues.get)
+ val value = if (attrRow.isNullAt(0)) {
+ null
+ } else {
+ Cast(Literal(attrRow.get(0, attr.dataType), attr.dataType),
+ StringType, Some(SQLConf.get.sessionLocalTimeZone)).eval().toString
+ }
+ attr.name -> value
Review comment:
This additional projection does not look good for performance.
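Editorial sketch, not part of the PR: in the hunk above, `UnsafeProjection.create` runs once per partition column every time the writer sees a new partition. One possible direction, besides moving the work to the driver, is to build the per-column readers once and reuse them. The plain-Scala sketch below only illustrates that hoisting pattern; `Column`, `SpecBuilder`, and the row-as-`IndexedSeq` model are hypothetical stand-ins, not Spark classes.

```scala
object PartitionSpecSketch {
  // Hypothetical stand-in for a partition column: a name plus its position in the row.
  final case class Column(name: String, ordinal: Int)

  final class SpecBuilder(columns: Seq[Column]) {
    // Built once when the writer is created, then reused for every new partition.
    private val extractors: Seq[(String, IndexedSeq[Any] => String)] = columns.map { c =>
      val read: IndexedSeq[Any] => String = row => {
        val v = row(c.ordinal)
        if (v == null) null else v.toString // null values stay null, as in the hunk
      }
      c.name -> read
    }

    // Called on each partition change; performs no per-call setup work.
    def build(partitionValues: IndexedSeq[Any]): Map[String, String] =
      extractors.map { case (name, read) => name -> read(partitionValues) }.toMap
  }
}
```

The real code would still need Spark's type-aware string cast; the sketch uses `toString` only to stay self-contained.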
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862224107
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44388/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771690642
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r605358714
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
Review comment:
> Why do we need to handle the single partition case and the non-single partition case separately?
yes
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771707903
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39367/
[GitHub] [spark] AngersZhuuuu removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-818634558
Any more suggestions?
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760111170
[GitHub] [spark] cloud-fan commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-773131159
How big is the overhead? I had the impression that auto stats update is very expensive and that not many people are using it...
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115008
**[Test build #139856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)** for PR 31179 at commit [`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] maropu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r604513642
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
+ key -> value.get
+ }
+ val partition = catalog.listPartitions(table.identifier, Some(spec))
+ val newTableStats = CommandUtils.mergeNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ val newPartitions = partition.flatten { part =>
+ val newStates = if (part.stats.isDefined && part.stats.get.rowCount.isDefined) {
+ CommandUtils.mergeNewStats(
Review comment:
ditto
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -199,6 +312,17 @@ object CommandUtils extends Logging {
newStats
}
+ def mergeNewStats(
Review comment:
private
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
+ key -> value.get
+ }
+ val partition = catalog.listPartitions(table.identifier, Some(spec))
+ val newTableStats = CommandUtils.mergeNewStats(
Review comment:
`CommandUtils.mergeNewStats` -> `mergeNewStats`
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
Review comment:
`val spec = partitionSpec.mapValues(_.get)`
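Editorial sketch, not part of the PR: the two spellings compared in plain Scala. Both call `.get` on every value, so both are safe only under the `isSinglePartition` guard, where every value is defined. Note that `mapValues` returns a lazy view (and is deprecated on `Map` in Scala 2.13), so the sketch adds `.toMap` to get a strict map; the object and method names are illustrative.

```scala
object SpecValues {
  // Explicit form used in the patch: builds a strict Map immediately.
  def viaMap(spec: Map[String, Option[String]]): Map[String, String] =
    spec.map { case (key, value) => key -> value.get }

  // Shorter form from the review; .toMap forces the lazy view into a strict Map.
  def viaMapValues(spec: Map[String, Option[String]]): Map[String, String] =
    spec.mapValues(_.get).toMap
}
```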
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
+ key -> value.get
+ }
+ val partition = catalog.listPartitions(table.identifier, Some(spec))
+ val newTableStats = CommandUtils.mergeNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ val newPartitions = partition.flatten { part =>
+ val newStates = if (part.stats.isDefined && part.stats.get.rowCount.isDefined) {
+ CommandUtils.mergeNewStats(
+ part.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ } else {
+ CommandUtils.compareAndGetNewStats(
+ part.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ }
+ newStates.map(_ => part.copy(stats = newStates))
+ }
+ if (newTableStats.isDefined) {
+ catalog.alterTableStats(table.identifier, newTableStats)
+ }
+ if (newPartitions.nonEmpty) {
+ catalog.alterPartitions(table.identifier, newPartitions)
+ }
+ } else {
+ // update all partitions statistics
+ val partitions = statsTracker.partitionsStats.map { case (part, stats) =>
+ val partition = catalog.getPartition(table.identifier, part)
+ val newStats = Some(CatalogStatistics(
+ sizeInBytes = stats.numBytes, rowCount = Some(stats.numRows)))
+ partition.copy(stats = newStats)
+ }.toSeq
+ if (partitions.nonEmpty) {
+ catalog.alterPartitions(table.identifier, partitions)
+ }
+
+ if (isPartialPartitions) {
+ val newStats = CommandUtils.mergeNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ if (newStats.isDefined) {
+ catalog.alterTableStats(table.identifier, newStats)
+ }
+ } else {
+ val newStats = CommandUtils.compareAndGetNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ if (newStats.isDefined) {
+ catalog.alterTableStats(table.identifier, newStats)
+ }
+ }
+ }
+ } else {
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
Review comment:
`val spec = partitionSpec.mapValues(_.get)`
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
Review comment:
Could you move `isPartialPartitions` into L102?
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -38,25 +39,40 @@ import org.apache.spark.util.SerializableConfiguration
* These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
*/
case class BasicWriteTaskStats(
- partitions: Seq[InternalRow],
- numFiles: Int,
- numBytes: Long,
- numRows: Long)
+ partitionsStats: mutable.Map[TablePartitionSpec, PartitionStats],
Review comment:
partitionsStats -> partitionSpecWithStats
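Editorial sketch, not part of the PR: a plain-Scala illustration of how per-partition stats keyed by partition spec could be folded from task results into job-level totals. `PartitionStats` copies the shape reported by the test build; the `TablePartitionSpec` alias, `merge`, and `totals` are assumptions made for this example only.

```scala
import scala.collection.mutable

object WriteStatsSketch {
  // Same shape as the class added by the patch; mutable fields allow in-place updates.
  final case class PartitionStats(
      var numFiles: Int = 0,
      var numBytes: Long = 0,
      var numRows: Long = 0)

  // Assumption: a partition spec is a column-name -> value map.
  type TablePartitionSpec = Map[String, String]

  // Fold one task's per-partition stats into the job-level accumulator.
  def merge(
      job: mutable.Map[TablePartitionSpec, PartitionStats],
      task: collection.Map[TablePartitionSpec, PartitionStats]): Unit =
    task.foreach { case (spec, s) =>
      val acc = job.getOrElseUpdate(spec, PartitionStats())
      acc.numFiles += s.numFiles
      acc.numBytes += s.numBytes
      acc.numRows += s.numRows
    }

  // Table-level totals are the sums over all partitions.
  def totals(job: collection.Map[TablePartitionSpec, PartitionStats]): PartitionStats =
    job.values.foldLeft(PartitionStats()) { (t, s) =>
      PartitionStats(t.numFiles + s.numFiles, t.numBytes + s.numBytes, t.numRows + s.numRows)
    }
}
```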
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
Review comment:
Why do we need to handle the single partition case and the non-single partition case separately?
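Editorial sketch, not part of the PR: the two predicates in the hunk split a `partitionSpec` into four cases, which may help when reading the branches. The plain-Scala sketch below reuses the predicates verbatim; the `Kind` names are illustrative labels, not Spark terminology.

```scala
object PartitionSpecKind {
  sealed trait Kind
  case object EmptySpec extends Kind    // no partition columns listed
  case object FullyStatic extends Kind  // every listed column has a value
  case object Partial extends Kind      // some columns static, some dynamic
  case object FullyDynamic extends Kind // every listed column is dynamic

  // A value of None marks a dynamic partition column, as in the hunk.
  def classify(partitionSpec: Map[String, Option[String]]): Kind = {
    val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
    val isPartialPartitions = partitionSpec.nonEmpty &&
      partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
    if (partitionSpec.isEmpty) EmptySpec
    else if (isSinglePartition) FullyStatic
    else if (isPartialPartitions) Partial
    else FullyDynamic
  }
}
```

Only the `FullyStatic` case identifies exactly one partition up front, which is the case the hunk's comment says leaves `statsTracker.partitionsStats` empty.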
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
+ key -> value.get
+ }
+ val partition = catalog.listPartitions(table.identifier, Some(spec))
+ val newTableStats = CommandUtils.mergeNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ val newPartitions = partition.flatten { part =>
Review comment:
This block seems to be the same as the `overwrite=false` case? https://github.com/apache/spark/pull/31179/files#diff-6309057f8f41f20f8de513ab67d7755aae5fb30d7441fc21000999c9e8e8e0bfR125-R140
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862114525
**[Test build #139860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139860/testReport)** for PR 31179 at commit [`7fdf7d0`](https://github.com/apache/spark/commit/7fdf7d0ef79d78bb015eb92cc78bc0f7df607208).
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811685523
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41378/
[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772284221
**[Test build #134814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134814/testReport)** for PR 31179 at commit [`68e0256`](https://github.com/apache/spark/commit/68e025600fcea3a5ee65206c4f62c71effd5acdd).
[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760030667
**[Test build #134046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134046/testReport)** for PR 31179 at commit [`175619c`](https://github.com/apache/spark/commit/175619c1c6193082aca4c3db0521dbda3cec8358).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771690642
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115053
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139856/
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r557215501
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##########
@@ -135,30 +163,35 @@ class BasicWriteJobStatsTracker(
@transient val metrics: Map[String, SQLMetric])
extends WriteJobStatsTracker {
+ @transient val partitionsStats: mutable.Map[TablePartitionSpec, PartitionStats] =
+ mutable.Map.empty
+ @transient var numFiles: Long = 0L
+ @transient var totalNumBytes: Long = 0L
+ @transient var totalNumOutput: Long = 0L
Review comment:
It seems we need to use an accumulator here?
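To illustrate the concern, here is a minimal sketch in plain Scala (no Spark dependency) of what happens when a plain mutable field is updated on a deserialized copy of an object. The class `CounterHolder` and the object `SerializedCopyDemo` are hypothetical stand-ins, not code from this PR. Whether the concern applies depends on where the fields are mutated: if they are only updated in driver-side code, plain fields are sufficient.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a tracker that keeps a running total in a plain field.
class CounterHolder extends Serializable {
  @transient var totalNumBytes: Long = 0L
}

object SerializedCopyDemo {
  // Round-trips a value through Java serialization, which is roughly what
  // happens to an object that is shipped from the driver to a task.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // Mutates the deserialized copy and returns (original total, copy total).
  def demo(): (Long, Long) = {
    val original = new CounterHolder
    val copy = roundTrip(original)
    copy.totalNumBytes += 100L
    (original.totalNumBytes, copy.totalNumBytes)
  }
}
```

The update made on the copy never reaches the original instance, which is the situation an accumulator is designed to handle.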
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771685532
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39367/
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808897476
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136612/
[GitHub] [spark] AngersZhuuuu commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862214478
retest this please
[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811659269
**[Test build #136795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136795/testReport)** for PR 31179 at commit [`70e7425`](https://github.com/apache/spark/commit/70e74254acc17ca02ba90c70a7b097b39308ee65).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r655838827
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
Review comment:
I think the question was why we should separate them.
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata metrics's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-760077095
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38632/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772352785
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134806/
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811754099
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136790/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811742515
**[Test build #136790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136790/testReport)** for PR 31179 at commit [`31821ff`](https://github.com/apache/spark/commit/31821ffe46ab1b95a536a1a65727448a9cd47941).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-771690647
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772316870
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39402/
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811816383
**[Test build #136795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136795/testReport)** for PR 31179 at commit [`70e7425`](https://github.com/apache/spark/commit/70e74254acc17ca02ba90c70a7b097b39308ee65).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-808869217
Kubernetes integration test unable to build dist.
exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41194/
[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-772210554
**[Test build #134806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134806/testReport)** for PR 31179 at commit [`fc82c38`](https://github.com/apache/spark/commit/fc82c3885f32a12ca33c96f7c98e330b73558dec).
[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-811786741
**[Test build #136791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136791/testReport)** for PR 31179 at commit [`be37e31`](https://github.com/apache/spark/commit/be37e3153fbf07e81f0536a83b9214063fa9704e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #31179:
URL: https://github.com/apache/spark/pull/31179#discussion_r605370937
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
+ key -> value.get
+ }
+ val partition = catalog.listPartitions(table.identifier, Some(spec))
+ val newTableStats = CommandUtils.mergeNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ val newPartitions = partition.flatten { part =>
Review comment:
> This block seems to be the same with the `overwrite=false` case? https://github.com/apache/spark/pull/31179/files#diff-6309057f8f41f20f8de513ab67d7755aae5fb30d7441fc21000999c9e8e8e0bfR125-R140
Done
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
Review comment:
> Could you move `isPartialPartitions` into L102?
Done
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
+ key -> value.get
+ }
+ val partition = catalog.listPartitions(table.identifier, Some(spec))
+ val newTableStats = CommandUtils.mergeNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ val newPartitions = partition.flatten { part =>
+ val newStates = if (part.stats.isDefined && part.stats.get.rowCount.isDefined) {
+ CommandUtils.mergeNewStats(
+ part.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ } else {
+ CommandUtils.compareAndGetNewStats(
+ part.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ }
+ newStates.map(_ => part.copy(stats = newStates))
+ }
+ if (newTableStats.isDefined) {
+ catalog.alterTableStats(table.identifier, newTableStats)
+ }
+ if (newPartitions.nonEmpty) {
+ catalog.alterPartitions(table.identifier, newPartitions)
+ }
+ } else {
+ // update all partitions statistics
+ val partitions = statsTracker.partitionsStats.map { case (part, stats) =>
+ val partition = catalog.getPartition(table.identifier, part)
+ val newStats = Some(CatalogStatistics(
+ sizeInBytes = stats.numBytes, rowCount = Some(stats.numRows)))
+ partition.copy(stats = newStats)
+ }.toSeq
+ if (partitions.nonEmpty) {
+ catalog.alterPartitions(table.identifier, partitions)
+ }
+
+ if (isPartialPartitions) {
+ val newStats = CommandUtils.mergeNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ if (newStats.isDefined) {
+ catalog.alterTableStats(table.identifier, newStats)
+ }
+ } else {
+ val newStats = CommandUtils.compareAndGetNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ if (newStats.isDefined) {
+ catalog.alterTableStats(table.identifier, newStats)
+ }
+ }
+ }
+ } else {
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
Review comment:
> `val spec = partitionSpec.mapValues(_.get)`
Done
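A small sketch of the conversion under discussion, in case it helps future readers. The object and method names are hypothetical. As far as I know, `mapValues` on a `Map` returns a lazy, non-serializable view in Scala 2.12 and is deprecated in Scala 2.13 in favor of `.view.mapValues(...).toMap`, so the sketch uses strict `map` and `collect`, which behave the same on both versions.

```scala
// Hypothetical helper; not part of the PR.
object SpecConversion {
  // Strict conversion. Throws NoSuchElementException if any value is None,
  // so it is only safe after checking that every value is defined.
  def toStaticSpec(spec: Map[String, Option[String]]): Map[String, String] =
    spec.map { case (key, value) => key -> value.get }

  // Total alternative: keeps only the columns that carry a static value.
  def staticColumns(spec: Map[String, Option[String]]): Map[String, String] =
    spec.collect { case (key, Some(value)) => key -> value }
}
```

`toStaticSpec` matches the behavior of the code in the diff; `staticColumns` is an alternative that cannot throw, at the cost of silently dropping dynamic columns.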
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
Review comment:
> `val spec = partitionSpec.mapValues(_.get)`
Done
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -199,6 +312,17 @@ object CommandUtils extends Logging {
newStats
}
+ def mergeNewStats(
Review comment:
> private
Done
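For context, a rough sketch of the difference between merging new write metrics into existing statistics (append) and replacing them (overwrite). This is an assumption about the intended semantics, not the PR's actual `mergeNewStats` implementation; `Stats` is a hypothetical stand-in for `CatalogStatistics`, reduced to the two fields discussed here.

```scala
// Hypothetical stand-in for CatalogStatistics with only the fields under discussion.
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

object StatsSketch {
  // Append: add the written bytes and rows to the recorded totals.
  // Without previous stats the new total is unknown, so nothing is returned;
  // an unknown previous row count stays unknown.
  def append(old: Option[Stats], writtenBytes: BigInt, writtenRows: BigInt): Option[Stats] =
    old.map { s =>
      Stats(s.sizeInBytes + writtenBytes, s.rowCount.map(_ + writtenRows))
    }

  // Overwrite: the write metrics describe the complete new contents.
  def overwrite(writtenBytes: BigInt, writtenRows: BigInt): Option[Stats] =
    Some(Stats(writtenBytes, Some(writtenRows)))
}
```

Under these assumptions an append can only refine statistics that already exist, while an overwrite can always produce a full set.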
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
+ key -> value.get
+ }
+ val partition = catalog.listPartitions(table.identifier, Some(spec))
+ val newTableStats = CommandUtils.mergeNewStats(
+ newTable.stats, statsTracker.totalNumBytes, Some(statsTracker.totalNumOutput))
+ val newPartitions = partition.flatten { part =>
+ val newStates = if (part.stats.isDefined && part.stats.get.rowCount.isDefined) {
+ CommandUtils.mergeNewStats(
Review comment:
> ditto
Done
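As a side note on the hunk above, the two predicates at the top of `updateTableAndPartitionStats` can be tried in isolation. The sketch below copies their logic as shown in the diff; the wrapper object `PartitionSpecKind` and the sample specs are hypothetical.

```scala
object PartitionSpecKind {
  type Spec = Map[String, Option[String]]

  // True when every partition column has a static value,
  // e.g. PARTITION (dt='2021-01-14', hr='08').
  def isSinglePartition(spec: Spec): Boolean =
    spec.nonEmpty && spec.values.forall(_.nonEmpty)

  // True when the spec mixes static and dynamic columns,
  // e.g. PARTITION (dt='2021-01-14', hr).
  def isPartialPartitions(spec: Spec): Boolean =
    spec.nonEmpty && spec.values.exists(_.isEmpty) && spec.values.exists(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val static: Spec = Map("dt" -> Some("2021-01-14"), "hr" -> Some("08"))
    val mixed: Spec = Map("dt" -> Some("2021-01-14"), "hr" -> None)
    val dynamic: Spec = Map("dt" -> None, "hr" -> None)
    for (spec <- Seq(static, mixed, dynamic)) {
      println(s"$spec single=${isSinglePartition(spec)} partial=${isPartialPartitions(spec)}")
    }
  }
}
```

A fully dynamic spec and an empty spec both return false for both predicates, so they fall through to whatever branch handles the remaining cases.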
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
##########
@@ -51,6 +51,119 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
object CommandUtils extends Logging {
+ def updateTableAndPartitionStats(
+ sparkSession: SparkSession,
+ table: CatalogTable,
+ overwrite: Boolean,
+ partitionSpec: Map[String, Option[String]],
+ statsTracker: BasicWriteJobStatsTracker): Unit = {
+ val catalog = sparkSession.sessionState.catalog
+ if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+ val newTable = catalog.getTableMetadata(table.identifier)
+ val isSinglePartition = partitionSpec.nonEmpty && partitionSpec.values.forall(_.nonEmpty)
+ val isPartialPartitions = partitionSpec.nonEmpty &&
+ partitionSpec.values.exists(_.isEmpty) && partitionSpec.values.exists(_.nonEmpty)
+ if (overwrite) {
+ // Only update one partition, statsTracker.partitionsStats is empty
+ if (isSinglePartition) {
+ val spec = partitionSpec.map { case (key, value) =>
+ key -> value.get
+ }
+ val partition = catalog.listPartitions(table.identifier, Some(spec))
+ val newTableStats = CommandUtils.mergeNewStats(
Review comment:
> `CommandUtils.mergeNewStats` -> `mergeNewStats`
Done
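To illustrate the two `mergeNewStats` comments (make it `private`, drop the `CommandUtils.` prefix): inside an `object`, a private helper can be called unqualified from sibling methods. The sketch below is hypothetical. `StatsUtilsSketch`, `SimpleStats`, and `afterInsert` are made-up names, and the additive merge is an assumption for illustration only, not the body of the PR's `mergeNewStats`.

```scala
object StatsUtilsSketch {
  // Simplified stand-in for catalog statistics (hypothetical type).
  final case class SimpleStats(sizeInBytes: BigInt, rowCount: Option[BigInt])

  // Public entry point: calls the helper without an object prefix.
  def afterInsert(old: Option[SimpleStats], bytes: BigInt, rows: BigInt): SimpleStats =
    mergeNewStats(old, bytes, Some(rows))

  // Private helper, visible only inside this object.
  // Assumed behavior: sizes add up; the row count is kept only when both sides know it.
  private def mergeNewStats(
      old: Option[SimpleStats],
      newBytes: BigInt,
      newRows: Option[BigInt]): SimpleStats = old match {
    case Some(s) =>
      val rows = for (a <- s.rowCount; b <- newRows) yield a + b
      SimpleStats(s.sizeInBytes + newBytes, rows)
    case None =>
      SimpleStats(newBytes, newRows)
  }
}
```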