Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/29 07:13:51 UTC

[GitHub] [spark] HyukjinKwon commented on a change in pull request #26016: [SPARK-24914][SQL] New statistic to improve data size estimate for columnar storage formats

URL: https://github.com/apache/spark/pull/26016#discussion_r339918697
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -1331,6 +1331,29 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val DESERIALIZATION_FACTOR_CALC_ENABLED =
+    buildConf("spark.sql.statistics.deserFactor.calc.enabled")
+      .doc("Enables the calculation of the deserialization factor as a table statistic. " +
+        "This factor is intended to be calculated for columnar storage formats as a ratio of " +
+        "actual data size to raw file size but currently Spark calculates this only for the ORC " +
+        "format. Spark uses this ratio is to scale up the estimated size, which leads to " +
+        "better estimate of in-memory data size and improves the query optimization (i.e., join " +
+        "strategy). In case of partitioned table the maximum of these factors is taken. " +
+        "Spark stores this factor in the meta store and reuses it so the table " +
+        "can grow without having to recompute this statistic. " +
+        "The stored factor can be removed only by a TRUNCATE or a DROP table so even a " +
+        "subsequent ANALYZE TABLE where the calculation is disabled keeps the old value.")
+      .booleanConf
+      .createWithDefault(false)
+
+  val DESERIALIZATION_FACTOR_EXTRA_DISTORTION =
+    buildConf("spark.sql.statistics.deserFactor.distortion")
 
 Review comment:
   Hm .. so .. does this PR propose the following?
   
   1. get the file size in ORC and use it as stats
   2. one configuration to control the ratio applied to that size
   
   If that's correct, it looks like we should split this into separate PRs.
   Also, do we have any perf numbers? I have some doubts per https://github.com/apache/spark/pull/26016#discussion_r338386395 . It reminds me of `mergeSchema` in Parquet, which brings a considerable performance penalty.
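   
   To make the mechanism concrete, here is a minimal, hypothetical Scala sketch of the scaling the doc string above describes: derive the factor as deserialized size over raw file size, then reuse it to scale later file-size estimates. All names (`DeserFactorSketch`, `deserFactor`, `estimateInMemorySize`) and the numbers are made up for illustration and are not the PR's actual code:
   
   ```scala
   // Hypothetical sketch of the deserialization-factor scaling; not the PR's code.
   object DeserFactorSketch {
     // Ratio of deserialized (in-memory) size to raw (on-disk) file size.
     def deserFactor(deserializedBytes: Long, rawFileBytes: Long): Double =
       if (rawFileBytes > 0) deserializedBytes.toDouble / rawFileBytes else 1.0
   
     // Scale a raw file-size estimate up to an in-memory size estimate.
     def estimateInMemorySize(rawFileBytes: Long, factor: Double): Long =
       math.ceil(rawFileBytes * factor).toLong
   
     def main(args: Array[String]): Unit = {
       val mb = 1024L * 1024
       // Say 100 MB of ORC deserializes to 400 MB in memory: factor = 4.0.
       val factor = deserFactor(400 * mb, 100 * mb)
       // Later the table grows to 250 MB on disk; reusing the stored factor
       // gives a ~1000 MB in-memory estimate without re-analyzing the table.
       println(estimateInMemorySize(250 * mb, factor))
     }
   }
   ```
   
   For the Parquet analogy: `mergeSchema` is enabled per read with `spark.read.option("mergeSchema", "true").parquet(path)`, and it has to reconcile the schemas of many files, which is why it is off by default and why the perf question here matters.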

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org