You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "liupengcheng (Jira)" <ji...@apache.org> on 2019/12/31 03:54:00 UTC
[jira] [Created] (SPARK-30394) Skip collecting stats in
DetermineTableStats rule when hive table is convertible to datasource
tables
liupengcheng created SPARK-30394:
------------------------------------
Summary: Skip collecting stats in DetermineTableStats rule when hive table is convertible to datasource tables
Key: SPARK-30394
URL: https://issues.apache.org/jira/browse/SPARK-30394
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.2, 3.0.0
Reporter: liupengcheng
Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, then spark will scan hdfs files to collect table stats in `DetermineTableStats` rule. But this can be expensive in some cases, acutually we can skip this if this hive table can be converted to datasource table(parquet etc.).
Before[SPARK-28573|https://issues.apache.org/jira/browse/SPARK-28573], the implementaion will update the CatalogTableStatistics, which will cause the improper stats be used in joinSelection when the hive table can be convert to datasource table.
In our production environment, user's highly compressed parquet table can cause OOMs when doing `broadcastHashJoin` due to this improper stats.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org