
Posted to issues@spark.apache.org by "Zhenhua Wang (JIRA)" <ji...@apache.org> on 2017/05/28 17:08:04 UTC

[jira] [Updated] (SPARK-20881) Clearly document the mechanism to choose between two sources of statistics

     [ https://issues.apache.org/jira/browse/SPARK-20881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-20881:
---------------------------------
    Summary: Clearly document the mechanism to choose between two sources of statistics  (was: Use Hive's stats in metastore when cbo is disabled)

> Clearly document the mechanism to choose between two sources of statistics
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20881
>                 URL: https://issues.apache.org/jira/browse/SPARK-20881
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Zhenhua Wang
>
> Currently, statistics are generated by the "analyze command" in Spark.
> However, when a user updates the table and collects stats in Hive, "totalSize"/"numRows" are updated in the metastore.
> At that point, the table stats on the Spark side become stale.
> If CBO is enabled, this is acceptable, because we assume the user will handle it and re-run the command to update the stats.
> If CBO is disabled, we should fall back to the original behavior and respect Hive's stats. But in the current implementation, Spark's stats always override Hive's stats, regardless of whether CBO is enabled or disabled.
> The right thing to do is to use (not override) Hive's stats when CBO is disabled.
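
The selection rule proposed in the description can be sketched as a small function. This is only an illustration of the proposal, not Spark's actual code: the names TableStats and choose_stats are hypothetical, and falling back to the other source when the preferred one is missing is an assumption the description does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableStats:
    """Hypothetical container for table-level statistics."""
    size_in_bytes: int
    row_count: Optional[int] = None


def choose_stats(cbo_enabled: bool,
                 spark_stats: Optional[TableStats],
                 hive_stats: Optional[TableStats]) -> Optional[TableStats]:
    """Pick which source of statistics to trust.

    CBO enabled  -> prefer the stats collected by Spark's analyze command.
    CBO disabled -> prefer the stats Hive wrote to the metastore.
    Assumption (not in the issue text): if the preferred source is
    missing, use the other one rather than returning nothing.
    """
    if cbo_enabled:
        preferred, fallback = spark_stats, hive_stats
    else:
        preferred, fallback = hive_stats, spark_stats
    return preferred if preferred is not None else fallback


# Example: Spark's stats are stale, Hive's were refreshed after an update.
stale_spark = TableStats(size_in_bytes=1_000, row_count=10)
fresh_hive = TableStats(size_in_bytes=5_000, row_count=50)

print(choose_stats(True, stale_spark, fresh_hive))   # Spark's stats
print(choose_stats(False, stale_spark, fresh_hive))  # Hive's stats
```

With CBO enabled the stale Spark stats are still chosen, which matches the description's expectation that the user re-runs the analyze command in that case.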



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
