Posted to issues@spark.apache.org by "Zhenhua Wang (JIRA)" <ji...@apache.org> on 2017/07/08 02:29:00 UTC

[jira] [Updated] (SPARK-21083) Consider staleness when collecting column stats

     [ https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-21083:
---------------------------------
    Description: 
1. If we first analyze a table without `noscan` and then analyze it with `noscan`, and the table has not changed in between, we should keep the row count in the statistics.
2. If we first analyze one column of a table and then analyze another column, and the table has not changed in between, we should keep the previous column stats and combine them with the newly collected column stats.
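
The two cases above can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's implementation: `TableStats`, `analyze_noscan` and `merge_col_stats` are hypothetical names, and "the table is not changed" is approximated by an unchanged size in bytes.

```python
from typing import Dict, NamedTuple, Optional


class TableStats(NamedTuple):
    # Hypothetical record, not Spark's catalog class.
    size_in_bytes: int
    row_count: Optional[int]
    col_stats: Dict[str, dict]


def analyze_noscan(old: Optional[TableStats], new_size: int) -> TableStats:
    # A noscan run only measures size. Assumption: an equal size means the
    # table is unchanged, so the earlier row count and column stats stay valid.
    if old is not None and old.size_in_bytes == new_size:
        return old
    return TableStats(new_size, None, {})


def merge_col_stats(old_cols: Dict[str, dict], new_cols: Dict[str, dict]) -> Dict[str, dict]:
    # Table unchanged: keep earlier columns, let freshly collected ones win.
    return {**old_cols, **new_cols}


full = TableStats(1024, 10, {"a": {"distinct_count": 7}})
after_noscan = analyze_noscan(full, 1024)
merged = merge_col_stats(after_noscan.col_stats, {"b": {"distinct_count": 3}})
```

Here `after_noscan` still carries the row count from the earlier full analysis, and `merged` holds stats for both column `a` and column `b`.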

  was:
Suppose we have already collected column stats for some columns. Then, when we collect column stats for other columns:
* If the table has changed between the two collection runs, we need to remove the stale column stats and keep only the latest stats.
* Otherwise, we combine the two sets of column stats.

Note that we always update sizeInBytes/rowCount when collecting column stats; that logic does not need to change.
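
A minimal sketch of this staleness rule, in Python for illustration only. `Stats` and `collect_column_stats` are hypothetical names, and the check used here (a differing size or row count means the table was changed) is an assumption, not something the issue text specifies.

```python
from typing import Dict, NamedTuple, Optional


class Stats(NamedTuple):
    # Hypothetical record; fields mirror sizeInBytes/rowCount from the text.
    size_in_bytes: int
    row_count: int
    col_stats: Dict[str, dict]


def collect_column_stats(old: Optional[Stats], size_in_bytes: int, row_count: int,
                         new_cols: Dict[str, dict]) -> Stats:
    # Assumption: a differing size or row count means the table was changed.
    stale = old is None or (old.size_in_bytes, old.row_count) != (size_in_bytes, row_count)
    # Stale column stats are dropped; otherwise they are carried over.
    cols = {} if stale else dict(old.col_stats)
    cols.update(new_cols)
    # sizeInBytes/rowCount are always refreshed with the latest values.
    return Stats(size_in_bytes, row_count, cols)


before = Stats(1024, 10, {"a": {"distinct_count": 7}})
same = collect_column_stats(before, 1024, 10, {"b": {"distinct_count": 3}})
changed = collect_column_stats(before, 2048, 20, {"b": {"distinct_count": 3}})
```

With an unchanged table, `same` keeps the stats for `a` and adds `b`; with a changed table, `changed` keeps only the freshly collected stats for `b`.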


> Consider staleness when collecting column stats
> -----------------------------------------------
>
>                 Key: SPARK-21083
>                 URL: https://issues.apache.org/jira/browse/SPARK-21083
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Zhenhua Wang
>
> 1. If we first analyze a table without `noscan` and then analyze it with `noscan`, and the table has not changed in between, we should keep the row count in the statistics.
> 2. If we first analyze one column of a table and then analyze another column, and the table has not changed in between, we should keep the previous column stats and combine them with the newly collected column stats.


