Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2017/06/12 00:25:20 UTC
[jira] [Resolved] (SPARK-21031) Add `alterTableStats` to store
spark's stats and let `alterTable` keep existing stats
[ https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-21031.
---------------------------------
Resolution: Fixed
Fix Version/s: 2.3.0
Issue resolved by pull request 18248
[https://github.com/apache/spark/pull/18248]
> Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats
> -------------------------------------------------------------------------------------
>
> Key: SPARK-21031
> URL: https://issues.apache.org/jira/browse/SPARK-21031
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Zhenhua Wang
> Fix For: 2.3.0
>
>
> Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are also persisted through `CatalogStatistics`. As a result, Hive's stats can be unexpectedly propagated into Spark's stats.
> For example, for a catalog table, we read stats from Hive (e.g. "totalSize") and put them into `CatalogStatistics`. Then an "ALTER TABLE" command stores the stats held in `CatalogStatistics` into the metastore as Spark's stats, because we cannot tell whether they came from Spark or not. But Spark's stats should only be generated by the "ANALYZE" command, so this is an unexpected side effect of "ALTER TABLE".
> Secondly, once Spark's stats are in the metastore, inserting new data no longer gives us the right `sizeInBytes` in `CatalogStatistics`, even though Hive updates "totalSize" in the metastore, because we prefer Spark's stats (which should not exist here) over Hive's stats.
> {code}
> spark-sql> create table xx(i string, j string);
> spark-sql> insert into table xx select 'a', 'b';
> spark-sql> desc formatted xx;
> # col_name data_type comment
> i string NULL
> j string NULL
> # Detailed Table Information
> Database default
> Table xx
> Owner wzh
> Created Thu Jun 08 18:30:46 PDT 2017
> Last Access Wed Dec 31 16:00:00 PST 1969
> Type MANAGED
> Provider hive
> Properties [serialization.format=1]
> Statistics 4 bytes
> Location file:/Users/wzh/Projects/spark/spark-warehouse/xx
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider Catalog
> Time taken: 0.089 seconds, Fetched 19 row(s)
> spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
> Time taken: 0.187 seconds
> spark-sql> insert into table xx select 'c', 'd';
> Time taken: 0.583 seconds
> spark-sql> desc formatted xx;
> # col_name data_type comment
> i string NULL
> j string NULL
> # Detailed Table Information
> Database default
> Table xx
> Owner wzh
> Created Thu Jun 08 18:30:46 PDT 2017
> Last Access Wed Dec 31 16:00:00 PST 1969
> Type MANAGED
> Provider hive
> Properties [serialization.format=1]
> Statistics 4 bytes (-- This should be 8 bytes)
> Location file:/Users/wzh/Projects/spark/spark-warehouse/xx
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider Catalog
> Time taken: 0.077 seconds, Fetched 19 row(s)
> {code}
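
The session above can be replayed against a toy model to see why the size stays at 4 bytes. The sketch below is not Spark's code: the `Metastore` class, the `spark.stats.totalSize` key, and all function names are assumptions made up for illustration. Only the names `alterTable` and `alterTableStats` and the "prefer Spark's stats over Hive's" rule come from the issue text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical property keys, chosen for illustration only.
HIVE_TOTAL_SIZE = "totalSize"
SPARK_TOTAL_SIZE = "spark.stats.totalSize"


@dataclass
class Metastore:
    """Stand-in for the metastore: one flat property map for a single table."""
    props: Dict[str, str] = field(default_factory=dict)

    def hive_insert(self, nbytes: int) -> None:
        # Hive keeps its own "totalSize" up to date on every insert.
        current = int(self.props.get(HIVE_TOTAL_SIZE, "0"))
        self.props[HIVE_TOTAL_SIZE] = str(current + nbytes)


def read_size(ms: Metastore) -> Optional[int]:
    """Read the table size, preferring Spark's stats over Hive's."""
    for key in (SPARK_TOTAL_SIZE, HIVE_TOTAL_SIZE):
        if key in ms.props:
            return int(ms.props[key])
    return None


def alter_table_old(ms: Metastore, new_props: Dict[str, str]) -> None:
    """Reported behaviour: the size that was read is written back as Spark's stats."""
    size = read_size(ms)
    ms.props.update(new_props)
    if size is not None:
        ms.props[SPARK_TOTAL_SIZE] = str(size)


def alter_table_new(ms: Metastore, new_props: Dict[str, str]) -> None:
    """Proposed behaviour: alterTable leaves whatever stats already exist alone."""
    ms.props.update(new_props)


def alter_table_stats(ms: Metastore, size: Optional[int]) -> None:
    """Proposed behaviour: the only path that writes Spark's stats (e.g. from ANALYZE)."""
    if size is None:
        ms.props.pop(SPARK_TOTAL_SIZE, None)
    else:
        ms.props[SPARK_TOTAL_SIZE] = str(size)


def replay(alter_table: Callable[[Metastore, Dict[str, str]], None]) -> Optional[int]:
    """Replay the session: insert 4 bytes, ALTER TABLE, insert 4 more bytes."""
    ms = Metastore()
    ms.hive_insert(4)
    alter_table(ms, {"prop1": "yy"})
    ms.hive_insert(4)
    return read_size(ms)


print(replay(alter_table_old))  # 4: the stale Spark entry shadows Hive's totalSize
print(replay(alter_table_new))  # 8: only Hive's totalSize exists, so it is used
```

In this model the stale value appears only because the old `alterTable` path writes a Spark-side entry that later shadows Hive's updated "totalSize". Routing every Spark stats write through a separate `alter_table_stats` call removes that path.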