Posted to issues@spark.apache.org by "Zhenhua Wang (JIRA)" <ji...@apache.org> on 2017/06/09 06:14:18 UTC
[jira] [Created] (SPARK-21031) Clearly separate hive stats and spark stats in catalog
Zhenhua Wang created SPARK-21031:
------------------------------------
Summary: Clearly separate hive stats and spark stats in catalog
Key: SPARK-21031
URL: https://issues.apache.org/jira/browse/SPARK-21031
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang
Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are also persisted through `CatalogStatistics`. Therefore, given a `CatalogStatistics`, we cannot tell whether its stats came from Hive or from Spark. As a result, Hive's stats can be unexpectedly propagated into Spark's stats.
For example, when an "ALTER TABLE" command runs, the stats info held in `CatalogStatistics` (read from Hive, e.g. "totalSize") is stored into the metastore as Spark's stats, because we don't know whether it came from Spark or not. But Spark's stats should only be generated by the "ANALYZE" command, so this is an unexpected side effect of "ALTER TABLE".
Besides, once the wrong Spark stats are stored, they are not corrected by later inserts: although Hive updates "totalSize" in the metastore after new data is inserted, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we prefer the (wrong) Spark stats over Hive's stats.
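The propagation described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's actual code: the metastore is modeled as a dictionary of table properties, the Spark-side key name `spark.sql.statistics.totalSize` is an assumption, and the helper functions (`read_size_in_bytes`, `alter_table_set_properties`, `hive_insert`, `replay_session`) are hypothetical names introduced for illustration.

```python
# Toy model of metastore table properties; NOT Spark's actual implementation.
HIVE_TOTAL_SIZE = "totalSize"                        # maintained by Hive
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"  # assumed key for Spark's own stats


def read_size_in_bytes(props):
    """Read the table size, preferring Spark's stats over Hive's."""
    if SPARK_TOTAL_SIZE in props:
        return int(props[SPARK_TOTAL_SIZE])
    if HIVE_TOTAL_SIZE in props:
        return int(props[HIVE_TOTAL_SIZE])
    return None


def alter_table_set_properties(props, new_props):
    """Problematic ALTER TABLE: writes back whatever size was read as Spark's stats."""
    size = read_size_in_bytes(props)  # may actually have come from Hive
    updated = dict(props)
    updated.update(new_props)
    if size is not None:
        updated[SPARK_TOTAL_SIZE] = str(size)  # the origin of the value is lost here
    return updated


def hive_insert(props, added_bytes):
    """Hive only updates its own totalSize after an insert."""
    updated = dict(props)
    updated[HIVE_TOTAL_SIZE] = str(int(props.get(HIVE_TOTAL_SIZE, "0")) + added_bytes)
    return updated


def replay_session():
    """Replay: insert, ALTER TABLE, insert. Returns (size before, size after, Hive's size)."""
    props = hive_insert({}, 4)                       # first insert: 4 bytes
    before = read_size_in_bytes(props)
    props = alter_table_set_properties(props, {"prop1": "yy"})
    props = hive_insert(props, 4)                    # second insert: 8 bytes in total
    return before, read_size_in_bytes(props), int(props[HIVE_TOTAL_SIZE])


if __name__ == "__main__":
    print(replay_session())
```

In this model the size reported after the second insert stays at 4 even though Hive's own "totalSize" has moved to 8, which mirrors the stale "Statistics" value in the session for this issue.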
{code}
spark-sql> create table xx(i string, j string);
spark-sql> insert into table xx select 'a', 'b';
spark-sql> desc formatted xx;
# col_name data_type comment
i string NULL
j string NULL
# Detailed Table Information
Database default
Table xx
Owner wzh
Created Thu Jun 08 18:30:46 PDT 2017
Last Access Wed Dec 31 16:00:00 PST 1969
Type MANAGED
Provider hive
Properties [serialization.format=1]
Statistics 4 bytes
Location file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.TextInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider Catalog
Time taken: 0.089 seconds, Fetched 19 row(s)
spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
Time taken: 0.187 seconds
spark-sql> insert into table xx select 'c', 'd';
Time taken: 0.583 seconds
spark-sql> desc formatted xx;
# col_name data_type comment
i string NULL
j string NULL
# Detailed Table Information
Database default
Table xx
Owner wzh
Created Thu Jun 08 18:30:46 PDT 2017
Last Access Wed Dec 31 16:00:00 PST 1969
Type MANAGED
Provider hive
Properties [serialization.format=1]
Statistics 4 bytes (-- This should be 8 bytes)
Location file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.TextInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider Catalog
Time taken: 0.077 seconds, Fetched 19 row(s)
{code}
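One way to picture the separation this issue asks for is to record where a statistic came from and to persist it as Spark's stats only when it really is Spark's. The following is a hedged sketch in plain Python under stated assumptions, not the change actually made in Spark: the `TableStats` class, its `from_spark` flag, the helper functions, and the key name `spark.sql.statistics.totalSize` are all hypothetical or assumed for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

HIVE_TOTAL_SIZE = "totalSize"                        # maintained by Hive
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"  # assumed key for Spark's own stats


@dataclass(frozen=True)
class TableStats:
    size_in_bytes: int
    from_spark: bool  # hypothetical flag: True only for stats written by Spark


def read_stats(props: Dict[str, str]) -> Optional[TableStats]:
    """Read stats and remember which system produced them."""
    if SPARK_TOTAL_SIZE in props:
        return TableStats(int(props[SPARK_TOTAL_SIZE]), from_spark=True)
    if HIVE_TOTAL_SIZE in props:
        return TableStats(int(props[HIVE_TOTAL_SIZE]), from_spark=False)
    return None


def alter_table_set_properties(props: Dict[str, str],
                               new_props: Dict[str, str]) -> Dict[str, str]:
    """ALTER TABLE that never turns Hive's stats into Spark's stats."""
    stats = read_stats(props)
    updated = {**props, **new_props}
    if stats is not None and stats.from_spark:
        updated[SPARK_TOTAL_SIZE] = str(stats.size_in_bytes)
    return updated


def replay_with_separation() -> Optional[TableStats]:
    """Replay: 4 bytes inserted, ALTER TABLE, then Hive reports 8 bytes."""
    props = {HIVE_TOTAL_SIZE: "4"}
    props = alter_table_set_properties(props, {"prop1": "yy"})
    props[HIVE_TOTAL_SIZE] = "8"  # Hive updates its own stat after the second insert
    return read_stats(props)


if __name__ == "__main__":
    print(replay_with_separation())
```

With the origin tracked, "ALTER TABLE" leaves no Spark-side entry behind in this model, so the later read falls through to Hive's updated "totalSize".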
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)