Posted to issues@spark.apache.org by "Yuming Wang (JIRA)" <ji...@apache.org> on 2019/04/29 10:08:00 UTC

[jira] [Updated] (SPARK-27592) Write the data of table write information to metadata

     [ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-27592:
--------------------------------
    Description: 
The metadata Spark persists tells Hive to use an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read a bucketed table created with Spark's Parquet data source:
{noformat}
spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
spark-sql> DESC EXTENDED t;
c1 int NULL
c2 int NULL
# Detailed Table Information
Database default
Table t
Owner yumwang
Created Time Mon Apr 29 17:52:05 CST 2019
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.4.0
Type MANAGED
Provider parquet
Num Buckets 2
Bucket Columns [`c1`]
Sort Columns [`c1`]
Table Properties [transient_lastDdlTime=1556531525]
Location file:/user/hive/warehouse/t
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties [serialization.format=1]
{noformat}
Spark warns about the incompatibility when the table is created:
{noformat}
WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
{noformat}
But downstream engines never see this warning, so they do not know the table is incompatible. I'd like to write this table's write information to the metadata so that each engine can decide compatibility for itself.
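
To make the mismatch concrete, here is a minimal sketch of the kind of check a downstream engine could run against the fields shown in the DESC EXTENDED output above. It is only an illustration: the function name, the dictionary layout, and the provider-to-InputFormat mapping are assumptions made for this example, not an existing Spark or Hive API, and this issue does not define what the written metadata would look like.

```python
# Hypothetical sketch; nothing here is a Spark or Hive API.
# Assumed mapping from a Spark data source provider to the InputFormat
# class a Hive reader would need for files of that format.
EXPECTED_INPUT_FORMAT = {
    "parquet": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "orc": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
}


def input_format_matches_provider(table):
    """Return True if the recorded InputFormat agrees with the provider.

    `table` is a plain dict holding the "Provider" and "InputFormat"
    values as printed by DESC EXTENDED (assumed layout).
    """
    provider = table.get("Provider")
    if provider is None:
        # No provider recorded: nothing to contradict (assumption).
        return True
    expected = EXPECTED_INPUT_FORMAT.get(provider.lower())
    # Unknown providers are treated as mismatches to stay on the safe side.
    return expected is not None and table.get("InputFormat") == expected


# Values copied from the DESC EXTENDED output in this issue.
table_t = {
    "Provider": "parquet",
    "InputFormat": "org.apache.hadoop.mapred.SequenceFileInputFormat",
}

print(input_format_matches_provider(table_t))
```

For the table in this report the check fails, because the provider is parquet while the recorded InputFormat is the SequenceFile one. A matching InputFormat would still not prove that Hive can read a bucketed table correctly, which is why explicit write information in the metadata would be more reliable than inferring compatibility this way.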

> Write the data of table write information to metadata
> -----------------------------------------------------
>
>                 Key: SPARK-27592
>                 URL: https://issues.apache.org/jira/browse/SPARK-27592
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major


