Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2019/08/15 13:00:00 UTC
[jira] [Resolved] (SPARK-27592) Set the bucketed data source table SerDe correctly
[ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-27592.
---------------------------------
Resolution: Fixed
Fix Version/s: 3.0.0
Issue resolved by pull request 24486
[https://github.com/apache/spark/pull/24486]
> Set the bucketed data source table SerDe correctly
> --------------------------------------------------
>
> Key: SPARK-27592
> URL: https://issues.apache.org/jira/browse/SPARK-27592
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Fix For: 3.0.0
>
>
> We hint Hive to use an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's bucketed Parquet data source table:
> {noformat}
> spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
> 2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
> spark-sql> DESC EXTENDED t;
> c1 int NULL
> c2 int NULL
> # Detailed Table Information
> Database default
> Table t
> Owner yumwang
> Created Time Mon Apr 29 17:52:05 CST 2019
> Last Access Thu Jan 01 08:00:00 CST 1970
> Created By Spark 2.4.0
> Type MANAGED
> Provider parquet
> Num Buckets 2
> Bucket Columns [`c1`]
> Sort Columns [`c1`]
> Table Properties [transient_lastDdlTime=1556531525]
> Location file:/user/hive/warehouse/t
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Storage Properties [serialization.format=1]
> {noformat}
> We can see incompatible information when creating the table:
> {noformat}
> WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
> {noformat}
> But downstream engines don't know about this incompatibility. I'd like to write this table's storage information to the metadata so that each engine can decide the compatibility itself.
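The idea above can be sketched as a lookup from a data source provider to its Hive-compatible SerDe triple, modeled loosely on Spark's HiveSerDe.sourceToSerDe. This is an illustration of the direction, not the actual patch in PR 24486; the case class and function names here are hypothetical:

```scala
// Hedged sketch: map a Spark data source provider to the Hive SerDe classes
// that Hive would need to read the same files. Providers with no Hive
// equivalent return None, signalling that the table is not Hive-compatible.
case class HiveSerDeInfo(inputFormat: String, outputFormat: String, serde: String)

def serdeFor(provider: String): Option[HiveSerDeInfo] =
  provider.toLowerCase match {
    case "parquet" => Some(HiveSerDeInfo(
      "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
      "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
      "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"))
    case "orc" => Some(HiveSerDeInfo(
      "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
      "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
      "org.apache.hadoop.hive.ql.io.orc.OrcSerde"))
    case _ => None // e.g. csv, json: no direct Hive SerDe mapping here
  }
```

With such a mapping persisted in the table metadata, Hive (or any other engine reading the metastore) would see the real Parquet SerDe instead of the SequenceFile placeholder shown in the DESC EXTENDED output above.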
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)