Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2019/10/28 19:44:00 UTC

[jira] [Commented] (SPARK-27592) Set the bucketed data source table SerDe correctly

    [ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961404#comment-16961404 ] 

Dongjoon Hyun commented on SPARK-27592:
---------------------------------------

Ping [~Patnaik] since you asked about this on SPARK-29234. Originally, this was registered as an improvement, so we don't backport it to the older branches. However, given the situation, I'm also fine with [~yumwang] backporting it, since he is the author.

BTW, please don't expect a backport to EOL branches like `branch-2.3`.
 - [https://spark.apache.org/versioning-policy.html]
{quote}Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.3.x is no longer considered maintained as of September 2019, 18 months after the release of 2.3.0 in February 2018. No more 2.3.x releases should be expected after that point, even for bug fixes.
{quote}

> Set the bucketed data source table SerDe correctly
> --------------------------------------------------
>
>                 Key: SPARK-27592
>                 URL: https://issues.apache.org/jira/browse/SPARK-27592
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.0.0
>
>
> We hint Hive to use an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's Parquet data source bucketed table:
> {noformat}
> spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
>  2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
>  spark-sql> DESC EXTENDED t;
>  c1 int NULL
>  c2 int NULL
>  # Detailed Table Information
>  Database default
>  Table t
>  Owner yumwang
>  Created Time Mon Apr 29 17:52:05 CST 2019
>  Last Access Thu Jan 01 08:00:00 CST 1970
>  Created By Spark 2.4.0
>  Type MANAGED
>  Provider parquet
>  Num Buckets 2
>  Bucket Columns [`c1`]
>  Sort Columns [`c1`]
>  Table Properties [transient_lastDdlTime=1556531525]
>  Location file:/user/hive/warehouse/t
>  Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
>  OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  Storage Properties [serialization.format=1]
> {noformat}
>  We can see the incompatibility warning when creating the table:
> {noformat}
> WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
> {noformat}
>  But downstream engines don't know about this incompatibility. I'd like to write this table's write information to the metastore metadata so that each engine can decide compatibility for itself.
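
The proposal above amounts to persisting a storage descriptor that matches the data source provider instead of the SequenceFile default. A minimal sketch of that idea, in Python for illustration only (the lookup table and `storage_for` helper are hypothetical, not Spark's actual implementation; the Hive class names themselves are the real ones for Parquet and ORC):

```python
# Hypothetical sketch: map a Spark data source provider to the Hive
# SerDe/InputFormat/OutputFormat that can actually read its files,
# rather than persisting the incompatible SequenceFile default.
HIVE_SERDE_BY_PROVIDER = {
    "parquet": {
        "serde": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
        "inputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    },
    "orc": {
        "serde": "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
        "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    },
}

def storage_for(provider: str) -> dict:
    """Return the Hive storage descriptor for a provider, falling back to
    the SequenceFile default that Spark currently persists (the behavior
    this issue reports)."""
    return HIVE_SERDE_BY_PROVIDER.get(provider.lower(), {
        "serde": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        "inputFormat": "org.apache.hadoop.mapred.SequenceFileInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat",
    })
```

With a table like this, `DESC EXTENDED t` on a Parquet bucketed table would report MapredParquetInputFormat instead of SequenceFileInputFormat, and a downstream engine reading the metastore entry could decide compatibility on its own.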



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org