Posted to issues@spark.apache.org by "Shixiong Zhu (Jira)" <ji...@apache.org> on 2021/03/12 19:50:00 UTC

[jira] [Commented] (SPARK-19256) Hive bucketing write support

    [ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300578#comment-17300578 ] 

Shixiong Zhu commented on SPARK-19256:
--------------------------------------

I read [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing] but didn't find any section about compatibility. I'm curious about both the forward and backward compatibility. For example, 
 * What happens if an old Spark version reads this new format? Does it treat it as a non-bucketed table, or read it as a bucketed table but possibly return different results?
 * How does Spark determine the bucket format when reading an existing table, and how does it decide whether to use the old format or the new format for reads and writes?

> Hive bucketing write support
> ----------------------------
>
>                 Key: SPARK-19256
>                 URL: https://issues.apache.org/jira/browse/SPARK-19256
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0
>            Reporter: Tejas Patil
>            Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. The goal is for Spark to write Hive bucketed tables in a way that is compatible with other compute engines (Hive and Presto).
>  
> Current status of Hive bucketed tables in Spark:
> No read support: Spark reads a Hive bucketed table as a non-bucketed table.
> Wrong behavior when writing Hive ORC and Parquet bucketed tables: Spark writes them as non-bucketed tables (code path: InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> No support for writing Hive non-ORC/Parquet bucketed tables: Spark throws an exception by default (code path: InsertIntoHiveTable). The exception can be disabled by setting `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which again writes the table as non-bucketed. Both write paths are sketched below.
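> 
> A minimal sketch (in Scala; the table and column names are hypothetical) of the two write paths above, assuming a Hive-enabled SparkSession named `spark`:
> 
>   spark.sql("""
>     CREATE TABLE sales (id INT, amount DOUBLE)
>     CLUSTERED BY (id) INTO 8 BUCKETS
>     STORED AS PARQUET
>   """)
>   // Parquet path: InsertIntoHadoopFsRelationCommand -> FileFormatWriter,
>   // which silently writes a non-bucketed layout.
>   spark.sql("INSERT INTO sales VALUES (1, 10.0), (2, 20.0)")
>   // A non-ORC/Parquet table (e.g. STORED AS TEXTFILE) goes through
>   // InsertIntoHiveTable instead and throws by default; these settings
>   // disable the exception but still produce a non-bucketed layout.
>   spark.sql("SET hive.enforce.bucketing=false")
>   spark.sql("SET hive.enforce.sorting=false")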
>  
> Current status of Hive bucketed tables in Hive:
> Hive 3.0.0 and later: supports writing bucketed tables with Hive murmur3hash (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed tables with Hive hivehash (both hash schemes share the bucket-routing convention sketched after this list).
> Hive on Tez: supports zero or multiple files per bucket (https://issues.apache.org/jira/browse/HIVE-14014). More code pointers for the read path: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212].
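> 
> An illustrative Scala sketch (not byte-for-byte compatible with Hive) of the bucket-routing convention both hash schemes share: bucket_id = (hash(bucketing columns) & Int.MaxValue) % numBuckets. The two schemes differ only in the hash function, which Spark must reproduce exactly for compatibility:
> 
>   // Stand-in for the real hash implementations: Hive's hivehash
>   // (Hive 1.x/2.x) and murmur3hash (Hive 3.x).
>   def bucketId(hash: Int, numBuckets: Int): Int =
>     (hash & Int.MaxValue) % numBuckets
> 
>   // Assuming (as in Hive's convention for primitive ints) that hivehash
>   // of an INT is the value itself, a row with id=42 in an 8-bucket table
>   // lands in bucket 2.
>   println(bucketId(42, 8))  // 2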
>  
> Current status of Hive bucketed tables in Presto (presto-sql is considered here):
> Supports writing bucketed tables with Hive murmur3hash and hivehash ([https://github.com/prestosql/presto/pull/1697]).
> Supports zero or multiple files per bucket ([https://github.com/prestosql/presto/pull/822]).
>  
> TL;DR: the goal is to achieve Hive bucketed table compatibility across Spark, Presto, and Hive. With this JIRA, we need to add support for writing Hive bucketed tables with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 2.x.y).
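> 
> A hypothetical sketch of what the write support amounts to: route each row to one of numBuckets files by the bucket id of its bucketing column, so Hive and Presto can later prune buckets and avoid shuffles. `hiveCompatibleHash` is a placeholder for hivehash or murmur3hash:
> 
>   // Groups rows by bucket id; in a real writer each group becomes one
>   // bucket file (Hive names them 000000_0, 000001_0, ...).
>   def writeBucketed[T](rows: Seq[(Int, T)], numBuckets: Int,
>                        hiveCompatibleHash: Int => Int): Map[Int, Seq[(Int, T)]] =
>     rows.groupBy { case (key, _) =>
>       (hiveCompatibleHash(key) & Int.MaxValue) % numBuckets
>     }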
>  
> Allowing Spark to efficiently read Hive bucketed tables requires a more radical change, so we decided to wait until Data Source V2 supports bucketing and implement the read path there. The read path will not be covered by this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support in Spark.
> Proposal : [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]


