Posted to issues@spark.apache.org by "Bruce Robbins (JIRA)" <ji...@apache.org> on 2019/04/19 15:22:00 UTC

[jira] [Updated] (SPARK-27498) Built-in parquet code path (convertMetastoreParquet=true) does not respect hive.enforce.bucketing

     [ https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-27498:
----------------------------------
    Summary: Built-in parquet code path (convertMetastoreParquet=true) does not respect hive.enforce.bucketing  (was: Built-in parquet code path does not respect hive.enforce.bucketing)

> Built-in parquet code path (convertMetastoreParquet=true) does not respect hive.enforce.bucketing
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27498
>                 URL: https://issues.apache.org/jira/browse/SPARK-27498
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> _Caveat: I can see how this could be intentional if Spark believes that the built-in Parquet code path is creating Hive-compatible bucketed files. However, I assume that is not the case and that this is an actual bug._
>   
>  Spark makes an effort to avoid corrupting bucketed Hive tables unless the user overrides this behavior by setting hive.enforce.bucketing and hive.enforce.sorting to false.
> However, this safeguard falls down when Spark uses the built-in Parquet code path (spark.sql.hive.convertMetastoreParquet=true, the default) to write to the Hive table.
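> For reference, overriding the safeguard amounts to flipping those two properties to false. A rough sketch of doing that from spark-shell, assuming the usual spark.hadoop.* passthrough of Hadoop/Hive properties into the Hadoop configuration:
> {noformat}
> $ spark-shell \
>     --conf spark.hadoop.hive.enforce.bucketing=false \
>     --conf spark.hadoop.hive.enforce.sorting=false
>
> // With both properties set to false, Spark should only warn (instead of
> // throwing) and let the insert into the bucketed table proceed.
> scala> sql("insert into hivebuckettext1 select 1, 2, 3")
> {noformat}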
> Here's an example.
> In Hive, do this (I create a source table plus two bucketed tables: one where things work as expected, and one where they don't):
> {noformat}
> hive> create table sourcetable as select 1 a, 3 b, 7 c;
> hive> drop table hivebuckettext1;
> hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as textfile;
> hive> insert into hivebuckettext1 select * from sourcetable;
> hive> drop table hivebucketparq1;
> hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet;
> hive> insert into hivebucketparq1 select * from sourcetable;
> {noformat}
> For the text table, things seem to work as expected:
> {noformat}
> scala> sql("insert into hivebuckettext1 select 1, 2, 3")
> 19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
> org.apache.spark.sql.AnalysisException: Output Hive table `default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
> {noformat}
> For the parquet table, the insert just happens:
> {noformat}
> scala> sql("insert into hivebucketparq1 select 1, 2, 3")
> res1: org.apache.spark.sql.DataFrame = []
> scala> 
> {noformat}
> Note also that Spark has changed the table definition of hivebucketparq1 (in the HMS!) so that it is no longer a bucketed table. I have filed a separate Jira for this (SPARK-27497).
> If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as expected.
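> For example, a sketch of the workaround in spark-shell (the error shown is what I would expect, i.e. the same AnalysisException raised above for the text table):
> {noformat}
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("insert into hivebucketparq1 select 1, 2, 3")
> org.apache.spark.sql.AnalysisException: Output Hive table `default`.`hivebucketparq1` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
> {noformat}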
> Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but InsertIntoHadoopFsRelationCommand does not. The check should probably be made in an analyzer rule, while the InsertIntoTable node still holds a HiveTableRelation.
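> To make that concrete, here is a rough, hypothetical sketch of such a rule, written against the 2.4 plan node names. The rule name is illustrative and the config handling is simplified; a real fix would read hive.enforce.bucketing and hive.enforce.sorting from the Hadoop conf the same way InsertIntoHiveTable does.
> {noformat}
> import org.apache.spark.sql.AnalysisException
> import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
> import org.apache.spark.sql.catalyst.plans.logical.{InsertIntoTable, LogicalPlan}
> import org.apache.spark.sql.catalyst.rules.Rule
>
> // Hypothetical rule: fail while the plan still holds the HiveTableRelation,
> // i.e. before RelationConversions rewrites it to a HadoopFsRelation and the
> // bucketing check in InsertIntoHiveTable is bypassed.
> object EnforceHiveBucketing extends Rule[LogicalPlan] {
>   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
>     case InsertIntoTable(r: HiveTableRelation, _, _, _, _)
>         if r.tableMeta.bucketSpec.isDefined =>
>       // Simplified: should consult hive.enforce.bucketing / hive.enforce.sorting here.
>       throw new AnalysisException(
>         s"Output Hive table ${r.tableMeta.identifier} is bucketed but Spark " +
>           "currently does NOT populate bucketed output which is compatible with Hive.")
>   }
> }
> {noformat}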
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org