Posted to issues@spark.apache.org by "Bruce Robbins (JIRA)" <ji...@apache.org> on 2019/04/17 23:41:00 UTC

[jira] [Created] (SPARK-27498) Built-in parquet code path does not respect hive.enforce.bucketing

Bruce Robbins created SPARK-27498:
-------------------------------------

             Summary: Built-in parquet code path does not respect hive.enforce.bucketing
                 Key: SPARK-27498
                 URL: https://issues.apache.org/jira/browse/SPARK-27498
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 3.0.0
            Reporter: Bruce Robbins


_Caveat: I can see how this could be intentional if Spark believes that the built-in Parquet code path is creating Hive-compatible bucketed files. However, I assume that is not the case and that this is an actual bug._
  
Spark makes an effort to avoid corrupting Hive-bucketed tables: it refuses to insert into them unless the user explicitly overrides this protection by setting hive.enforce.bucketing and hive.enforce.sorting to false.
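For example, a user who accepts the risk can disable both checks before inserting. A minimal spark-shell sketch (the table name here is hypothetical, and I'm assuming a SQL-level SET is enough to make these Hive properties visible to the insert path):
{noformat}
scala> sql("set hive.enforce.bucketing=false")
scala> sql("set hive.enforce.sorting=false")
scala> sql("insert into some_bucketed_table select 1, 2, 3")  // hypothetical table; the insert now proceeds
{noformat}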

However, this protection breaks down when Spark uses its built-in Parquet code path to write to the table.

Here's an example.

In Hive, create a source table and two bucketed tables: one (text) where things work as expected, and one (Parquet) where they don't:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, Spark blocks the insert as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table `default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
{noformat}
For the Parquet table, however, the insert just goes through:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in the HMS!) so that it is no longer a bucketed table. I will file a separate Jira on this (SPARK-27497).
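One quick way to see the metastore change from the same session (a sketch; I'm relying on the Num Buckets / Bucket Columns fields that DESCRIBE FORMATTED reports):
{noformat}
scala> // before the insert, Num Buckets shows 10; afterwards the bucketing info is gone
scala> sql("describe formatted hivebucketparq1").show(100, false)
{noformat}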

If you set spark.sql.hive.convertMetastoreParquet=false (forcing Spark onto the Hive insert path instead of the built-in Parquet one), things work as expected.
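For example (untested sketch; based on the text-table behavior above, I'd expect the insert to be rejected with the same AnalysisException):
{noformat}
scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
// expected: org.apache.spark.sql.AnalysisException: Output Hive table `default`.`hivebucketparq1` is bucketed
// but Spark currently does NOT populate bucketed output which is compatible with Hive.
{noformat}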

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but InsertIntoHadoopFsRelationCommand does not.
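For reference, the guard in InsertIntoHiveTable looks roughly like this (my paraphrase, not the verbatim Spark source); InsertIntoHadoopFsRelationCommand performs no equivalent check before writing:
{noformat}
// paraphrased sketch of the bucketing guard on the Hive insert path
if (table.bucketSpec.isDefined) {
  val enforceBucketing = hadoopConf.getBoolean("hive.enforce.bucketing", true)
  val enforceSorting = hadoopConf.getBoolean("hive.enforce.sorting", true)
  if (enforceBucketing || enforceSorting) {
    throw new AnalysisException(s"Output Hive table ${table.identifier} is bucketed but Spark " +
      "currently does NOT populate bucketed output which is compatible with Hive.")
  }
}
{noformat}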

