Posted to issues@spark.apache.org by "Bruce Robbins (JIRA)" <ji...@apache.org> on 2019/04/18 04:05:00 UTC

[jira] [Updated] (SPARK-27498) Built-in parquet code path does not respect hive.enforce.bucketing

     [ https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-27498:
----------------------------------
    Description: 
_Caveat: I can see how this could be intentional if Spark believes that the built-in Parquet code path is creating Hive-compatible bucketed files. However, I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting bucketed Hive tables unless the user overrides this behavior by setting hive.enforce.bucketing and hive.enforce.sorting to false.
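For reference, here is a minimal sketch of how a user would opt out of that safeguard from spark-shell (an assumption on my part: the check picks these settings up from the session configuration at insert time):
{noformat}
scala> // opt out of the bucketed-table safeguard; assumption: these session settings are
scala> // propagated to the Hadoop configuration that the Hive insert path consults
scala> sql("set hive.enforce.bucketing=false")
scala> sql("set hive.enforce.sorting=false")
scala> sql("insert into hivebuckettext1 select 1, 2, 3")  // should now proceed instead of being rejected
{noformat}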

However, this safeguard breaks down when Spark uses the built-in Parquet code path to write to the table.

Here's an example.

In Hive, do this (I create a source table, a bucketed text table where things work as expected, and a bucketed parquet table where they don't):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things seem to work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table `default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in the HMS!) so that it is no longer a bucketed table. I have filed a separate Jira for this (SPARK-27497).
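One way to see that change from the same spark-shell session (a sketch; the metastore change itself is tracked in SPARK-27497):
{noformat}
scala> // before the insert, the output should list the bucket columns and "Num Buckets" = 10;
scala> // after the insert, the bucketing metadata is gone from the HMS definition
scala> sql("describe formatted hivebucketparq1").show(100, false)
{noformat}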

If you set spark.sql.hive.convertMetastoreParquet=false, things work as expected: the insert into the parquet table is rejected just like the insert into the text table.
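For example (a sketch using the tables created above):
{noformat}
scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
scala> // with the conversion disabled, the insert goes through InsertIntoHiveTable and
scala> // should be rejected with the same AnalysisException as the text table
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
{noformat}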

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but InsertIntoHadoopFsRelationCommand does not.
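One way to confirm which command is planned (again just a sketch, using the tables from the example above):
{noformat}
scala> // with spark.sql.hive.convertMetastoreParquet at its default of true, the plan should
scala> // show InsertIntoHadoopFsRelationCommand; with it set to false, InsertIntoHiveTable
scala> sql("explain insert into hivebucketparq1 select 1, 2, 3").show(false)
{noformat}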

 

 

> Built-in parquet code path does not respect hive.enforce.bucketing
> ------------------------------------------------------------------
>
>                 Key: SPARK-27498
>                 URL: https://issues.apache.org/jira/browse/SPARK-27498
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>


