Posted to issues@spark.apache.org by "Yuzhou Sun (Jira)" <ji...@apache.org> on 2021/10/17 03:38:00 UTC

[jira] [Updated] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES

     [ https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuzhou Sun updated SPARK-37027:
-------------------------------
    Attachment: SPARK-37027-test-example.patch

> Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-37027
>                 URL: https://issues.apache.org/jira/browse/SPARK-37027
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5, 3.1.2
>            Reporter: Yuzhou Sun
>            Priority: Trivial
>         Attachments: SPARK-37027-test-example.patch
>
>
> If a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<tableLocation>')}} and {{LOCATION <tableLocation>}}, Spark can return doubled rows when reading the table. This issue seems to be an extension of SPARK-30507.
>  Steps to reproduce:
>  # Create the table and insert records via Hive (Spark does not allow inserting into a table defined like this)
> {code:sql}
> CREATE TABLE `test_table`(
>   `c1` LONG,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocationPath>';
> INSERT INTO TABLE `test_table`
> VALUES (0, '0');
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> {code}
>  # Read the above table from Spark
> {code:sql}
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> -- 0 0
> {code}
> But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark returns the same result as Hive (i.e. a single row).
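> For illustration, the setting can be applied per session before the read. This is only a sketch: it assumes the {{test_table}} created in the steps above and a Spark SQL session with Hive support.
> {code:sql}
> -- Make Spark read the table through the Hive SerDe instead of its built-in Parquet support
> SET spark.sql.hive.convertMetastoreParquet=false;
> SELECT * FROM `test_table`;
> -- per this report, returns a single row:
> -- 0 0
> {code}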
> A similar case: if a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION <tableLocation>}}, Spark reads the rows under both {{anotherPath}} and {{tableLocation}}, regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}. Hive, however, seems to return only the rows under {{tableLocation}}.
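> A possible table definition for this case, adapted from the reproduction DDL. The table name {{test_table_other_path}} is hypothetical, and {{<anotherPath>}} is assumed to be a directory of Parquet files with the same schema, different from {{<tableLocation>}}.
> {code:sql}
> CREATE TABLE `test_table_other_path`(
>   `c1` LONG,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> -- 'path' deliberately points somewhere other than LOCATION
> WITH SERDEPROPERTIES ('path'='<anotherPath>')
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocation>';
> SELECT * FROM `test_table_other_path`;
> -- per this report: Spark returns rows from both <anotherPath> and <tableLocation>,
> -- while Hive seems to return only the rows under <tableLocation>
> {code}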
> Another similar case: if {{path}} is provided in {{TBLPROPERTIES}}, Spark does not double the rows when {{'path'='<tableLocation>'}}. If {{'path'='<anotherPath>'}}, Spark reads the rows under both {{anotherPath}} and {{tableLocation}}. Hive seems to ignore the {{path}} in {{TBLPROPERTIES}} as well.
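> The {{TBLPROPERTIES}} variant might be declared as follows. Again this is a sketch: the table name {{test_table_tblprops}} is hypothetical and {{<anotherPath>}} is assumed to differ from {{<tableLocation>}}.
> {code:sql}
> CREATE TABLE `test_table_tblprops`(
>   `c1` LONG,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocation>'
> -- 'path' is set as a table property here, not as a SerDe property
> TBLPROPERTIES ('path'='<anotherPath>');
> SELECT * FROM `test_table_tblprops`;
> -- per this report: Spark returns rows from both <anotherPath> and <tableLocation>,
> -- while Hive seems to ignore 'path' and return only the rows under <tableLocation>
> {code}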
> Code examples for the above cases (written as a diff patch against {{HiveParquetMetastoreSuite.scala}}) can be found in Attachments.


