Posted to issues@spark.apache.org by "Yuzhou Sun (Jira)" <ji...@apache.org> on 2021/10/17 03:35:00 UTC
[jira] [Created] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES
Yuzhou Sun created SPARK-37027:
----------------------------------
Summary: Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES
Key: SPARK-37027
URL: https://issues.apache.org/jira/browse/SPARK-37027
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.2, 2.4.5
Reporter: Yuzhou Sun
If a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<tableLocation>')}} and {{LOCATION <tableLocation>}}, Spark can return every row twice when reading the table. This issue seems to be an extension of SPARK-30507.
Steps to reproduce:
# Create the table and insert records via Hive (Spark does not allow inserting into a table defined like this)
{code:sql}
CREATE TABLE `test_table`(
`c1` BIGINT,
`c2` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>';
INSERT INTO TABLE `test_table`
VALUES (0, '0');
SELECT * FROM `test_table`;
-- will return
-- 0 0
{code}
# Read the above table from Spark
{code:sql}
SELECT * FROM `test_table`;
-- will return
-- 0 0
-- 0 0
{code}
However, if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark returns the same result as Hive (i.e. a single row).
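For reference, this setting can be applied per session before running the query. This is a minimal sketch, assuming the {{test_table}} from the reproduction steps. The single-row result is the behavior described in this report, not an independently verified output:
{code:sql}
-- Use the Hive SerDe instead of Spark's built-in Parquet support
-- for Parquet tables defined in the Hive metastore
SET spark.sql.hive.convertMetastoreParquet=false;
SELECT * FROM `test_table`;
-- reported result: a single row
-- 0 0
{code}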
A similar case: if a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION <tableLocation>}}, Spark reads the rows under {{anotherPath}} as well as the rows under {{tableLocation}}, regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}. Hive, in contrast, seems to return only the rows under {{tableLocation}}.
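This second case can be sketched with a variant of the DDL from the reproduction steps. The table name {{test_table_other_path}} is hypothetical, {{<anotherPath>}} and {{<tableLocationPath>}} remain placeholders, and {{BIGINT}} is used for {{c1}} because both Hive and Spark accept it. The comments restate the behavior reported here and have not been verified separately:
{code:sql}
-- 'path' in SERDEPROPERTIES points somewhere other than LOCATION
CREATE TABLE `test_table_other_path`(
`c1` BIGINT,
`c2` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<anotherPath>')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>';
SELECT * FROM `test_table_other_path`;
-- Hive (as reported): only rows under <tableLocationPath>
-- Spark (as reported): rows under <anotherPath> plus rows under <tableLocationPath>,
-- whatever spark.sql.hive.convertMetastoreParquet is set to
{code}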
Another similar case: if {{path}} is provided in {{TBLPROPERTIES}}, Spark does not double the rows when {{'path'='<tableLocation>'}}. If {{'path'='<anotherPath>'}}, Spark reads the rows under {{anotherPath}} as well as the rows under {{tableLocation}}, while Hive seems to ignore the {{path}} in {{TBLPROPERTIES}} in both situations.
Code examples for the above cases (written as a diff patch against {{HiveParquetMetastoreSuite.scala}}) can be found in the Attachments.
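The {{TBLPROPERTIES}} variant might look like the following sketch. The table name {{test_table_tblprops}} is hypothetical, the paths are placeholders, and {{BIGINT}} is used for {{c1}} because both Hive and Spark accept it. The comments only restate what is reported above:
{code:sql}
-- 'path' is set in TBLPROPERTIES rather than SERDEPROPERTIES
CREATE TABLE `test_table_tblprops`(
`c1` BIGINT,
`c2` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>'
TBLPROPERTIES ('path'='<anotherPath>');
SELECT * FROM `test_table_tblprops`;
-- Hive (as reported): ignores 'path', returns only rows under <tableLocationPath>
-- Spark (as reported): rows under <anotherPath> plus rows under <tableLocationPath>
-- With 'path'='<tableLocationPath>' instead, Spark is reported not to double the rows
{code}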