Posted to issues@spark.apache.org by "Yuzhou Sun (Jira)" <ji...@apache.org> on 2021/10/17 03:35:00 UTC

[jira] [Created] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES

Yuzhou Sun created SPARK-37027:
----------------------------------

             Summary: Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES
                 Key: SPARK-37027
                 URL: https://issues.apache.org/jira/browse/SPARK-37027
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.2, 2.4.5
            Reporter: Yuzhou Sun


If a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<tableLocation>')}} and {{LOCATION <tableLocation>}}, Spark can return doubled rows when reading the table. This issue seems to be an extension of SPARK-30507.


 Steps to reproduce:
 # Create the table and insert records via Hive (Spark does not allow inserting into a table defined like this)
{code:sql}
CREATE TABLE `test_table`(
  `c1` LONG,
  `c2` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>';

INSERT INTO TABLE `test_table`
VALUES (0, '0');

SELECT * FROM `test_table`;
-- will return
-- 0 0
{code}

 # Read the table above from Spark
{code:sql}
SELECT * FROM `test_table`;
-- will return
-- 0 0
-- 0 0
{code}

However, if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark returns the same result as Hive (i.e. a single row).
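
As a minimal sketch of that check, assuming a Spark SQL session with Hive support and the {{test_table}} created in the steps above, the setting can be changed for the session and the query rerun:
{code:sql}
-- Disable Spark's built-in Parquet reader for Hive metastore Parquet tables,
-- so the table is read through the Hive SerDe instead.
SET spark.sql.hive.convertMetastoreParquet=false;

SELECT * FROM `test_table`;
-- reported to return a single row, matching Hive:
-- 0 0
{code}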

A similar case: if a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION <tableLocation>}}, Spark reads the rows under both {{anotherPath}} and {{tableLocation}}, regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}. Hive, however, seems to return only the rows under {{tableLocation}}.
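
A table definition for this case might look like the following sketch. The table name {{test_table_other_path}} is hypothetical, {{<anotherPath>}} and {{<tableLocation>}} are placeholders for two different directories, and {{BIGINT}} is used for the numeric column:
{code:sql}
-- Hypothetical table: 'path' in SERDEPROPERTIES differs from LOCATION.
CREATE TABLE `test_table_other_path`(
  `c1` BIGINT,
  `c2` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<anotherPath>')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocation>';

-- Per the report: Spark returns rows from both directories,
-- while Hive seems to return only the rows under <tableLocation>.
SELECT * FROM `test_table_other_path`;
{code}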

Another similar case: if {{path}} is provided in {{TBLPROPERTIES}}, Spark does not double the rows when {{'path'='<tableLocation>'}}. If {{'path'='<anotherPath>'}}, however, Spark reads the rows under both {{anotherPath}} and {{tableLocation}}, while Hive still seems to ignore the {{path}} in {{TBLPROPERTIES}}.
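
The {{TBLPROPERTIES}} variant could be sketched as follows. Again, the table name {{test_table_tblprops}} is hypothetical and the two paths are placeholders:
{code:sql}
-- Hypothetical table: 'path' is set in TBLPROPERTIES rather than SERDEPROPERTIES.
CREATE TABLE `test_table_tblprops`(
  `c1` BIGINT,
  `c2` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocation>'
TBLPROPERTIES ('path'='<anotherPath>');

-- Per the report: Spark reads rows under both <anotherPath> and <tableLocation>;
-- with 'path'='<tableLocation>' instead, the rows are not doubled.
SELECT * FROM `test_table_tblprops`;
{code}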

Code examples for the above cases (a diff patch written against {{HiveParquetMetastoreSuite.scala}}) can be found in the attachments.


