
Posted to issues@spark.apache.org by "Chenxiao Mao (JIRA)" <ji...@apache.org> on 2018/09/10 05:46:00 UTC

[jira] [Created] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

Chenxiao Mao created SPARK-25391:
------------------------------------

             Summary: Make behaviors consistent when converting parquet hive table to parquet data source
                 Key: SPARK-25391
                 URL: https://issues.apache.org/jira/browse/SPARK-25391
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Chenxiao Mao


Parquet data source tables and hive parquet tables resolve parquet fields differently. As a result, when {{spark.sql.hive.convertMetastoreParquet}} is true, users might see inconsistent behavior. The differences are:
 * Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, neither data source tables nor hive tables respect {{spark.sql.caseSensitive}}: data source tables always do case-sensitive parquet field resolution, while hive tables always do case-insensitive parquet field resolution, regardless of whether {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 makes data source tables respect {{spark.sql.caseSensitive}}, while hive serde table behavior is unchanged.
 * How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, data source tables do case-sensitive resolution and return columns with the corresponding letter cases, while hive tables always return the first matched column, ignoring case. SPARK-25132 makes data source tables throw an exception when there is ambiguity, while hive table behavior is unchanged.
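
The three resolution behaviors described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's actual implementation: the function names and the sample field list are hypothetical, and the real logic lives in Spark's parquet read path.

```python
from typing import List, Optional


def resolve_case_sensitive(requested: str, parquet_fields: List[str]) -> Optional[str]:
    """Exact-name match only (data source tables without SPARK-25132)."""
    return requested if requested in parquet_fields else None


def resolve_first_match(requested: str, parquet_fields: List[str]) -> Optional[str]:
    """Case-insensitive; the first match in file order wins (hive table behavior)."""
    for field in parquet_fields:
        if field.lower() == requested.lower():
            return field
    return None


def resolve_or_fail_on_ambiguity(requested: str, parquet_fields: List[str]) -> Optional[str]:
    """Case-insensitive; more than one match is an error
    (data source tables with SPARK-25132, in case-insensitive mode)."""
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        raise ValueError(f"Found duplicate field(s) for {requested!r}: {matches}")
    return matches[0] if matches else None


# Hypothetical parquet file whose schema contains both "A" and "a".
fields = ["id", "A", "a"]
print(resolve_case_sensitive("a", fields))  # exact match on "a"
print(resolve_first_match("a", fields))     # first case-insensitive match, "A"
```

With the same file and the same requested column, the first function picks {{a}}, the second picks {{A}}, and the third raises an error, which is the inconsistency this ticket describes.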

This ticket aims to make behaviors consistent when converting a hive table to a data source table.
 * The conversion is only valid if behavior stays consistent, so we skip the conversion in case-sensitive mode, because hive parquet tables always do case-insensitive field resolution.
 * In case-insensitive mode, when converting a hive parquet table to a parquet data source, we switch the duplicated-field resolution mode so that the parquet data source picks the first matched field, the same behavior as a hive parquet table, to keep behaviors consistent.
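
The proposed rule can be sketched as a small decision function. This is a hypothetical illustration in plain Python under the assumptions stated in the bullets above; {{ReadPlan}}, {{plan_hive_parquet_read}} and the policy strings are made-up names, not Spark APIs.

```python
from dataclasses import dataclass

# Hypothetical policy labels, used only for this illustration.
FIRST_MATCH = "first_match"              # hive parquet table behavior
FAIL_ON_AMBIGUITY = "fail_on_ambiguity"  # data source default after SPARK-25132


@dataclass(frozen=True)
class ReadPlan:
    use_data_source_reader: bool   # True if the hive table is converted
    duplicate_field_policy: str    # how duplicated fields are resolved


def plan_hive_parquet_read(convert_metastore_parquet: bool,
                           case_sensitive: bool) -> ReadPlan:
    """Decide how to read a hive parquet table under the proposed rule."""
    if not convert_metastore_parquet or case_sensitive:
        # Skip the conversion: hive resolution is always case-insensitive,
        # so a case-sensitive data source read would behave differently.
        return ReadPlan(use_data_source_reader=False,
                        duplicate_field_policy=FIRST_MATCH)
    # Convert, but override the data source default (FAIL_ON_AMBIGUITY)
    # so that duplicated fields resolve the way hive resolves them.
    return ReadPlan(use_data_source_reader=True,
                    duplicate_field_policy=FIRST_MATCH)


print(plan_hive_parquet_read(convert_metastore_parquet=True, case_sensitive=False))
```

In this sketch every branch ends with the first-match policy, so a query over a hive parquet table resolves fields the same way whether or not the conversion happens.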


