Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/02/04 19:43:00 UTC

[jira] [Updated] (SPARK-30559) spark.sql.hive.caseSensitiveInferenceMode does not work with Hive

     [ https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-30559:
----------------------------------
    Summary: spark.sql.hive.caseSensitiveInferenceMode does not work with Hive  (was: Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive)

> spark.sql.hive.caseSensitiveInferenceMode does not work with Hive
> -----------------------------------------------------------------
>
>                 Key: SPARK-30559
>                 URL: https://issues.apache.org/jira/browse/SPARK-30559
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: EMR 28.1 with Spark 2.4.4, Hadoop 2.8.5 and Hive 2.3.6
>            Reporter: Ori Popowski
>            Priority: Major
>
> In Spark SQL, the spark.sql.hive.caseSensitiveInferenceMode modes INFER_ONLY and INFER_AND_SAVE do not work as intended. Both are supposed to infer a case-sensitive schema from the underlying files, but:
>  # INFER_ONLY never works: it always uses the lowercase column names from the Hive metastore schema
>  # INFER_AND_SAVE only works from the second time {{spark.sql("SELECT …")}} is called (the first call writes the inferred schema to TBLPROPERTIES in the metastore, and subsequent calls read that schema back, so they do work)
> h3. Expected behavior (according to SPARK-19611)
> INFER_ONLY - infer the schema from the underlying files
> INFER_AND_SAVE - infer the schema from the underlying files, save it to the metastore, and read it from the metastore on any subsequent calls
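> The mode can be set either on the spark-shell command line (as in the reproduction below) or when building the session. A minimal programmatic sketch, assuming a Hive-enabled build (the mode value here is just an example):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> // Hive-enabled session with the inference mode set up front;
> // equivalent to passing --conf spark.sql.hive.caseSensitiveInferenceMode=... to spark-shell.
> val spark = SparkSession.builder()
>   .master("local[*]")
>   .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
>   .enableHiveSupport()
>   .getOrCreate()
> {code}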
> h2. Reproduce
> h3. Prepare the data
> h4. 1) Create a Parquet file
> {code:scala}
> scala> List(("a", 1), ("b", 2)).toDF("theString", "theNumber").write.parquet("hdfs:///t"){code}
>  
> h4. 2) Inspect the Parquet files
> {code:sh}
> $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-00000-….snappy.parquet
> {"theString":"a","theNumber":1}
> $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-00001-….snappy.parquet
> {"theString":"b","theNumber":2}{code}
> We see that the Parquet files are saved with camelCase column names.
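> A quick cross-check (not part of the original repro, just to confirm the file schema independently of the metastore): reading the directory straight from Parquet should show the same camelCase names.
> {code:scala}
> scala> // bypass the metastore: the schema comes straight from the Parquet footers
> scala> spark.read.parquet("hdfs:///t").printSchema()
> root
>  |-- theString: string (nullable = true)
>  |-- theNumber: integer (nullable = true)
> {code}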
> h4. 3) Create a Hive table 
> {code:sql}
> hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
>  > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>  > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>  > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>  > LOCATION 'hdfs:///t';{code}
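> Note that Hive lowercases column names in the metastore, which is exactly why Spark needs to infer the case-sensitive schema from the files. Illustrative output (not captured from the original run):
> {code:sql}
> hive> DESCRIBE t;
> thestring            string
> thenumber            int
> {code}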
>  
> h3. Reproduce INFER_ONLY bug
> h4. 1) Read the table in Spark using INFER_ONLY
> {code:sh}
> $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> thestring
> thenumber
> {code}
> h4. Conclusion
> When INFER_ONLY is set, column names are always lowercase.
> h3. Reproduce INFER_AND_SAVE bug
> h4. 1) Run for the first time
> {code:sh}
> $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> thestring
> thenumber{code}
> We see that the column names are lowercase.
> h4. 2) Run for the second time
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> theString
> theNumber{code}
> We see that the column names are camelCase.
> h4. Conclusion
> When INFER_AND_SAVE is set, column names are lowercase on the first call and camelCase on subsequent calls.
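> A way to see why only the first call misbehaves (a hedged check; the property keys below are how Spark persists case-sensitive schemas for Hive tables, not output captured from this run): after the first INFER_AND_SAVE query the inferred schema should appear in the table properties, and subsequent calls read it back from there instead of re-inferring.
> {code:sql}
> hive> SHOW TBLPROPERTIES t;
> ...
> spark.sql.sources.schema.numParts    1
> spark.sql.sources.schema.part.0      {"type":"struct","fields":[{"name":"theString", ...},{"name":"theNumber", ...}]}
> {code}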
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org