Posted to issues@spark.apache.org by "Ori Popowski (Jira)" <ji...@apache.org> on 2020/01/18 12:13:00 UTC

[jira] [Updated] (SPARK-30559) Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive

     [ https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ori Popowski updated SPARK-30559:
---------------------------------
    Description: 
In Spark SQL, the spark.sql.hive.caseSensitiveInferenceMode modes INFER_ONLY and INFER_AND_SAVE do not work as intended. They are supposed to infer a case-sensitive schema from the underlying files, but they do not:
 # INFER_ONLY never works: it always uses the lowercased column names from the Hive metastore schema
 # INFER_AND_SAVE only works from the second spark.sql("SELECT …") call onward (the first call writes the inferred schema to TBLPROPERTIES in the metastore, and subsequent calls read that schema, so they do work)

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files

INFER_AND_SAVE - infer the schema from the underlying files, save it to the metastore, and read it from the metastore on any subsequent calls
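For reference, the mode can also be pinned when the SparkSession is built instead of via --conf as in the transcripts below. A minimal sketch (the app name is arbitrary; INFER_AND_SAVE is the default in 2.4, and NEVER_INFER is the third supported value):
{code:java}
import org.apache.spark.sql.SparkSession

// Hive-enabled session with the schema-inference mode under test.
// Supported values: INFER_AND_SAVE (default in 2.4), INFER_ONLY, NEVER_INFER.
val spark = SparkSession.builder()
  .appName("SPARK-30559-repro") // arbitrary name
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
  .enableHiveSupport()
  .getOrCreate()
{code}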
h2. Reproduce
h3. Prepare the data
h4. 1) Create a Parquet dataset
{code:java}
scala> List(("a", 1), ("b", 2)).toDF("theString", "theNumber").write.parquet("hdfs:///t"){code}
 
h4. 2) Inspect the Parquet files
{code:java}
$ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-00000-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-00001-….snappy.parquet
{"theString":"b","theNumber":2}{code}
We see that they are saved with camelCase column names.
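As a cross-check, reading the files directly through the Parquet data source (no metastore involved) should report the camelCase names as well:
{code:java}
scala> spark.read.parquet("hdfs:///t").columns.foreach(println)
theString
theNumber
{code}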
h4. 3) Create a Hive table 
{code:java}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
 > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
 > LOCATION 'hdfs:///t';{code}
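Hive itself is case-insensitive and stores column names lowercased in the metastore, which is why Spark has to infer the case-sensitive schema from the files at all. A quick check from the Hive CLI (output illustrative):
{code:java}
hive> DESCRIBE t;
thestring            string
thenumber            int
{code}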
 
h3. Reproduce INFER_ONLY bug
h4. 1) Read the table in Spark using INFER_ONLY
{code:java}
$ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
{code:java}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
thestring
thenumber
{code}
h4. Conclusion

When INFER_ONLY is set, the column names are always lowercase.
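A way to confirm that INFER_ONLY also wrote nothing back: list the table properties and look for the spark.sql.sources.schema.* keys Spark uses to persist an inferred schema (a sketch; paste into the same spark-shell session):
{code:java}
import org.apache.spark.sql.functions.col

// Run after the INFER_ONLY query above. Since INFER_ONLY must not touch
// the metastore, no spark.sql.sources.schema.* rows are expected here.
spark.sql("SHOW TBLPROPERTIES default.t")
  .filter(col("key").startsWith("spark.sql.sources.schema"))
  .show(truncate = false)
{code}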
h3. Reproduce INFER_AND_SAVE bug
h4. 1) Run for the first time
{code:java}
$ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
{code:java}
scala> spark.sql("select * from default.t").columns.foreach(println)
thestring
thenumber{code}
We see that the column names are lowercase.
h4. 2) Run for the second time
{code:java}
scala> spark.sql("select * from default.t").columns.foreach(println)
theString
theNumber{code}
We see that the column names are now camelCase.
h4. Conclusion

When INFER_AND_SAVE is set, the column names are lowercase on the first call and camelCase on subsequent calls.
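This matches the save-then-read design: the first call persists the inferred, case-preserving schema into the table properties, and only subsequent calls read it back. Re-running the TBLPROPERTIES check from the INFER_ONLY section should now show the saved schema (expected keys sketched in the comment; the JSON value is abbreviated):
{code:java}
import org.apache.spark.sql.functions.col

// After the first INFER_AND_SAVE query the schema is persisted, e.g.:
//   spark.sql.sources.schema.numParts -> 1
//   spark.sql.sources.schema.part.0   -> {"type":"struct","fields":[{"name":"theString",...
spark.sql("SHOW TBLPROPERTIES default.t")
  .filter(col("key").startsWith("spark.sql.sources.schema"))
  .show(truncate = false)
{code}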

 

 

> Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-30559
>                 URL: https://issues.apache.org/jira/browse/SPARK-30559
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: EMR 5.28.1 with Spark 2.4.4, Hadoop 2.8.5, and Hive 2.3.6
>            Reporter: Ori Popowski
>            Priority: Major
>

