Posted to issues@spark.apache.org by "L. C. Hsieh (Jira)" <ji...@apache.org> on 2020/05/30 07:39:00 UTC
[jira] [Commented] (SPARK-31799) Spark Datasource Tables Creating Incorrect Hive Metadata
[ https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120136#comment-17120136 ]
L. C. Hsieh commented on SPARK-31799:
-------------------------------------
This happens when Spark SQL thinks it cannot save the data source table in a Hive-compatible way. Data source tables stored this way are readable only by Spark.
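To make that decision concrete, here is a minimal, self-contained sketch of the provider-to-serde lookup that Spark 2.4's HiveSerDe.scala performs. This is an illustrative model, not the actual Spark source; the class names are the real Hive format/serde classes, but the map below is trimmed to two entries to show the behavior.

```scala
// Simplified model of the provider-to-serde lookup in Spark 2.4's
// HiveSerDe.scala (a sketch for illustration, not the actual source).
case class HiveCompatSerDe(
    inputFormat: String,
    outputFormat: String,
    serde: String)

// The real map also covers orc, avro, textfile, sequencefile, rcfile;
// the point is that "csv" and "json" have no entry.
val serdeMap: Map[String, HiveCompatSerDe] = Map(
  "parquet" -> HiveCompatSerDe(
    "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
  "orc" -> HiveCompatSerDe(
    "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
    "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    "org.apache.hadoop.hive.ql.io.orc.OrcSerde"))

def sourceToSerDe(provider: String): Option[HiveCompatSerDe] =
  serdeMap.get(provider.toLowerCase)

// A hit means the table metadata can be written Hive-compatibly; a miss
// (csv, json) triggers the Spark-only fallback with the placeholder
// location and `col array<string>` schema seen in the report below.
assert(sourceToSerDe("parquet").isDefined)
assert(sourceToSerDe("CSV").isEmpty)
assert(sourceToSerDe("json").isEmpty)
```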
> Spark Datasource Tables Creating Incorrect Hive Metadata
> --------------------------------------------------------
>
> Key: SPARK-31799
> URL: https://issues.apache.org/jira/browse/SPARK-31799
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.5
> Reporter: Anoop Johnson
> Priority: Major
>
> I found that if I create a CSV or JSON table using Spark SQL, it writes the wrong Hive table metadata, breaking compatibility with other query engines like Hive and Presto. Here is a very simple example:
> {code:sql}
> CREATE TABLE test_csv (id String, name String)
> USING csv
> LOCATION 's3://[...]'
> ;
> {code}
> If you describe the table using Presto, you will see:
> {code:sql}
> CREATE EXTERNAL TABLE `test_csv`(
>   `col` array<string> COMMENT 'from deserializer')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'path'='s3://[...]')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.SequenceFileInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
> LOCATION
>   's3://[...]/test_csv-__PLACEHOLDER__'
> TBLPROPERTIES (
>   'spark.sql.create.version'='2.4.4',
>   'spark.sql.sources.provider'='csv',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
>   'transient_lastDdlTime'='1590196086')
> ;
> {code}
> The table location is set to a placeholder value, and the schema is always set to _col array<string>_. The serde/inputformat pair is also wrong: it says _SequenceFileInputFormat_ and _LazySimpleSerDe_ even though the requested format is CSV.
> All the correct metadata is written to custom table properties with the prefix _spark.sql_, but Hive and Presto do not understand these properties, so the table is broken for them. I could reproduce this with JSON too, but not with Parquet.
> I root-caused this issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala. As a result, these default values are written.
> Is there a reason why CSV and JSON are not handled? I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes would need to be on the Spark classpath; otherwise, table creation will fail.
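A patch along the lines the reporter suggests would extend that serde map with Hive's CSV and JSON serdes. The sketch below is a hypothetical illustration, not an actual Spark patch: OpenCSVSerde and the HCatalog JsonSerDe are real Hive classes, but mapping them this way is an assumption about what such a fix might look like, and JsonSerDe in particular ships in hive-hcatalog-core, which is frequently absent from the Spark classpath (the caveat mentioned above).

```scala
// Hypothetical serde-map entries a fix might add (illustrative sketch,
// not an actual Spark patch).
case class HiveCompatSerDe(
    inputFormat: String,
    outputFormat: String,
    serde: String)

// Text-backed formats share the plain-text input/output formats.
val textIn  = "org.apache.hadoop.mapred.TextInputFormat"
val textOut = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

val proposedEntries: Map[String, HiveCompatSerDe] = Map(
  // OpenCSVSerde ships with hive-serde.
  "csv" -> HiveCompatSerDe(textIn, textOut,
    "org.apache.hadoop.hive.serde2.OpenCSVSerde"),
  // JsonSerDe ships with hive-hcatalog-core, which is often NOT on the
  // Spark classpath -- hence the caveat about table creation failing.
  "json" -> HiveCompatSerDe(textIn, textOut,
    "org.apache.hive.hcatalog.data.JsonSerDe"))

assert(proposedEntries("csv").serde.endsWith("OpenCSVSerde"))
assert(proposedEntries("json").serde.endsWith("JsonSerDe"))
```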
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org