Posted to issues@spark.apache.org by "Anoop Johnson (Jira)" <ji...@apache.org> on 2020/05/23 01:32:00 UTC
[jira] [Updated] (SPARK-31799) Spark Datasource Tables Creating Incorrect Hive Metadata
[ https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anoop Johnson updated SPARK-31799:
----------------------------------
Description:
If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive metadata, breaking compatibility with other query engines like Hive and Presto. Here is a very simple example:
{code:sql}
CREATE TABLE test_csv (id String, name String)
USING csv
LOCATION 's3://[...]'
;
{code}
If you describe the table using Presto, you will see:
{code:sql}
CREATE EXTERNAL TABLE `test_csv`(
  `col` array<string> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://...')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  's3://.../test_csv-__PLACEHOLDER__'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.4',
  'spark.sql.sources.provider'='csv',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"string","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]}',
  'transient_lastDdlTime'='1590196086')
;
{code}
The table location is set to a placeholder value, and the schema is always set to {{`col` array<string>}}. The SerDe/InputFormat is also wrong: it says SequenceFile even though the requested format is CSV.
All the right metadata is written to the custom table properties with the {{spark.sql}} prefix. However, Hive and Presto do not understand these table properties, which breaks them.
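For what it's worth, the value of {{spark.sql.sources.schema.part.0}} is just a JSON-serialized Spark {{StructType}}. Here is a minimal, self-contained sketch (the object name {{DecodeSchemaProperty}} is made up for illustration) showing that Spark itself can rebuild the real schema from that property, while Hive and Presto cannot:
{code:scala}
import org.apache.spark.sql.types.{DataType, StructType}

object DecodeSchemaProperty {
  def main(args: Array[String]): Unit = {
    // JSON copied from the spark.sql.sources.schema.part.0 table property above.
    val json =
      """{"type":"struct","fields":[
        |{"name":"id","type":"string","nullable":true,"metadata":{}},
        |{"name":"name","type":"string","nullable":true,"metadata":{}}]}""".stripMargin
    // Spark can rebuild the real two-column schema from the property value...
    val schema = DataType.fromJson(json).asInstanceOf[StructType]
    println(schema.treeString)
    // ...but Hive and Presto only see the bogus `col` array<string> column.
  }
}
{code}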
I could reproduce this with JSON too, but not with Parquet. I root-caused the issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala.
Is there a reason why they are not handled? I could send a patch to fix this, but the caveat is that the CSV and JSON Hive SerDes would need to be on the Spark classpath; otherwise, table creation will fail.
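To make the suggestion concrete, here is a minimal sketch of the kind of entries that appear to be missing from the serde mapping. This is not Spark's actual code: {{HiveSerDeInfo}} and {{MissingSerdeEntries}} are made-up names, and the SerDe class names are the standard Hive ones, which may or may not be on a given Spark classpath (the caveat above):
{code:scala}
// Hypothetical sketch only -- not the actual contents of HiveSerDe.scala.
case class HiveSerDeInfo(
    inputFormat: Option[String] = None,
    outputFormat: Option[String] = None,
    serde: Option[String] = None)

object MissingSerdeEntries {
  // Roughly the shape of the csv/json mappings that seem to be missing.
  val entries: Map[String, HiveSerDeInfo] = Map(
    "csv" -> HiveSerDeInfo(
      inputFormat = Some("org.apache.hadoop.mapred.TextInputFormat"),
      outputFormat = Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
      serde = Some("org.apache.hadoop.hive.serde2.OpenCSVSerde")),
    "json" -> HiveSerDeInfo(
      inputFormat = Some("org.apache.hadoop.mapred.TextInputFormat"),
      outputFormat = Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
      serde = Some("org.apache.hive.hcatalog.data.JsonSerDe")))

  def main(args: Array[String]): Unit = entries.foreach(println)
}
{code}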
> Spark Datasource Tables Creating Incorrect Hive Metadata
> --------------------------------------------------------
>
> Key: SPARK-31799
> URL: https://issues.apache.org/jira/browse/SPARK-31799
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.5
> Reporter: Anoop Johnson
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)