Posted to issues@spark.apache.org by "Ranga Reddy (Jira)" <ji...@apache.org> on 2021/08/10 09:04:00 UTC
[jira] [Comment Edited] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396076#comment-17396076 ]
Ranga Reddy edited comment on SPARK-26208 at 8/10/21, 9:03 AM:
---------------------------------------------------------------
Hi [~koertkuipers]
The above code works only when the dataframe is created manually.
The issue still persists when we create a dataframe by reading a Hive table.
*Hive Table:*
{code:java}
CREATE EXTERNAL TABLE `test_empty_csv_table`(
`col1` bigint,
`col2` bigint)
STORED AS ORC
LOCATION '/tmp/test_empty_csv_table';{code}
*spark-shell*
{code:java}
val tableName = "test_empty_csv_table"
val emptyCSVFilePath = "/tmp/empty_csv_file"

// The Hive table is empty, so the written CSV output contains no rows
// and, because of this bug, no header line either.
val df = spark.sql(s"select * from $tableName")
df.printSchema()
df.write.format("csv").option("header", true).mode("overwrite").save(emptyCSVFilePath)

// Fails: there is no header line to infer the schema from.
val df2 = spark.read.option("header", true).csv(emptyCSVFilePath)
{code}
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
... 49 elided{code}
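As a possible workaround until the header is written for empty output, the read side can be given the schema explicitly instead of relying on header inference; this is only a sketch, and `csvSchema` is an illustrative name, not part of the original report:
{code:java}
// Sketch of a workaround (not an official fix): reuse the writing
// DataFrame's own schema so no inference is attempted on read.
val csvSchema = df.schema
val df2 = spark.read
  .schema(csvSchema)           // explicit schema bypasses CSV inference
  .option("header", true)
  .csv(emptyCSVFilePath)
df2.printSchema()              // same schema as df, even with zero rows
{code}
This avoids the AnalysisException because {{DataSource.getOrInferFileFormatSchema}} only attempts inference when no user-specified schema is present.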
> Empty dataframe does not roundtrip for csv with header
> ------------------------------------------------------
>
> Key: SPARK-26208
> URL: https://issues.apache.org/jira/browse/SPARK-26208
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date: Sun Nov 18 16:02:15 2018 +0800
> Reporter: koert kuipers
> Assignee: Koert Kuipers
> Priority: Minor
> Fix For: 3.0.0
>
>
> when we write an empty part file for csv with header=true, we fail to write the header, so the result cannot be read back in.
> when header=true, a part file with zero rows should still have a header
--
This message was sent by Atlassian Jira
(v8.3.4#803005)