Posted to issues@spark.apache.org by "Ranga Reddy (Jira)" <ji...@apache.org> on 2021/08/10 09:04:00 UTC
[jira] [Comment Edited] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

    [ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396076#comment-17396076 ] 

Ranga Reddy edited comment on SPARK-26208 at 8/10/21, 9:03 AM:
---------------------------------------------------------------

Hi [~koertkuipers]

The above code works only when the dataframe is created manually.

The issue still persists when we create the dataframe by reading a Hive table.

*Hive Table:*
{code:sql}
CREATE EXTERNAL TABLE `test_empty_csv_table`( 
 `col1` bigint, 
 `col2` bigint) 
STORED AS ORC 
LOCATION '/tmp/test_empty_csv_table';{code}
*spark-shell:*
{code:scala}
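val tableName = "test_empty_csv_table"
val emptyCSVFilePath = "/tmp/empty_csv_file"
// Read the Hive table (it has no rows), so df carries a schema but no data
val df = spark.sql(s"select * from $tableName")
df.printSchema()
// Write the empty dataframe as CSV, requesting a header row
df.write.format("csv").option("header", true).mode("overwrite").save(emptyCSVFilePath)
// Reading the result back fails with the AnalysisException shown below
val df2 = spark.read.option("header", true).csv(emptyCSVFilePath)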
{code}
 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
 ... 49 elided{code}
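
As the exception message suggests, one possible workaround is to specify the schema manually when reading the CSV back, so that Spark does not have to infer it from the files. The following is only a minimal sketch, not a fix for the underlying issue. It assumes it is run in spark-shell (where {{spark}} is predefined), and the schema is an assumption derived from the Hive DDL above ({{bigint}} mapped to {{LongType}}).
{code:scala}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Same output path as in the reproduction above
val emptyCSVFilePath = "/tmp/empty_csv_file"

// Assumed schema, mirroring the Hive table definition (bigint -> LongType)
val schema = StructType(Seq(
  StructField("col1", LongType, nullable = true),
  StructField("col2", LongType, nullable = true)
))

// With a user-supplied schema, schema inference is skipped on read
val df2 = spark.read.option("header", true).schema(schema).csv(emptyCSVFilePath)
df2.printSchema()
{code}
When the original dataframe is still available in the same session, passing {{df.schema}} to {{schema(...)}} would avoid writing the schema out by hand.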


> Empty dataframe does not roundtrip for csv with header
> ------------------------------------------------------
>
>                 Key: SPARK-26208
>                 URL: https://issues.apache.org/jira/browse/SPARK-26208
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date:   Sun Nov 18 16:02:15 2018 +0800
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> When we write an empty part file for CSV with header=true, we fail to write the header, so the result cannot be read back in.
> When header=true, a part file with zero rows should still have a header.


