Posted to issues@spark.apache.org by "Robert V (JIRA)" <ji...@apache.org> on 2019/01/21 17:46:00 UTC

[jira] [Updated] (SPARK-26678) Empty values end up as quoted empty strings in CSV files

     [ https://issues.apache.org/jira/browse/SPARK-26678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert V updated SPARK-26678:
-----------------------------
    Description: 
h1. Problem statement

Prior to Spark version 2.4.0, empty string values were written to CSV as unquoted strings.

From version 2.4.0, empty string values are written as "" in CSV files. This is a problem if the consuming application expects empty values not to be wrapped in quotes (which is certainly the case if the CSV is intended for Microsoft PowerBI, for example, as it doesn't handle CSV files with double quotes).

The following code ends up with the following results in the different versions of Spark:

 
||Spark version||Code||Result||
|2.3.0|{code:java}
val df = List("aa", "", "bb").toDF("name")
df.coalesce(1).write.option("header", "true").csv("/23.csv")
{code}|{noformat}
name
aa
bb
{noformat}|
|2.4.0|{code:java}
val df = List("aa", "", "bb").toDF("name")
df.coalesce(1).write.option("header", "true").csv("/24.csv")
{code}|{noformat}
name
aa
""
bb
{noformat}|
|2.4.0|{code:java}
val df = List("aa", "", "bb").toDF("name")
df.coalesce(1).write.option("header", "true").option("quote", "").csv("/24-2.csv")
{code}|{noformat}
name
aa
""
bb
{noformat}|

If the intention was to produce standard-looking CSV files (even though a CSV standard doesn't exist), we still need a way to disable automatic quoting.
h1. Proposed solution

Using the option
{code:java}
option("quote", "")
{code}
should disable quotes.
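
In the meantime, a possible workaround on 2.4.0 is the CSV writer's emptyValue option, which sets the string written for empty values. The sketch below repeats the example above with that option set to an unquoted empty string. The object name, the local master setting, and the output path /24-3.csv are assumptions made for illustration only; they are not part of the original report, and the resulting file is not shown here.
{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical standalone wrapper; in spark-shell only the body of main is needed.
object EmptyValueWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("empty-value-workaround")
      .master("local[1]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Same data as in the table above: one empty string between two values.
    val df = List("aa", "", "bb").toDF("name")

    df.coalesce(1)
      .write
      .option("header", "true")
      // Ask the writer to emit empty strings as nothing rather than "".
      .option("emptyValue", "")
      .csv("/24-3.csv") // hypothetical path, following the naming used above

    spark.stop()
  }
}
{code}
This only changes how empty strings are written. It does not turn off quoting for other values, so the proposal above still stands.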

 

> Empty values end up as quoted empty strings in CSV files
> --------------------------------------------------------
>
>                 Key: SPARK-26678
>                 URL: https://issues.apache.org/jira/browse/SPARK-26678
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Robert V
>            Priority: Major
>              Labels: csv
>


