Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/01/23 07:20:00 UTC

[jira] [Commented] (SPARK-26678) Empty values end up as quoted empty strings in CSV files

    [ https://issues.apache.org/jira/browse/SPARK-26678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749598#comment-16749598 ] 

Hyukjin Kwon commented on SPARK-26678:
--------------------------------------

We should distinguish between an empty string and a missing value. Use the {{emptyValue}} option to control how empty strings are written.
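
For reference, a minimal sketch of how the {{emptyValue}} option might be applied to the reporter's example. It assumes Spark 2.4.0 or later on the classpath and a local session; the object name, app name, and output path are hypothetical and only for illustration. As far as I understand, setting {{emptyValue}} to an empty string should write empty strings unquoted, as before 2.4.0.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; only the "emptyValue" option matters here.
object EmptyValueExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is sufficient to reproduce the issue.
    val spark = SparkSession.builder()
      .appName("empty-value-example")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the issue description: the middle value is an empty string.
    val df = List("aa", "", "bb").toDF("name")

    df.coalesce(1)
      .write
      .option("header", "true")
      // Ask the CSV writer to emit empty strings as nothing instead of "".
      .option("emptyValue", "")
      // Hypothetical output path; the write fails if it already exists.
      .csv("/tmp/24-empty-value.csv")

    spark.stop()
  }
}
```

Missing (null) values are controlled separately by the {{nullValue}} option, so the two cases can be written differently if the consumer needs to tell them apart.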

> Empty values end up as quoted empty strings in CSV files
> --------------------------------------------------------
>
>                 Key: SPARK-26678
>                 URL: https://issues.apache.org/jira/browse/SPARK-26678
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Robert V
>            Priority: Major
>              Labels: csv
>
> h1. Problem statement
> Empty string values were written to CSV as unquoted strings prior to Spark version 2.4.0.
> From version 2.4.0 onward, empty string values end up as "" in CSV files. This is a problem if the output is expected not to have empty values wrapped in quotes (which is certainly the case if the CSV is intended for Microsoft PowerBI, for example, as it doesn't handle CSV files with double quotes).
> The same code produces the following results in different versions of Spark:
>  
> ||Spark version||Code||Result||
> |2.3.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").csv("/23.csv")
> {code}|{noformat}
> name
> aa
> bb
> {noformat}|
> |2.4.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").csv("/24.csv")
> {code}|{noformat}
> name
> aa
> ""
> bb
> {noformat}|
> |2.4.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").option("quote", "").csv("/24-2.csv")
> {code}|{noformat}
> name
> aa
> ""
> bb
> {noformat}|
> If the intention was to produce standard-looking CSV files (even though no CSV standard exists), we still need a way to disable automatic quoting.
> Also, using
> {code:java}
> option("quote", "\u0000")
> {code}
> had no effect; double quotes were still used.
> h1. Proposed solution
> Using the option
> {code:java}
> option("quote", "")
> {code}
> should disable quotes.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
