Posted to issues@spark.apache.org by "Wei Guo (Jira)" <ji...@apache.org> on 2021/12/15 14:34:00 UTC

[jira] [Updated] (SPARK-37604) The parameter emptyValueInRead is suggested to be designed that any fields matching this string will be set as empty values "" when reading

     [ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-37604:
----------------------------
    Summary: The parameter emptyValueInRead is suggested to be designed that any fields matching this string will be set as empty values "" when reading  (was: The parameter emptyValueInRead is suggested to be designed that any fields matching this string will be set as empty values "" when reading to be)

> The parameter emptyValueInRead is suggested to be designed that any fields matching this string will be set as empty values "" when reading
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37604
>                 URL: https://issues.apache.org/jira/browse/SPARK-37604
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Wei Guo
>            Priority: Major
>
> For null values, the nullValue parameter in CSVOptions can be set for both reading and writing:
> {code:scala}
> // For writing, convert: null(dataframe) => nullValue(csv)
> // For reading, convert: nullValue or ,,(csv) => null(dataframe)
> {code}
> For example, suppose a column contains null values and nullValue is set to the string "NULL" when writing:
> {code:scala}
> Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code}
> The saved CSV file looks like this:
> {noformat}
> Tesla,NULL
> {noformat}
> If we then read this CSV file with nullValue set to the string "NULL":
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> we get a DataFrame whose data matches the original:
> ||make||comment||
> |Tesla|null|
> {color:#57d9a3}*We can successfully recover the original DataFrame.*{color}
>  
> Since Spark 2.4, CSVOptions also provides two settings for empty strings: emptyValueInRead for reading and emptyValueInWrite for writing:
> {code:scala}
> // For writing, convert: ""(dataframe) => emptyValueInWrite(csv)
> // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
> I think the read handling is not suitable: when reading, we cannot convert "" or the emptyValueInWrite value back to "" (a real empty string); instead we get the emptyValueInRead setting itself. It should work as follows:
> {code:scala}
> // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
> For example, suppose a column contains empty strings and emptyValue (emptyValueInWrite) is set to the string "EMPTY" when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
> The saved CSV file looks like this:
> {noformat}
> Tesla,EMPTY
> {noformat}
> If we then read this CSV file with emptyValue (emptyValueInRead) set to the string "EMPTY":
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> we actually get a DataFrame that looks like this:
> ||make||comment||
> |Tesla|EMPTY|
> but the expected DataFrame is:
> ||make||comment||
> |Tesla| |
> {color:#de350b}*We cannot recover the original DataFrame.*{color}
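
The difference between the current and the proposed read handling can be sketched in plain Scala, without Spark. This is a simplified model based only on the conversions described in the report above, not Spark's actual CSV parser. The object EmptyValueReadSketch and the functions write, readCurrent, and readProposed are hypothetical names introduced for illustration.

```scala
// Simplified model of the conversions described in the report.
// Assumption: this is NOT Spark's real implementation, only a sketch.
object EmptyValueReadSketch {
  // Write side: "" in the DataFrame is written as emptyValueInWrite.
  def write(value: String, emptyValueInWrite: String): String =
    if (value.isEmpty) emptyValueInWrite else value

  // Current read side, as reported: "" in the CSV becomes emptyValueInRead;
  // every other field is passed through unchanged.
  def readCurrent(field: String, emptyValueInRead: String): String =
    if (field.isEmpty) emptyValueInRead else field

  // Proposed read side: "" or a field equal to emptyValueInRead becomes "".
  def readProposed(field: String, emptyValueInRead: String): String =
    if (field.isEmpty || field == emptyValueInRead) "" else field

  def main(args: Array[String]): Unit = {
    val written = write("", "EMPTY")
    println(s"written:  '$written'")
    println(s"current:  '${readCurrent(written, "EMPTY")}'")
    println(s"proposed: '${readProposed(written, "EMPTY")}'")
  }
}
```

Under this model, an empty string written with emptyValueInWrite set to "EMPTY" is read back unchanged as "EMPTY" by the current handling, while the proposed handling restores the empty string, which is the round trip the report asks for.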


