Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/08/18 13:41:00 UTC

[jira] [Resolved] (SPARK-21768) spark.csv.read Empty String Parsed as NULL when nullValue is Set

     [ https://issues.apache.org/jira/browse/SPARK-21768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-21768.
-------------------------------
    Resolution: Duplicate

> spark.csv.read Empty String Parsed as NULL when nullValue is Set
> ----------------------------------------------------------------
>
>                 Key: SPARK-21768
>                 URL: https://issues.apache.org/jira/browse/SPARK-21768
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.2.0
>         Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2)
> PySpark
>            Reporter: Andrew Gross
>
> In a CSV with quoted fields, empty strings will be interpreted as NULL even when a nullValue is explicitly set:
> Example CSV with quoted fields, delimiter {{|}}, and nullValue {{XXNULLXX}}:
> {{"XXNULLXX"|""|"XXNULLXX"|"foo"}}
> PySpark Script to load the file (from S3):
> {code:title=load.py|borderStyle=solid}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType, StructField, StructType
> spark = SparkSession.builder.appName("test_csv").getOrCreate()
> fields = []
> fields.append(StructField("First Null Field", StringType(), True))
> fields.append(StructField("Empty String Field", StringType(), True))
> fields.append(StructField("Second Null Field", StringType(), True))
> fields.append(StructField("Non Empty String Field", StringType(), True))
> schema = StructType(fields)
> keys = ['s3://mybucket/test/demo.csv']
> bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss", mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema)
> bad_data.show()
> {code}
> Actual Output:
> {noformat}
> +----------------+------------------+-----------------+----------------------+
> |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
> +----------------+------------------+-----------------+----------------------+
> |            null|              null|             null|                   foo|
> +----------------+------------------+-----------------+----------------------+
> {noformat}
> Expected Output:
> {noformat}
> +----------------+------------------+-----------------+----------------------+
> |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
> +----------------+------------------+-----------------+----------------------+
> |            null|                  |             null|                   foo|
> +----------------+------------------+-----------------+----------------------+
> {noformat}
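
For illustration only, the expected semantics from the report can be sketched with Python's standard csv module rather than Spark. This is not Spark's parser and says nothing about how Spark implements nullValue; the helper name parse_row is hypothetical. It only shows the distinction the reporter asks for: a field equal to the nullValue sentinel becomes null, while a quoted empty string stays an empty string.

```python
import csv
import io

# Sentinel and sample line taken from the report.
NULL_VALUE = "XXNULLXX"
SAMPLE = '"XXNULLXX"|""|"XXNULLXX"|"foo"\n'


def parse_row(line, sep="|", null_value=NULL_VALUE):
    """Parse one delimited line (hypothetical helper, not a Spark API).

    Only fields equal to null_value are mapped to None;
    an empty quoted field is kept as the empty string.
    """
    reader = csv.reader(io.StringIO(line), delimiter=sep, quotechar='"')
    row = next(reader)
    return [None if field == null_value else field for field in row]


row = parse_row(SAMPLE)
print(row)  # [None, '', None, 'foo']
```

The second field comes back as '' rather than None, which matches the "Expected Output" table above.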



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
