Posted to issues@spark.apache.org by "Marcel Boldt (JIRA)" <ji...@apache.org> on 2016/07/09 12:08:11 UTC

[jira] [Created] (SPARK-16460) Spark 2.0 CSV ignores NULL value in Date format

Marcel Boldt created SPARK-16460:
------------------------------------

             Summary: Spark 2.0 CSV ignores NULL value in Date format
                 Key: SPARK-16460
                 URL: https://issues.apache.org/jira/browse/SPARK-16460
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.0.0
         Environment: SparkR
            Reporter: Marcel Boldt
            Priority: Critical


Trying to read into Spark (using SparkR) a CSV file containing just this data row:

{code}
    1|1998-01-01||
{code}

Using Spark 1.6.2 (Hadoop 2.6) gives me 

{code}
    > head(sdf)
      id          d dtwo
    1  1 1998-01-01   NA
{code}

Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with error: 

{panel}
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
	at java.text.DateFormat.parse(DateFormat.java:357)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
	at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
	at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Itera...
{panel}

The problem does indeed seem to be the NULL value: with a valid date in the third CSV column, the read works.
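The stack trace shows java.text.DateFormat.parse being invoked on the empty field before any nullValue handling. A minimal Java sketch, outside Spark, reproduces exactly that failure mode:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class EmptyDateParse {
    public static void main(String[] args) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        try {
            fmt.parse("1998-01-01");  // a valid date parses fine
            fmt.parse("");            // an empty field throws ParseException
            System.out.println("no exception");
        } catch (ParseException e) {
            // Same message as in the Spark stack trace above
            System.out.println(e.getMessage());
        }
    }
}
```

This matches the `Unparseable date: ""` message in the stage failure, suggesting the empty string reaches the date cast instead of being mapped to NULL first.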

R code:
{code}
    #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6') 
    Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    
    sc <-
        sparkR.init(
            master = "local",
            sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
        )
    sqlContext <- sparkRSQL.init(sc)
    
    
    st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))
    
    sdf <- read.df(
        sqlContext,
        path = "d:/date_test.csv",
        source = "com.databricks.spark.csv",
        schema = st,
        inferSchema = "false",
        delimiter = "|",
        dateFormat = "yyyy-MM-dd",
        nullValue = "",
        mode = "PERMISSIVE"
    )
    
    head(sdf)
    
    sparkR.stop()
{code}
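For illustration, a guard of roughly this shape in the cast path would make an empty field yield NULL instead of throwing. This is only a hypothetical sketch of the expected behavior; castToDate and its signature are made-up names for this example, not Spark's actual internal API:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class NullSafeDateCast {
    // Hypothetical helper: return null when the raw field equals the
    // configured nullValue (here ""), otherwise parse it as a date.
    static Date castToDate(String datum, String nullValue, SimpleDateFormat fmt)
            throws ParseException {
        if (datum == null || datum.equals(nullValue)) {
            return null;  // treat the nullValue sentinel as SQL NULL
        }
        return fmt.parse(datum);
    }

    public static void main(String[] args) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(castToDate("1998-01-01", "", fmt)); // a parsed date
        System.out.println(castToDate("", "", fmt));           // null, no exception
    }
}
```

With a check like this before the parse, the empty third column in the sample row would come back as NA in SparkR, matching the 1.6.2 behavior shown above.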



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org