Posted to issues@spark.apache.org by "Marcel Boldt (JIRA)" <ji...@apache.org> on 2016/07/09 12:28:11 UTC
[jira] [Updated] (SPARK-16460) Spark 2.0 CSV ignores NULL value in Date format
[ https://issues.apache.org/jira/browse/SPARK-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcel Boldt updated SPARK-16460:
---------------------------------
Component/s: SQL
> Spark 2.0 CSV ignores NULL value in Date format
> -----------------------------------------------
>
> Key: SPARK-16460
> URL: https://issues.apache.org/jira/browse/SPARK-16460
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Environment: SparkR
> Reporter: Marcel Boldt
> Priority: Critical
>
> Trying to read a CSV file into Spark (using SparkR) containing just this data row:
> {code}
> 1|1998-01-01||
> {code}
> Using Spark 1.6.2 (Hadoop 2.6) gives me
> {code}
> > head(sdf)
>   id          d dtwo
> 1  1 1998-01-01   NA
> {code}
> Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with error:
> {panel}
> > Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
> at java.text.DateFormat.parse(DateFormat.java:357)
> at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
> at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
> at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
> at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
> at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Itera...
> {panel}
> The problem does indeed seem to be the NULL value: with a valid date in the third CSV column, reading works.
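> For illustration, a minimal Python sketch (not Spark's actual Scala code; the helper name `cast_to_date` is hypothetical) of the null-aware cast the reader presumably should perform — checking the configured nullValue before handing the token to the date parser, which is exactly the step the stack trace suggests is missing in CSVTypeCast.castTo:

```python
from datetime import datetime, date

def cast_to_date(token, fmt="%Y-%m-%d", null_value=""):
    """Null-aware date cast: check the nullValue before parsing.

    Passing "" straight to the date parser raises an error, analogous
    to the java.text.ParseException: Unparseable date: "" above.
    """
    if token == null_value:
        return None  # empty field becomes NULL instead of a parse error
    return datetime.strptime(token, fmt).date()

# The failing row, split on the "|" delimiter:
fields = "1|1998-01-01|".split("|")  # ['1', '1998-01-01', '']
row = [int(fields[0]), cast_to_date(fields[1]), cast_to_date(fields[2])]
print(row)  # [1, datetime.date(1998, 1, 1), None]
```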
> R code:
> {code}
> #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6')
> Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
>
> sc <- sparkR.init(
>   master = "local",
>   sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
> )
> sqlContext <- sparkRSQL.init(sc)
>
>
> st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))
>
> sdf <- read.df(
>   sqlContext,
>   path = "d:/date_test.csv",
>   source = "com.databricks.spark.csv",
>   schema = st,
>   inferSchema = "false",
>   delimiter = "|",
>   dateFormat = "yyyy-MM-dd",
>   nullValue = "",
>   mode = "PERMISSIVE"
> )
>
> head(sdf)
>
> sparkR.stop()
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org