Posted to issues@spark.apache.org by "Marcel Boldt (JIRA)" <ji...@apache.org> on 2016/07/09 12:28:11 UTC
[jira] [Updated] (SPARK-16460) Spark 2.0 CSV ignores NULL value in Date format
[ https://issues.apache.org/jira/browse/SPARK-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcel Boldt updated SPARK-16460:
---------------------------------
Component/s: SQL
> Spark 2.0 CSV ignores NULL value in Date format
> -----------------------------------------------
>
> Key: SPARK-16460
> URL: https://issues.apache.org/jira/browse/SPARK-16460
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Environment: SparkR
> Reporter: Marcel Boldt
> Priority: Critical
>
> Trying to read a CSV file into Spark (using SparkR) containing just this data row:
> {code}
> 1|1998-01-01||
> {code}
> Using Spark 1.6.2 (Hadoop 2.6) gives me
> {code}
> > head(sdf)
>   id          d dtwo
> 1  1 1998-01-01   NA
> {code}
> Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with error:
> {panel}
> > Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
> at java.text.DateFormat.parse(DateFormat.java:357)
> at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
> at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
> at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
> at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
> at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Itera...
> {panel}
> The problem does indeed seem to be the NULL value: with a valid date in the third CSV column, reading works.
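> For illustration, a minimal Python sketch (not Spark's actual Scala code; the helper name `cast_to_date` is hypothetical) of the null-aware cast the reader presumably should perform — checking the configured nullValue before handing the token to the date parser, which is exactly the step the stack trace suggests is missing in CSVTypeCast.castTo:

```python
from datetime import datetime, date

def cast_to_date(token, fmt="%Y-%m-%d", null_value=""):
    """Null-aware date cast: check the nullValue before parsing.

    Passing "" straight to the date parser raises an error, analogous
    to the java.text.ParseException: Unparseable date: "" above.
    """
    if token == null_value:
        return None  # empty field becomes NULL instead of a parse error
    return datetime.strptime(token, fmt).date()

# The failing row, split on the "|" delimiter:
fields = "1|1998-01-01|".split("|")  # ['1', '1998-01-01', '']
row = [int(fields[0]), cast_to_date(fields[1]), cast_to_date(fields[2])]
print(row)  # [1, datetime.date(1998, 1, 1), None]
```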
> R code:
> {code}
> #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6')
> Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
>
> sc <- sparkR.init(
>   master = "local",
>   sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
> )
> sqlContext <- sparkRSQL.init(sc)
>
>
> st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))
>
> sdf <- read.df(
>   sqlContext,
>   path = "d:/date_test.csv",
>   source = "com.databricks.spark.csv",
>   schema = st,
>   inferSchema = "false",
>   delimiter = "|",
>   dateFormat = "yyyy-MM-dd",
>   nullValue = "",
>   mode = "PERMISSIVE"
> )
>
> head(sdf)
>
> sparkR.stop()
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org