Posted to issues@spark.apache.org by "Marcel Boldt (JIRA)" <ji...@apache.org> on 2016/07/09 12:08:11 UTC
[jira] [Created] (SPARK-16460) Spark 2.0 CSV ignores NULL value in Date format
Marcel Boldt created SPARK-16460:
------------------------------------
Summary: Spark 2.0 CSV ignores NULL value in Date format
Key: SPARK-16460
URL: https://issues.apache.org/jira/browse/SPARK-16460
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.0.0
Environment: SparkR
Reporter: Marcel Boldt
Priority: Critical
I am trying to read a CSV file into Spark (using SparkR) that contains just this data row:
{code}
1|1998-01-01||
{code}
Using Spark 1.6.2 (Hadoop 2.6) gives me
{code}
> head(sdf)
  id          d dtwo
1  1 1998-01-01   NA
{code}
Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with error:
{panel}
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
at java.text.DateFormat.parse(DateFormat.java:357)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Itera...
{panel}
The problem does seem to be the NULL value: with a valid date in the third CSV column, the read works.
R code:
{code}
#Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6')
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <-
  sparkR.init(
    master = "local",
    sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
  )
sqlContext <- sparkRSQL.init(sc)
st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))
sdf <- read.df(
  sqlContext,
  path = "d:/date_test.csv",
  source = "com.databricks.spark.csv",
  schema = st,
  inferSchema = "false",
  delimiter = "|",
  dateFormat = "yyyy-MM-dd",
  nullValue = "",
  mode = "PERMISSIVE"
)
head(sdf)
sparkR.stop()
{code}
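For illustration only, here is a minimal base R sketch of the behaviour I would expect. The stack trace suggests the empty field is handed straight to DateFormat.parse; the sketch instead compares the field against the configured null marker first and parses it as a date only if it does not match. This is not Spark's actual code: the helper name cast_date_field and its parameters are hypothetical, and the R format string "%Y-%m-%d" stands in for the dateFormat "yyyy-MM-dd" used above.

```r
# Hypothetical helper (not part of SparkR or Spark): cast one CSV field
# to a Date, checking the null marker before attempting to parse.
cast_date_field <- function(field, date_format = "%Y-%m-%d", null_value = "") {
  if (is.na(field) || field == null_value) {
    # The field matches the null marker: return a missing Date, do not parse.
    return(as.Date(NA))
  }
  as.Date(field, format = date_format)
}

# The three schema columns of the data row from this report, as raw strings.
fields <- c(id = "1", d = "1998-01-01", dtwo = "")

d    <- cast_date_field(fields[["d"]])
dtwo <- cast_date_field(fields[["dtwo"]])

print(d)
print(dtwo)
```

With this ordering the empty third column becomes a missing date, which matches the NA that Spark 1.6.2 reports for dtwo.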