Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/07/13 01:20:20 UTC
[jira] [Created] (SPARK-16512) No way to load CSV data without dropping whole rows when some of the data does not match the given schema
Hyukjin Kwon created SPARK-16512:
------------------------------------
Summary: No way to load CSV data without dropping whole rows when some of the data does not match the given schema
Key: SPARK-16512
URL: https://issues.apache.org/jira/browse/SPARK-16512
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor
Currently, there is no way to read CSV data without dropping whole rows when some of the data does not match the given schema.
There seem to be use cases such as the one below:
{code}
a,b
1,c
{code}
Here, the value {{a}} can be dirty data.
But the code below:
{code}
val path = testFile(carsFile)
val schema = StructType(
  StructField("a", IntegerType, nullable = true) ::
  StructField("b", StringType, nullable = true) :: Nil)
val df = spark.read
  .format("csv")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load(path)
df.show()
{code}
emits the exception below:
{code}
java.lang.NumberFormatException: For input string: "a"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}
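The failure comes from the unconditional {{toInt}} call on the raw field in {{CSVTypeCast.castTo}}. A minimal reproduction in plain Scala (no Spark required), showing that the cast itself is what throws:

{code}
import scala.util.{Try, Success, Failure}

// The CSV cast path effectively does raw.toInt, which throws
// NumberFormatException for non-numeric input like "a".
val raw = "a"
Try(raw.toInt) match {
  case Success(n) => println(s"parsed: $n")
  case Failure(e) => println(s"failed with: ${e.getClass.getSimpleName}")
}
{code}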
With {{DROPMALFORMED}} the row is dropped, and with {{FAILFAST}} reading fails with an exception.
FYI, this is not the case for JSON, because the JSON data source can handle this with {{PERMISSIVE}} mode, as below:
{code}
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
{code}
{code}
+----+
| a|
+----+
| 1|
|null|
+----+
{code}
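One way {{PERMISSIVE}} mode could behave for CSV is to cast each field individually and fall back to {{null}} on failure, rather than throwing for the whole row. A rough sketch of that per-field cast in plain Scala (an illustration of the desired behavior, not Spark's actual implementation; {{permissiveCastToInt}} is a hypothetical helper):

{code}
import scala.util.Try

// Hypothetical permissive cast: return the parsed value boxed, or null on failure,
// mirroring what the JSON source does for unparseable fields.
def permissiveCastToInt(field: String): java.lang.Integer =
  Try(field.toInt).toOption.map(Int.box).orNull

val rows = Seq(Seq("a", "b"), Seq("1", "c"))
val cast = rows.map { case Seq(a, b) => (permissiveCastToInt(a), b) }
// cast == Seq((null, "b"), (1, "c"))
{code}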
Please refer to https://github.com/databricks/spark-csv/pull/298
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org