Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/07/13 01:20:20 UTC

[jira] [Created] (SPARK-16512) No way to load CSV data without dropping whole rows when some of data is not matched with given schema

Hyukjin Kwon created SPARK-16512:
------------------------------------

             Summary: No way to load CSV data without dropping whole rows when some of data is not matched with given schema
                 Key: SPARK-16512
                 URL: https://issues.apache.org/jira/browse/SPARK-16512
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon
            Priority: Minor


Currently, there is no way to read CSV data without dropping whole rows when some of the data does not match the given schema.

Consider the data below:

{code}
a,b
1,c
{code}

Here, {{a}} in the first column is dirty data (the column is expected to hold integers).

However, the code below:

{code}
val path = testFile(carsFile)  // a file holding the CSV data above
val schema = StructType(
  StructField("a", IntegerType, nullable = true) ::
  StructField("b", StringType, nullable = true) :: Nil)
val df = spark.read
  .format("csv")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load(path)
df.show()
{code}

emits the exception below:

{code}
java.lang.NumberFormatException: For input string: "a"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.parseInt(Integer.java:615)
	at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
	at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}

With {{DROPMALFORMED}} and {{FAILFAST}}, the malformed row will be dropped or the read will fail with an exception, respectively.
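For reference, the per-row semantics of the three modes can be sketched in plain Scala (no Spark dependency; all names here are illustrative, not Spark's actual internals):

```scala
// Hypothetical sketch of CSV parse-mode semantics (not Spark's real code).
// PERMISSIVE should keep the row, DROPMALFORMED drops it, FAILFAST throws.
object ParseModes {
  sealed trait Mode
  case object Permissive    extends Mode
  case object DropMalformed extends Mode
  case object FailFast      extends Mode

  // Try to cast one field to Int; None signals a malformed value.
  def castToInt(s: String): Option[Int] =
    try Some(s.toInt) catch { case _: NumberFormatException => None }

  // Parses rows of (a, b) against schema (Int, String); returns surviving rows.
  def parse(rows: Seq[(String, String)], mode: Mode): Seq[(Option[Int], String)] =
    rows.flatMap { case (a, b) =>
      castToInt(a) match {
        case Some(i) => Seq((Some(i), b))
        case None => mode match {
          case Permissive    => Seq((None, b)) // keep the row, null out the bad field
          case DropMalformed => Seq.empty      // drop the whole row
          case FailFast      => throw new NumberFormatException(s"""For input string: "$a"""")
        }
      }
    }
}
```

On the sample data, {{Permissive}} would keep both rows (with a null first field for the dirty one), while {{DropMalformed}} keeps only the clean row.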

FYI, this is not the case for JSON, because the JSON data source handles this under {{PERMISSIVE}} mode, as below:

{code}
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
{code}

{code}
+----+
|   a|
+----+
|   1|
|null|
+----+
{code}
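One possible direction (a sketch only, not Spark's actual implementation; {{safeCastToInt}} is a hypothetical name) is for the per-field cast to return null instead of throwing when the mode is {{PERMISSIVE}}, mirroring the JSON behavior above:

```scala
// Hypothetical variant of a per-field cast helper: under PERMISSIVE,
// a value that fails to cast becomes null instead of aborting the read.
def safeCastToInt(datum: String, permissive: Boolean): Any =
  try datum.toInt
  catch {
    case e: NumberFormatException =>
      if (permissive) null else throw e
  }
```

With such a change, the CSV example above would produce a row with a null {{a}} field rather than throwing {{NumberFormatException}}.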

Please refer to https://github.com/databricks/spark-csv/pull/298



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org