Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/07/13 01:36:20 UTC

[jira] [Commented] (SPARK-16512) No way to load CSV data without dropping whole rows when some of data is not matched with given schema

    [ https://issues.apache.org/jira/browse/SPARK-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374150#comment-15374150 ] 

Hyukjin Kwon commented on SPARK-16512:
--------------------------------------

I will work on this as soon as https://github.com/databricks/spark-csv/pull/298 is merged.

> No way to load CSV data without dropping whole rows when some of data is not matched with given schema
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16512
>                 URL: https://issues.apache.org/jira/browse/SPARK-16512
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> Currently, there is no way to read CSV data without dropping whole rows when some of the data does not match the given schema.
> There seem to be use cases such as the one below:
> {code}
> a,b
> 1,c
> {code}
> Here, {{a}} can be dirty data in real use cases.
> But the code below:
> {code}
> import org.apache.spark.sql.types._
> val path = "/tmp/test.csv"
> val schema = StructType(
>   StructField("a", IntegerType, nullable = true) ::
>   StructField("b", StringType, nullable = true) :: Nil)
> val df = spark.read
>   .format("csv")
>   .option("mode", "PERMISSIVE")
>   .schema(schema)
>   .load(path)
> df.show()
> {code}
> emits the exception below:
> {code}
> java.lang.NumberFormatException: For input string: "a"
> 	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 	at java.lang.Integer.parseInt(Integer.java:580)
> 	at java.lang.Integer.parseInt(Integer.java:615)
> 	at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 	at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> {code}
> With {{DROPMALFORMED}} and {{FAILFAST}}, the row is dropped or an exception is thrown, respectively.
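> For reference, the parse mode is selected via the {{mode}} option on the reader. A minimal sketch of the two other modes (reusing the {{schema}} and {{path}} from the snippet above; this assumes a running {{spark}} session):
> {code}
> // DROPMALFORMED: silently drops any row that fails to parse against the schema
> val dropped = spark.read
>   .format("csv")
>   .option("mode", "DROPMALFORMED")
>   .schema(schema)
>   .load(path)
> // FAILFAST: throws an exception on the first malformed row
> val strict = spark.read
>   .format("csv")
>   .option("mode", "FAILFAST")
>   .schema(schema)
>   .load(path)
> {code}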
> FYI, this is not the case for JSON, because the JSON data source can handle this with {{PERMISSIVE}} mode, as below:
> {code}
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
> val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
> spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
> {code}
> {code}
> +----+
> |   a|
> +----+
> |   1|
> |null|
> +----+
> {code}
> Please refer to https://github.com/databricks/spark-csv/pull/298
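> As a possible interim workaround (a sketch only, not part of this proposal): read every column as {{StringType}} and cast afterwards. {{Column.cast}} returns null for values it cannot convert instead of throwing, which mimics the JSON {{PERMISSIVE}} behavior shown above:
> {code}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types._
> // Read both columns as strings so nothing is dropped or rejected
> val raw = spark.read
>   .format("csv")
>   .schema(StructType(
>     StructField("a", StringType, nullable = true) ::
>     StructField("b", StringType, nullable = true) :: Nil))
>   .load("/tmp/test.csv")
> // cast() yields null for strings that are not valid integers (e.g. "a")
> val df = raw.withColumn("a", col("a").cast(IntegerType))
> {code}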



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org