Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/07/13 01:36:20 UTC
[jira] [Commented] (SPARK-16512) No way to load CSV data without dropping whole rows when some of data is not matched with given schema
[ https://issues.apache.org/jira/browse/SPARK-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374150#comment-15374150 ]
Hyukjin Kwon commented on SPARK-16512:
--------------------------------------
I will work on this as soon as https://github.com/databricks/spark-csv/pull/298 is merged.
> No way to load CSV data without dropping whole rows when some of data is not matched with given schema
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-16512
> URL: https://issues.apache.org/jira/browse/SPARK-16512
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
>
> Currently, there is no way to read CSV data without dropping whole rows when some of the data does not match the given schema.
> It seems there are use cases such as the following:
> {code}
> a,b
> 1,c
> {code}
> Here, the value {{a}} can be dirty data in real use cases.
> But the code below:
> {code}
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
>
> // Assumes /tmp/test.csv contains the two lines shown above.
> val path = "/tmp/test.csv"
> val schema = StructType(
>   StructField("a", IntegerType, nullable = true) ::
>   StructField("b", StringType, nullable = true) :: Nil)
> val df = spark.read
>   .format("csv")
>   .option("mode", "PERMISSIVE")
>   .schema(schema)
>   .load(path)
> df.show()
> {code}
> emits the exception below:
> {code}
> java.lang.NumberFormatException: For input string: "a"
> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:580)
> at java.lang.Integer.parseInt(Integer.java:615)
> at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> {code}
> With {{DROPMALFORMED}}, such rows are dropped; with {{FAILFAST}}, reading fails with an exception.
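> For completeness, switching modes is just a different value for the {{mode}} option; a minimal sketch reusing the same {{path}} and {{schema}} as above:
> {code}
> val dropped = spark.read
>   .format("csv")
>   .option("mode", "DROPMALFORMED")
>   .schema(schema)
>   .load(path)
> dropped.show()  // the row containing "a" is silently dropped
> {code}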
> FYI, this is not the case for JSON, because the JSON data source can handle this with {{PERMISSIVE}} mode as below:
> {code}
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
> val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
> spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
> {code}
> {code}
> +----+
> | a|
> +----+
> | 1|
> |null|
> +----+
> {code}
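> The null-on-failure semantics requested for CSV can be sketched in plain Scala (an illustration of the desired cast behavior only, not the actual {{CSVTypeCast.castTo}} implementation):
> {code}
> import scala.util.Try
>
> // Permissive cast: Some(value) when the string parses as an Int, None otherwise.
> // None would map to null in the resulting column, as in the JSON output above.
> def permissiveToInt(s: String): Option[Int] = Try(s.trim.toInt).toOption
>
> permissiveToInt("1")  // Some(1)
> permissiveToInt("a")  // None
> {code}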
> Please refer to https://github.com/databricks/spark-csv/pull/298
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org