Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2019/06/18 15:49:00 UTC
[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
[ https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866777#comment-16866777 ]
Liang-Chi Hsieh commented on SPARK-28079:
-----------------------------------------
Isn't it the expected behavior as documented in {{PERMISSIVE}} mode of CSV?
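For reference, a minimal sketch of what the PERMISSIVE documentation describes, in plain Scala rather than Spark's actual parser (the object name {{PermissiveSketch}} is hypothetical): a record whose token count does not match the declared column count keeps its parseable fields, and the raw input text is copied into the corrupt-record column, which exists only if the user's schema declares it.

```scala
// Hedged sketch (plain Scala, not Spark internals) of the documented
// PERMISSIVE behavior: a malformed record keeps its parseable fields and
// the raw line is reported alongside them. Missing fields are padded with
// "" here purely for simplicity (Spark itself uses null).
object PermissiveSketch {
  // Returns (corrupt, fields): corrupt is Some(rawLine) when the token
  // count does not match the declared column count, None otherwise.
  def parse(line: String, numCols: Int): (Option[String], Seq[String]) = {
    val tokens = line.split(",", -1).map(_.trim).toSeq
    val corrupt = if (tokens.length == numCols) None else Some(line)
    (corrupt, tokens.take(numCols).padTo(numCols, ""))
  }
}
```

With the reporter's data, {{parse("a2,b2,c2,d*", 3)}} yields {{Some("a2,b2,c2,d*")}} for the corrupt field while still returning the first three tokens.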
> CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
> -----------------------------------------------------------------------------------------------------
>
> Key: SPARK-28079
> URL: https://issues.apache.org/jira/browse/SPARK-28079
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2, 2.4.3
> Reporter: F Jimenez
> Priority: Major
>
> When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged as such and are silently read in. The only way to get them flagged is to manually set "columnNameOfCorruptRecord" AND manually specify a schema that includes this column. Example:
> {code:java}
> // Second row has a 4th column that is not declared in the header/schema
> val csvText = s"""
> | FieldA, FieldB, FieldC
> | a1,b1,c1
> | a2,b2,c2,d*""".stripMargin
> val csvFile = new File("/tmp/file.csv")
> FileUtils.write(csvFile, csvText)
> val reader = sqlContext.read
> .format("csv")
> .option("header", "true")
> .option("mode", "PERMISSIVE")
> .option("columnNameOfCorruptRecord", "corrupt")
> .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> This produces the correct result:
> {code:java}
> +------------+------+------+------+
> |corrupt     |fieldA|fieldB|fieldC|
> +------------+------+------+------+
> |null        | a1   |b1    |c1    |
> | a2,b2,c2,d*| a2   |b2    |c2    |
> +------------+------+------+------+
> {code}
> However, removing the explicit schema and going with:
> {code:java}
> val reader = sqlContext.read
> .format("csv")
> .option("header", "true")
> .option("mode", "PERMISSIVE")
> .option("columnNameOfCorruptRecord", "corrupt")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> Yields:
> {code:java}
> +-------+-------+-------+
> | FieldA| FieldB| FieldC|
> +-------+-------+-------+
> | a1    |b1     |c1     |
> | a2    |b2     |c2     |
> +-------+-------+-------+
> {code}
> The fourth value "d*" in the second row has been dropped, and the row has not been marked as corrupt.
>
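The mismatch the reporter describes can be checked outside Spark with a simple width comparison (plain Scala, no Spark dependency; {{CorruptRowCheck}} is a hypothetical helper, not part of any Spark API): any row with more tokens than the header is exactly the kind of row PERMISSIVE mode should copy into "columnNameOfCorruptRecord".

```scala
// Minimal sketch: flag CSV rows whose token count differs from the
// header's. split(",", -1) keeps trailing empty tokens so short rows
// are counted accurately.
object CorruptRowCheck {
  def corruptRows(csv: Seq[String]): Seq[String] = {
    val width = csv.head.split(",", -1).length
    csv.tail.filter(_.split(",", -1).length != width)
  }
}
```

For the data above, {{CorruptRowCheck.corruptRows(Seq("FieldA,FieldB,FieldC", "a1,b1,c1", "a2,b2,c2,d*"))}} returns only the over-wide line "a2,b2,c2,d*".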
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)