Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2019/06/18 15:49:00 UTC

[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

    [ https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866777#comment-16866777 ] 

Liang-Chi Hsieh commented on SPARK-28079:
-----------------------------------------

Isn't this the expected behavior, as documented for the {{PERMISSIVE}} mode of the CSV reader?
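
In {{PERMISSIVE}} mode the malformed input is only kept when a string field named by {{columnNameOfCorruptRecord}} is present in the user-specified schema; with an inferred schema the extra tokens are dropped. A minimal sketch of a workaround, assuming a {{spark}} session and the {{/tmp/file.csv}} from the report, is to infer the schema in a first pass and then append the corrupt column:

{code:java}
import org.apache.spark.sql.types.StringType

// First pass: infer the data columns from the file itself.
val inferred = spark.read
  .option("header", "true")
  .csv("/tmp/file.csv")
  .schema

// Second pass: append the corrupt-record column and re-read, so that
// malformed rows are captured instead of having extra tokens dropped.
val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
  .schema(inferred.add("corrupt", StringType))
  .csv("/tmp/file.csv")
{code}

This avoids spelling out every data column by hand, at the cost of reading the file twice.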

> CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28079
>                 URL: https://issues.apache.org/jira/browse/SPARK-28079
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.3
>            Reporter: F Jimenez
>            Priority: Major
>
> When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged as such; they are silently read in. The only way to get them flagged is to manually set "columnNameOfCorruptRecord" AND to manually set a schema that includes this column. Example:
> {code:java}
> import java.io.File
> import org.apache.commons.io.FileUtils
>
> // The second data row has a fourth column that is not declared in the header/schema
> val csvText = s"""
>                  | FieldA, FieldB, FieldC
>                  | a1,b1,c1
>                  | a2,b2,c2,d*""".stripMargin
> val csvFile = new File("/tmp/file.csv")
> FileUtils.write(csvFile, csvText)
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
>   .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> This produces the correct result:
> {code:java}
> +------------+------+------+------+
> |corrupt     |fieldA|fieldB|fieldC|
> +------------+------+------+------+
> |null        | a1   |b1    |c1    |
> | a2,b2,c2,d*| a2   |b2    |c2    |
> +------------+------+------+------+
> {code}
> However, removing the explicit schema and using:
> {code:java}
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> Yields:
> {code:java}
> +-------+-------+-------+
> | FieldA| FieldB| FieldC|
> +-------+-------+-------+
> | a1    |b1     |c1     |
> | a2    |b2     |c2     |
> +-------+-------+-------+
> {code}
> The fourth value "d*" in the second row has been dropped, and the row has not been marked as corrupt.


