Posted to issues@spark.apache.org by "F Jimenez (JIRA)" <ji...@apache.org> on 2019/06/17 09:57:00 UTC

[jira] [Created] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

F Jimenez created SPARK-28079:
---------------------------------

             Summary: CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
                 Key: SPARK-28079
                 URL: https://issues.apache.org/jira/browse/SPARK-28079
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.3, 2.3.2
            Reporter: F Jimenez


When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged as such; they are read in as if they were valid. The only way to get them flagged is to set "columnNameOfCorruptRecord" AND manually specify a schema that includes this column. Example:
{code:java}
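import java.io.File
import org.apache.commons.io.FileUtils

// The second data row has a 4th value that is not declared in the header/schema
val csvText = """
                | FieldA, FieldB, FieldC
                | a1,b1,c1
                | a2,b2,c2,d*""".stripMargin

// Write the sample CSV to disk so that Spark can load it
val csvFile = new File("/tmp/file.csv")
FileUtils.write(csvFile, csvText)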

val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
  .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")

reader.load(csvFile.getAbsolutePath).show(truncate = false)
{code}
This produces the correct result:
{code:java}
+------------+------+------+------+
|corrupt     |fieldA|fieldB|fieldC|
+------------+------+------+------+
|null        | a1   |b1    |c1    |
| a2,b2,c2,d*| a2   |b2    |c2    |
+------------+------+------+------+
{code}
However, removing the explicit schema (the ".schema(...)" call) and running:
{code:java}
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")

reader.load(csvFile.getAbsolutePath).show(truncate = false)
{code}
Yields:
{code:java}
+-------+-------+-------+
| FieldA| FieldB| FieldC|
+-------+-------+-------+
| a1    |b1     |c1     |
| a2    |b2     |c2     |
+-------+-------+-------+
{code}
The fourth value "d*" in the second row has been silently dropped, and the row is not marked as corrupt.
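
A possible workaround, sketched here under assumptions rather than taken from the report: read the file once to pick up the column names from the header, append the corrupt-record column to that schema, then read again with the extended schema. The object name CorruptRecordSketch, the helper readWithCorruptColumn, the local SparkSession and the temporary file are all hypothetical and only for illustration; the column name "corrupt" and the reader options are the ones used above. The header here is written without the leading spaces of the original sample to keep the column names simple.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StringType

object CorruptRecordSketch {

  // Hypothetical helper: two-pass read that adds the corrupt-record column
  // to the schema Spark derives from the header.
  def readWithCorruptColumn(spark: SparkSession, path: String): DataFrame = {
    // Pass 1: let Spark derive the column names from the header row.
    val inferred = spark.read
      .format("csv")
      .option("header", "true")
      .load(path)
      .schema

    // Append the corrupt-record column (named "corrupt", as in the report).
    val withCorrupt = inferred.add("corrupt", StringType)

    // Pass 2: read again with the extended schema so that malformed rows
    // have somewhere to be recorded.
    spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "corrupt")
      .schema(withCorrupt)
      .load(path)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the report uses an existing sqlContext.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("corrupt-record-sketch")
      .getOrCreate()

    // Same shape as the report's file: the last row carries an undeclared 4th value.
    val csvText = "FieldA,FieldB,FieldC\na1,b1,c1\na2,b2,c2,d*\n"
    val path = Files.createTempFile("spark-28079-", ".csv")
    Files.write(path, csvText.getBytes(StandardCharsets.UTF_8))

    readWithCorruptColumn(spark, path.toString).show(truncate = false)
    spark.stop()
  }
}
```

This costs an extra pass over the file and still requires the caller to know about the corrupt-record column, so it sidesteps the problem rather than fixing it.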