Posted to issues@spark.apache.org by "F Jimenez (JIRA)" <ji...@apache.org> on 2019/06/17 09:57:00 UTC
[jira] [Created] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
F Jimenez created SPARK-28079:
---------------------------------
Summary: CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
Key: SPARK-28079
URL: https://issues.apache.org/jira/browse/SPARK-28079
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.3, 2.3.2
Reporter: F Jimenez
When reading a CSV with mode = "PERMISSIVE", corrupt records are read in without being flagged as corrupt. The only way to get them flagged is to set "columnNameOfCorruptRecord" AND manually specify a schema that includes this column. Example:
{code:java}
import java.io.File
import org.apache.commons.io.FileUtils

// The second data row has a fourth value that is not declared in the header/schema
val csvText = """
  | FieldA, FieldB, FieldC
  | a1,b1,c1
  | a2,b2,c2,d*""".stripMargin
val csvFile = new File("/tmp/file.csv")
FileUtils.write(csvFile, csvText, "UTF-8")

// Corrupt-record column is named via the option AND declared in the schema
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
  .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
reader.load(csvFile.getAbsolutePath).show(truncate = false)
{code}
This produces the correct result:
{code:java}
+------------+------+------+------+
|corrupt     |fieldA|fieldB|fieldC|
+------------+------+------+------+
|null        | a1   |b1    |c1    |
| a2,b2,c2,d*| a2   |b2    |c2    |
+------------+------+------+------+
{code}
However, removing the "schema" call and running:
{code:java}
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
reader.load(csvFile.getAbsolutePath).show(truncate = false)
{code}
Yields:
{code:java}
+-------+-------+-------+
| FieldA| FieldB| FieldC|
+-------+-------+-------+
| a1    |b1     |c1     |
| a2    |b2     |c2     |
+-------+-------+-------+
{code}
The fourth value "d*" in the second row has been silently dropped, and the row is not marked as corrupt.
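
A possible interim workaround, sketched under assumptions: read the file once only to obtain the header-derived schema, append the corrupt-record column to it, and read again with that explicit schema. The object and method names below (CorruptRecordWorkaround, readCsvFlaggingCorrupt) are hypothetical and not part of Spark, and the sketch takes a SparkSession rather than the sqlContext used above. It should behave like the explicit-schema case shown earlier, but that has not been verified against the affected versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StringType

// Hypothetical helper, not part of Spark.
object CorruptRecordWorkaround {
  def readCsvFlaggingCorrupt(
      spark: SparkSession,
      path: String,
      corruptColumn: String = "corrupt"): DataFrame = {
    // Pass 1: read only to obtain the schema derived from the header row.
    val headerSchema = spark.read
      .format("csv")
      .option("header", "true")
      .load(path)
      .schema

    // Append the corrupt-record column as a nullable string field.
    val withCorrupt = headerSchema.add(corruptColumn, StringType, nullable = true)

    // Pass 2: read again, this time with the explicit schema that
    // contains the column named by columnNameOfCorruptRecord.
    spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", corruptColumn)
      .schema(withCorrupt)
      .load(path)
  }
}
```

Usage would be along the lines of CorruptRecordWorkaround.readCsvFlaggingCorrupt(spark, csvFile.getAbsolutePath).show(truncate = false). The cost is a second pass over the file, which is why this is only a stopgap rather than a fix for the behaviour reported here.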