Posted to issues@spark.apache.org by "Stuart White (JIRA)" <ji...@apache.org> on 2019/06/17 15:09:00 UTC

[jira] [Comment Edited] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

    [ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865679#comment-16865679 ] 

Stuart White edited comment on SPARK-28058 at 6/17/19 3:08 PM:
---------------------------------------------------------------

Thank you both for your responses.
 
I now see that the [Spark SQL Upgrading Guide|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html], in the [Upgrading From Spark SQL 2.3 to 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24] section, states:

{noformat}
In version 2.3 and earlier, CSV rows are considered as malformed if at least one column 
value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or
outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed
only when it contains malformed column values requested from CSV datasource, other values
can be ignored. As an example, CSV file contains the “id,name” header and one row “1234”.
In Spark 2.4, selection of the id column consists of a row with one column value 1234 but
in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous
behavior, set spark.sql.csv.parser.columnPruning.enabled to false.
{noformat}

I had not noticed that until you called the {{spark.sql.csv.parser.columnPruning.enabled}} option to my attention.
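
For anyone else who lands here, below is a minimal sketch of restoring the 2.3 behavior from the shell (untested; it assumes Spark 2.4+, where this flag exists). With column pruning disabled, the parser validates the whole row again, so the projected reads in this ticket's description should drop the malformed record:

{noformat}
scala> // Restore pre-2.4 semantics: a row is malformed if any of its column values is malformed
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)

scala> // The single-column projection should now drop the "xxx" row as well
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
{noformat}

The same flag can also be set at launch time, e.g. {{spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false}}.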

Thanks again for the help!


> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> -----------------------------------------------------------------------
>
>                 Key: SPARK-28058
>                 URL: https://issues.apache.org/jira/browse/SPARK-28058
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1, 2.4.3
>            Reporter: Stuart White
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> The Spark SQL CSV reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice that it contains a header record, three valid records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the Spark SQL CSV reader as follows, everything looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+                                                  
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not dropped.  The malformed data is placed in the first column, and the remaining column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which columns are selected from the file.
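> As a possible workaround until this is resolved (a sketch only, not verified on 2.4.x): read with an explicit schema that includes the default corrupt-record column, cache the parsed result, and filter the malformed rows out explicitly, independent of which columns are projected afterwards.
> {noformat}
> scala> import org.apache.spark.sql.types._
> scala> // Explicit schema; "_corrupt_record" captures the raw text of malformed rows
> scala> val schema = StructType(Seq(
>      |   StructField("fruit", StringType),
>      |   StructField("color", StringType),
>      |   StructField("price", IntegerType),
>      |   StructField("quantity", IntegerType),
>      |   StructField("_corrupt_record", StringType)))
> scala> val df = spark.read.option("header", "true").schema(schema).csv("fruit.csv").cache()
> scala> // Keep only rows that parsed cleanly, then project whatever columns are needed
> scala> df.filter('_corrupt_record.isNull).select('fruit).show(truncate=false)
> {noformat}
> (The cache() is deliberate: since Spark 2.3, queries on raw CSV files whose referenced columns include only the internal corrupt record column are disallowed, and caching the parsed result is the documented way around that.  The "_corrupt_record" name is the default value of spark.sql.columnNameOfCorruptRecord.)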


