Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2019/08/03 02:45:00 UTC
[jira] [Updated] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-28058:
----------------------------------
Labels: (was: CSV csv csvparser)
> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> -----------------------------------------------------------------------
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.1, 2.4.3
> Reporter: Stuart White
> Assignee: Liang-Chi Hsieh
> Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> The Spark SQL CSV reader is not dropping malformed records as expected.
> Consider this file (fruit.csv). It contains a header record, three valid records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the Spark SQL CSV reader as follows, everything looks good: the malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not dropped. The malformed data is placed in the first column, and any remaining selected columns are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which columns are being selected from the file.
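A possible workaround, sketched under the assumption that the behavior described above comes from CSV parser column pruning (Spark 2.4 parses only the columns a query requires, so a short record may not be recognized as malformed when few columns are selected). The setting `spark.sql.csv.parser.columnPruning.enabled` is a Spark SQL configuration in 2.4; the object name, the helper `readFruitColumn`, and the temporary file are hypothetical and exist only for illustration. This has not been verified against the exact versions listed in the issue.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical helper object; not part of Spark or of the original report.
object DropMalformedWithoutPruning {

  // Writes the fruit.csv content from the report to a temp file, then reads
  // only the `fruit` column in DROPMALFORMED mode with column pruning disabled.
  def readFruitColumn(spark: SparkSession): Seq[String] = {
    val csv =
      """fruit,color,price,quantity
        |apple,red,1,3
        |banana,yellow,2,4
        |orange,orange,3,5
        |xxx
        |""".stripMargin
    val path = Files.createTempFile("fruit", ".csv")
    Files.write(path, csv.getBytes("UTF-8"))

    // Assumption: with pruning disabled the parser tokenizes every column of
    // each record, so the one-token record "xxx" is seen as malformed even
    // though the query selects a single column.
    spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

    spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv(path.toString)
      .select("fruit")
      .collect()
      .map(_.getString(0))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-28058 sketch")
      .getOrCreate()
    try readFruitColumn(spark).foreach(println)
    finally spark.stop()
  }
}
```

Disabling pruning trades some parsing speed for consistent malformed-record detection, so it may be preferable to set it only for reads where DROPMALFORMED semantics matter.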