Posted to issues@spark.apache.org by "Suchintak Patnaik (Jira)" <ji...@apache.org> on 2019/09/18 17:07:00 UTC

[jira] [Reopened] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

     [ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suchintak Patnaik reopened SPARK-29058:
---------------------------------------

Although the workaround of caching the DataFrame first and then calling count() works well, it is not feasible when the base dataset is large.

The DataFrame count should return the correct value after the corrupt records have been discarded.
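
The caching workaround mentioned above might look like the following minimal sketch. The helper name count_after_cache and its parameters are hypothetical and not part of the report. It assumes an existing SparkSession is passed in as spark, and it uses only the reader call from the issue description plus DataFrame.cache() and DataFrame.count().

```python
def count_after_cache(spark, path, schema):
    """Hypothetical helper: count the rows of a DROPMALFORMED CSV read after caching.

    `spark` is assumed to be an existing SparkSession; `path` and `schema`
    are passed through to the same reader call used in the report.
    """
    df = spark.read.csv(path=path, mode="DROPMALFORMED", schema=schema)
    df.cache()         # lazy: only marks the DataFrame for caching
    return df.count()  # the first action materializes the cache, then counts


# Example call (requires a running Spark session, so it is left commented out):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# schema = "Fruit string,color string,price int,quantity int"
# count_after_cache(spark, "fruit.csv", schema)
```

Caching keeps every parsed row around for later actions, which is the cost that makes this approach impractical for large inputs.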

> Reading csv file with DROPMALFORMED showing incorrect record count
> ------------------------------------------------------------------
>
>                 Key: SPARK-29058
>                 URL: https://issues.apache.org/jira/browse/SPARK-29058
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Minor
>
> The Spark SQL CSV reader drops malformed records as expected, but the reported record count is incorrect.
> Consider this file (fruit.csv):
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> The schema is defined as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as an integer, but the second row in the file contains a floating-point value, so that row is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> The malformed record is dropped as expected, but the record count displayed is incorrect.
> Here df.count() should return 2.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
