Posted to issues@spark.apache.org by "Stuart White (Jira)" <ji...@apache.org> on 2019/09/16 19:39:00 UTC

[jira] [Created] (SPARK-29101) CSV datasource returns incorrect .count() from file with malformed records

Stuart White created SPARK-29101:
------------------------------------

             Summary: CSV datasource returns incorrect .count() from file with malformed records
                 Key: SPARK-29101
                 URL: https://issues.apache.org/jira/browse/SPARK-29101
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.4
            Reporter: Stuart White


Spark 2.4 introduced a change to the way CSV files are read.  See [Upgrading From Spark SQL 2.3 to 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24] for more details.

In that document, it states: _To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false._

I have configured Spark 2.4.4 accordingly, yet I still get results inconsistent with pre-2.4 behavior.  For example:

Consider this file (fruit.csv).  Notice it contains a header record, three valid records, and one malformed record.

{noformat}
fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
{noformat}
 
With Spark 2.1.1, if I call .count() on a DataFrame created from this file (using option DROPMALFORMED), "3" is returned.

{noformat}
(using Spark 2.1.1)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
19/09/16 14:28:01 WARN CSVRelation: Dropping malformed line: xxx
res1: Long = 3
{noformat}

With Spark 2.4.4, I set "spark.sql.csv.parser.columnPruning.enabled" to false to restore the pre-2.4 handling of malformed records, then call .count(), and "4" is returned.

{noformat}
(using Spark 2.4.4)
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
res1: Long = 4
{noformat}

So, setting the *spark.sql.csv.parser.columnPruning.enabled* option to false did not actually restore the previous behavior.

How can I, using Spark 2.4+, count the records in a CSV file while excluding malformed records?
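For reference, the count I expect can be reproduced without Spark at all. The plain-Scala sketch below is not the Spark API, and it assumes (as a simplification of Spark's malformed-record detection) that a record is malformed when its comma-separated field count differs from the header's:

```scala
// Plain-Scala sketch (assumption: a record is "malformed" when its
// comma-separated field count differs from the header's).
val lines = List(
  "fruit,color,price,quantity",
  "apple,red,1,3",
  "banana,yellow,2,4",
  "orange,orange,3,5",
  "xxx"
)

// Width of the header record; split with limit -1 keeps trailing empty fields.
val headerWidth = lines.head.split(",", -1).length

// Count data records whose field count matches the header's.
val validCount = lines.tail.count(_.split(",", -1).length == headerWidth)
// validCount is 3, matching the Spark 2.1.1 DROPMALFORMED result.
```

This is the count Spark 2.1.1 returns under DROPMALFORMED, and the count I would expect Spark 2.4.4 to return with column pruning disabled.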




--
This message was sent by Atlassian Jira
(v8.3.2#803003)
