Posted to issues@spark.apache.org by "Ivan Sadikov (Jira)" <ji...@apache.org> on 2022/10/06 05:28:00 UTC

[jira] [Commented] (SPARK-40584) Incorrect Count when reading CSV file

    [ https://issues.apache.org/jira/browse/SPARK-40584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613304#comment-17613304 ] 

Ivan Sadikov commented on SPARK-40584:
--------------------------------------

Disabling "multiLine" also fixes the issue.
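
A minimal sketch of that workaround, assuming the reporter's reader options stay unchanged except for "multiLine". The helper names reader_options and count_rows are hypothetical and used only for illustration; the path and the SparkSession are supplied by the caller.

```python
def reader_options(multiline: bool) -> dict:
    """Return the reader options from the report, toggling only multiLine."""
    return {
        "inferSchema": "true",
        "header": "false",
        "quotedstring": '"',  # copied verbatim from the report
        "escape": '"',
        "multiLine": "true" if multiline else "false",
        "delimiter": ",",
    }


def count_rows(spark, path: str, multiline: bool) -> int:
    """Count the rows Spark reads from `path` with the options above.

    `spark` is an existing SparkSession; `path` is the CSV file location.
    """
    reader = spark.read.format("com.databricks.spark.csv")
    return reader.options(**reader_options(multiline)).load(path).count()
```

Calling count_rows(spark, path, multiline=True) and count_rows(spark, path, multiline=False) on the same file makes the two counts easy to compare side by side.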

This seems to be an issue with the CSV file itself: setting "unescapedQuoteHandling" to RAISE_ERROR fails with the error below, although I did not debug it in detail.
{code:java}
Cause: com.univocity.parsers.common.TextParsingException: Unescaped quote character '"' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.
[info] Internal state when error was thrown: line=2, column=3, record=1, charIndex=121, headers=[1, , {"m": {"difference": 60}}, , , , 2022-02-12T15:40:00.783Z]
[info]   at com.univocity.parsers.csv.CsvParser.handleValueSkipping(CsvParser.java:241)
[info]   at com.univocity.parsers.csv.CsvParser.handleUnescapedQuote(CsvParser.java:319) {code}
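
As an independent sanity check, the sample rows from the report can be parsed outside Spark. The sketch below uses Python's standard csv module, which treats a doubled quote inside a quoted field as an escaped quote. This is not the univocity parser that Spark uses, so it only shows how the data reads under that convention, not how Spark interprets it. The names SAMPLE and parse_rows are made up for this example.

```python
import csv
import io

# The four sample rows from the report, copied verbatim.
SAMPLE = (
    'B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z\n'
    'B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
    'B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z\n'
    'B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
)


def parse_rows(text: str) -> list:
    """Parse CSV text, treating "" inside a quoted field as one literal quote."""
    return list(csv.reader(io.StringIO(text), quotechar='"', doublequote=True))


rows = parse_rows(SAMPLE)
print(len(rows))               # 4 records
print([len(r) for r in rows])  # 7 fields in every record
print(repr(rows[2][3]))        # the lone comma in the 4th column of the 3rd row
```

Under this convention the file yields four records of seven fields each, and the quoted comma in the third row stays inside a single field.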

> Incorrect Count when reading CSV file
> -------------------------------------
>
>                 Key: SPARK-40584
>                 URL: https://issues.apache.org/jira/browse/SPARK-40584
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Tarique Anwer
>            Priority: Major
>
> I'm trying to read the data below from a CSV file and end up with the wrong count, although the DataFrame contains all of the records. df_inputfile.count() prints 3, although it should be 4.
> {code:java}
> B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
> B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
> B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
> B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z {code}
> Here's the code:
> {code:python}
> df_inputfile = (
>     spark.read.format("com.databricks.spark.csv")
>     .option("inferSchema", "true")
>     .option("header", "false")
>     .option("quotedstring", '\"')
>     .option("escape", '\"')
>     .option("multiline", "true")
>     .option("delimiter", ",")
>     .load('<path to csv>')
> )
> print(df_inputfile.count())  # Prints 3
> print(df_inputfile.distinct().count())  # Prints 4 {code}
> Adding a cache() statement before the count results in correct output. Removing the option 'escape' also results in a correct count. 
> {noformat}
> option("escape",'\"'){noformat}
> It looks like this is happening because of the single comma in the 4th column of the 3rd row. Can someone please explain what's going on?


