Posted to issues@spark.apache.org by "Marnix van den Broek (Jira)" <ji...@apache.org> on 2022/02/11 22:09:00 UTC

[jira] [Commented] (SPARK-38167) CSV parsing error when using escape='"'

    [ https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491170#comment-17491170 ] 

Marnix van den Broek commented on SPARK-38167:
----------------------------------------------

After getting some help from the community navigating the Spark codebase and testing the same example against the univocity CSV parser directly, I can confirm this is actually a bug in the univocity parser.

I filed a bug report with them and will update this issue with the status as soon as I know more.   
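
For anyone triaging, here is a minimal end-to-end reproduction sketch (the local-mode SparkSession and temp-file handling are illustrative additions for convenience; the reader options are exactly those from the report):

{code:java}
# Reproduction sketch for SPARK-38167. Assumes PySpark is installed;
# the local[1] master and temp file are incidental to the bug itself.
import os
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("spark-38167").getOrCreate()

path = os.path.join(tempfile.mkdtemp(), "repro.csv")
with open(path, "w") as f:
    f.write('col1,col2\n"",",a"\n')

df = spark.read.csv(path, escape='"', header=True)
df.show()                 # per the report: |null|  ,a|  (correct)
df.select('col2').show()  # per the report: |  a"|      (incorrect)
{code}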

> CSV parsing error when using escape='"' 
> ----------------------------------------
>
>                 Key: SPARK-38167
>                 URL: https://issues.apache.org/jira/browse/SPARK-38167
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.2.1
>         Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 cluster.
>            Reporter: Marnix van den Broek
>            Priority: Major
>              Labels: correctness, csv, csvparser, data-integrity
>
> Hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> *The summary*:
> When
>  # reading a comma-separated, double-quote-quoted CSV file using the CSV reader options _escape='"'_ and _header=True_,
>  # with a row containing a quoted empty field,
>  # followed by a quoted field that starts with a comma followed by one or more characters,
> selecting columns from the dataframe at or after the field described in 3) gives incorrect and inconsistent results.
> *In detail*:
> When I instruct Spark to read this CSV file:
>  
> {code:java}
> col1,col2
> "",",a"
> {code}
>  
> using the CSV reader options _escape='"'_ (unnecessary for this minimal example, but necessary for the files I'm processing) and _header=True_, I expect the following result:
>  
> {code:java}
> spark.read.csv(path, escape='"', header=True).show()
>  
> +----+----+
> |col1|col2|
> +----+----+
> |null|  ,a|
> +----+----+   {code}
>  
> Spark does yield this result; so far, so good. However, when I select _col2_ from the dataframe, Spark yields an incorrect result:
>  
> {code:java}
> spark.read.csv(path, escape='"', header=True).select('col2').show()
>  
> +----+
> |col2|
> +----+
> |  a"|
> +----+{code}
>  
> If you run this example with more columns in the file and more commas in the field, e.g. ",,,,,,,a", the problem compounds: Spark shifts many values to the right, causing unexpected and incorrect results. The inconsistency between the two access methods surprised me, as it implies the parsing is evaluated differently for each. A variant with an extra column is sketched below.
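> For illustration, the same reproduction with one extra column and several leading commas in the quoted field (the exact mis-parsed output varies with the input, so the comments only describe the tendency):
> {code:java}
> # Same reader options as above; only the file contents change.
> with open(path, "w") as f:
>     f.write('col1,col2,col3\n"",",,,a","b"\n')
>
> df = spark.read.csv(path, escape='"', header=True)
> df.show()                         # full-row parse
> df.select('col2', 'col3').show()  # selected values shift right and pick up stray quotes
> {code}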
> I expect the bug to be located in the quote-balancing and un-escaping logic of the CSV parser, but I can't find where that code lives in the code base. I'd be happy to take a look if anyone can point me to it.


