Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/10/10 19:37:00 UTC

[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180

    [ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199245#comment-16199245 ] 

Sean Owen commented on SPARK-22236:
-----------------------------------

Interesting, because the Univocity parser itself seems to default to RFC 4180 settings, but Spark's CSV options override that with a default escape character of {{\}}. [~hyukjin.kwon] was that for backwards compatibility with previous implementations? In any event, I'm not sure we'd change the default behavior before Spark 3.x, but you can easily configure the writer to use a double quote as the escape character.
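
To make the difference between the two conventions concrete without a Spark session, here is a minimal sketch using only Python's standard {{csv}} module. It serializes the same row both ways. The backslash variant is only a stand-in for Spark's current default, not Spark's actual writer, and the helper name {{write_row}} and the use of {{QUOTE_ALL}} are assumptions made for illustration. On the Spark side, the writer setting would presumably mirror the reader option shown in the issue, i.e. {{option('escape', '"')}}.

```python
import csv
import io


def write_row(row, **fmt):
    """Serialize one row to a single CSV line using the given csv formatting options."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerow(row)
    return buf.getvalue().rstrip("\n")


row = ['a 2.5" drive', 'another column']

# RFC 4180 convention: an embedded double quote is escaped by doubling it.
# This is the default behaviour of Python's csv writer.
rfc4180_line = write_row(row)

# Backslash convention, standing in for Spark's current default escape character.
# QUOTE_ALL is an illustrative choice so that every field is wrapped in quotes.
backslash_line = write_row(
    row, quoting=csv.QUOTE_ALL, doublequote=False, escapechar="\\"
)

print(rfc4180_line)
print(backslash_line)
```

The two lines differ only in how the embedded quote is escaped, which is why a reader configured for one convention misparses files written with the other.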

> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with a backslash by default. However, the appropriate behaviour as set out by RFC 4180 (and adhered to by many software packages) is to escape using a second double quote.
> This piece of Python code demonstrates the issue:
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1,2])
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-....csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> While it's possible to work with CSV files in a "compatible" manner, it would be useful if Spark had sensible defaults that conform to the above-mentioned RFC (as well as W3C recommendations). I realise this would be a breaking change and thus, if accepted, it would probably need to start with a warning before moving to a new default.
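
Until the default changes, a file written with backslash escaping can still be consumed from Python by declaring that convention explicitly. The sketch below is a stdlib-only illustration. The sample line is hypothetical, reconstructed from what the garbled reader output in the issue suggests Spark wrote, so treat it as an assumption rather than captured Spark output.

```python
import csv

# Hypothetical sample: one line escaped with a backslash, as the issue's
# garbled output suggests Spark's default writer produces.
spark_style_lines = ['"a 2.5\\" drive",another column']

# Default csv.reader settings assume RFC 4180 quote doubling, so the
# backslash is kept as data and the quote handling goes wrong.
naive = next(csv.reader(spark_style_lines))

# Declaring the backslash as the escape character (and disabling quote
# doubling) recovers the original field values.
aware = next(csv.reader(spark_style_lines, doublequote=False, escapechar="\\"))

print(naive)
print(aware)
```

The first result reproduces the kind of garbage shown in the issue, while the second returns the intended fields.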


