Posted to issues@spark.apache.org by "Ondrej Kokes (JIRA)" <ji...@apache.org> on 2017/10/11 06:32:00 UTC
[jira] [Updated] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ondrej Kokes updated SPARK-22236:
---------------------------------
Description:
When reading or writing CSV files with Spark, double quotes are escaped with a backslash by default. However, the appropriate behaviour as set out by RFC 4180 (and adhered to by many software packages) is to escape using a second double quote.
This piece of Python code demonstrates the issue:
{code}
import csv

with open('testfile.csv', 'w') as f:
    cw = csv.writer(f)
    cw.writerow(['a 2.5" drive', 'another column'])
    cw.writerow(['a "quoted" string', '"quoted"'])
    cw.writerow([1, 2])

with open('testfile.csv') as f:
    print(f.read())
# "a 2.5"" drive",another column
# "a ""quoted"" string","""quoted"""
# 1,2

spark.read.csv('testfile.csv').collect()
# [Row(_c0='"a 2.5"" drive"', _c1='another column'),
#  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
#  Row(_c0='1', _c1='2')]

# explicitly setting the escape character fixes the issue
spark.read.option('escape', '"').csv('testfile.csv').collect()
# [Row(_c0='a 2.5" drive', _c1='another column'),
#  Row(_c0='a "quoted" string', _c1='"quoted"'),
#  Row(_c0='1', _c1='2')]
{code}
The same applies to writes: reading back a file written by Spark may result in garbage.
{code}
# read the file correctly, then write it out with default settings
df = spark.read.option('escape', '"').csv('testfile.csv')
df.write.format("csv").save('testout.csv')

with open('testout.csv/part-....csv') as f:
    cr = csv.reader(f)
    print(next(cr))
    print(next(cr))
# ['a 2.5\\ drive"', 'another column']
# ['a \\quoted\\" string"', '\\quoted\\""']
{code}
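The misparse can be reproduced without Spark at all. The snippet below (a minimal sketch using only the Python standard library) feeds an RFC 4180 reader a line escaped the way Spark's default writer escapes it, with a backslash before the embedded quote, and the field boundary is mis-detected:

```python
import csv
import io

# A line as Spark's default writer would produce it: the embedded
# double quote is escaped with a backslash instead of being doubled.
spark_style_line = '"a 2.5\\" drive",another column\r\n'

# Python's csv reader follows RFC 4180: the backslash is just a
# literal character, so the quote after it closes the field early and
# the remainder is glued on, producing a garbled value.
row = next(csv.reader(io.StringIO(spark_style_line)))
print(row)
# ['a 2.5\\ drive"', 'another column']
```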
The culprit is in [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91], where the default escape character is overridden.
While it's possible to work with CSV files in a "compatible" manner by setting the option explicitly, it would be useful if Spark had sensible defaults that conform to the above-mentioned RFC (as well as W3C recommendations). I realise this would be a breaking change, so if accepted it would probably need to start with a deprecation warning before moving to the new default.
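For comparison, this is the lossless round trip that RFC 4180 semantics give, sketched here with only the Python standard library: quotes are doubled on write and recovered exactly on read.

```python
import csv
import io

rows = [['a 2.5" drive', 'another column'],
        ['a "quoted" string', '"quoted"']]

# csv.writer follows RFC 4180: embedded double quotes are doubled,
# not backslash-escaped.
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Reading the same text back recovers the original values exactly.
buf.seek(0)
assert list(csv.reader(buf)) == rows
```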
> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
> Key: SPARK-22236
> URL: https://issues.apache.org/jira/browse/SPARK-22236
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 2.2.0
> Reporter: Ondrej Kokes
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)