Posted to issues@spark.apache.org by "Ruslan Dautkhanov (JIRA)" <ji...@apache.org> on 2018/08/27 16:45:00 UTC
[jira] [Created] (SPARK-25251) Make spark-csv's `quote` and `escape` options conform to RFC 4180
Ruslan Dautkhanov created SPARK-25251:
-----------------------------------------
Summary: Make spark-csv's `quote` and `escape` options conform to RFC 4180
Key: SPARK-25251
URL: https://issues.apache.org/jira/browse/SPARK-25251
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.3.1, 2.3.0, 2.4.0, 3.0.0
Reporter: Ruslan Dautkhanov
As described in [RFC-4180|https://tools.ietf.org/html/rfc4180], page 2 -
{noformat}
7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote
{noformat}
That's what Excel does, for example, by default.
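For illustration, here is a hypothetical record that follows this rule; the embedded double quotes are escaped by doubling them, and the comma inside the quoted field is part of the data:
{noformat}
id,comment,year
1,"She said ""hello, world"" and left",2018
{noformat}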
In Spark (as of Spark 2.1), however, escaping is done by default in a non-RFC way, using the backslash (\). To fix this you have to explicitly tell Spark to use the double quote as the escape character:
{code}
.option('quote', '"')
.option('escape', '"')
{code}
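For context, a minimal PySpark sketch of reading a file with these options (the session setup, file path, and header option are illustrative and not part of the original report):
{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read an RFC 4180 style CSV: fields are quoted with " and embedded
# quotes are escaped by doubling them ("").
df = (spark.read
      .option("header", "true")
      .option("quote", '"')
      .option("escape", '"')
      .csv("/tmp/example.csv"))

df.show(truncate=False)
{code}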
This may explain cases where a comma character was not interpreted as being inside a quoted column.
So this is a request to make the spark-csv reader RFC 4180 compliant with regard to the default values of the `quote` and `escape` options (make both equal to " ).
Since this is a backward-incompatible change, Spark 3.0 might be a good release for this change.
Some more background - https://stackoverflow.com/a/45138591/470583