Posted to issues@spark.apache.org by "Ruslan Dautkhanov (JIRA)" <ji...@apache.org> on 2018/08/27 16:45:00 UTC

[jira] [Created] (SPARK-25251) Make spark-csv's `quote` and `escape` options conform to RFC 4180

Ruslan Dautkhanov created SPARK-25251:
-----------------------------------------

             Summary: Make spark-csv's `quote` and `escape` options conform to RFC 4180
                 Key: SPARK-25251
                 URL: https://issues.apache.org/jira/browse/SPARK-25251
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.3.1, 2.3.0, 2.4.0, 3.0.0
            Reporter: Ruslan Dautkhanov


As described in [RFC-4180|https://tools.ietf.org/html/rfc4180], page 2:

{noformat}
   7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote
{noformat}

That's what Excel does, for example, by default.
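
For illustration, Python's standard csv module follows this convention by default; a minimal sketch with made-up sample data:

{code}
# Python's csv.reader applies the RFC-4180 doubled-quote convention by default
# (dialect option doublequote=True); the sample data here is made up.
import csv
import io

data = 'id,comment\n1,"He said ""hello, world"" and left"\n'
for row in csv.reader(io.StringIO(data)):
    print(row)
# ['id', 'comment']
# ['1', 'He said "hello, world" and left']
{code}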

In Spark, however (as of Spark 2.1), escaping is done by default in a non-RFC way, using the backslash (\). To get RFC-compliant behavior, you have to explicitly tell Spark to use the double quote as the escape character:

{code}
.option('quote', '"') 
.option('escape', '"')
{code}
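
Put together, a complete read might look like the following minimal PySpark sketch (the file path and header option are assumptions for illustration):

{code}
# Minimal PySpark sketch (path and header option are made up).
# Setting both `quote` and `escape` to the double quote makes the reader
# treat "" inside a quoted field as a literal " (the RFC-4180 convention).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rfc4180-csv').getOrCreate()

df = (spark.read
      .option('header', 'true')
      .option('quote', '"')
      .option('escape', '"')
      .csv('/path/to/input.csv'))

df.show(truncate=False)
{code}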

This may explain cases where a comma was not recognized as being inside a quoted column and was instead treated as a field delimiter.

So this is a request to make the spark-csv reader RFC-4180 compliant with respect to the default values of the `quote` and `escape` options (make both default to `"`).

Since this is a backward-incompatible change, Spark 3.0 might be a good release to make it in.

Some more background - https://stackoverflow.com/a/45138591/470583 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org