Posted to reviews@spark.apache.org by ep1804 <gi...@git.apache.org> on 2017/03/06 11:37:55 UTC

[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

GitHub user ep1804 opened a pull request:

    https://github.com/apache/spark/pull/17177

    [SPARK-19834][SQL] csv encoding/decoding using escape of escape

    The escape-of-escape character should be handled when encoding/decoding CSV with the uniVocity library.

    I added the code for this and covered it with a unit test; a short usage sketch follows the references below.
    
    references: 
    * https://issues.apache.org/jira/browse/SPARK-19834
    * http://stackoverflow.com/questions/42607208/spark-csv-error-when-reading-backslash-and-quotation-mark
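
    A minimal usage sketch of the intended round trip (`escapeEscape` here is the option proposed by this PR, assumed naming; it is not available in released Spark versions):

    ```scala
    import spark.implicits._  // assumes an existing SparkSession named `spark`

    // A value that contains the escape character immediately followed by a quote.
    val df = Seq((1, """AA\"BB""")).toDF("id", "value")

    df.coalesce(1).write
      .format("csv")
      .option("quote", "\"")
      .option("escape", "\\")
      .option("escapeEscape", "\\")  // proposed: character used to escape the quote escape
      .save("/tmp/escape-of-escape-demo")

    // Without escape-of-escape handling, reading this back can yield a different value.
    val roundTripped = spark.read
      .format("csv")
      .option("quote", "\"")
      .option("escape", "\\")
      .option("escapeEscape", "\\")
      .load("/tmp/escape-of-escape-demo")
    ```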
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ep1804/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17177.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17177
    
----
commit 9dc513cceb29f0989f9e0f00ec731c24b8139d76
Author: soonmok-kwon <so...@navercorp.com>
Date:   2017-03-06T11:30:20Z

    SPARK-19384 csv escape of escape

----




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv encoding/decoding using escape of...

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    Thank you for the prompt and detailed response, @HyukjinKwon.

    1. About the purpose of the PR: yes, it is about the char-to-escape-quote-escaping option. I used the wording 'encoding/decoding' in a general sense, which may be confusing.

    2. The indentation errors and readability problems will be fixed following your comments.

    3. About the last question: I think there is a problem here because the uniVocity library does not behave as expected, and I would like your advice. Here is my experiment:
    
    ### Test code
    
    For the experiment, I made an additional option `escapeEscape`.
    
    ```scala
          val df1 = spark.sqlContext.createDataFrame(List(
            (1, """AA"BB"""),      // 1 quote char (OK without escapeEscape option)
            (2, """AA\"BB"""),     // 1 escape char and 1 quote char
            (3, """AA\\"BB"""),    // 2 escape char and 1 quote char
            (4, """AA""BB"""),     // 2 quote char (OK without escapeEscape option)
            (5, """AA\"\"BB"""),   // (1 escape char anc 1 quote char) * 2
            (6, """AA\\"\\"BB""")  // (2 escape char and 1 quote char) * 2
          ))
    
          df1.coalesce(1).write
            .format("csv")
            .option("quote", "\"")
            .option("escape", "\\")
            .option("escapeEscape", "\\")
            .save(csvDir)
    
          val df2 = spark.read
            .format("csv")
            .option("quote", "\"")
            .option("escape", "\\")
            .option("escapeEscape", "\\")
            .load(csvDir).orderBy($"_c0")
    
    ```
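
    (For reference, the new `escapeEscape` option is assumed to be wired through `CSVOptions` into uniVocity roughly as sketched below; this is a simplified illustration, not the exact patch.)

    ```scala
    // CSVOptions.scala (sketch)
    val quote = getChar("quote", '\"')
    val escape = getChar("escape", '\\')
    val escapeEscape = getChar("escapeEscape", quote)  // new, experimental option

    // ... later, when configuring the uniVocity format:
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setCharToEscapeQuoteEscaping(escapeEscape)
    ```
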
    ### Firstly, I set `escapeEscape` to `\`, as documented by uniVocity.
    
    ```
    +---+----------+
    | _1|        _2|
    +---+----------+
    |  1|     AA"BB|
    |  2|    AA\"BB|
    |  3|   AA\\"BB|
    |  4|    AA""BB|
    |  5|  AA\"\"BB|
    |  6|AA\\"\\"BB|
    +---+----------+
    
    "AA\"BB"
    "AA\\"BB"
    "AA\\\"BB"
    "AA\"\"BB"
    "AA\\"\\\"BB"
    "AA\\\"\\\\\"BB"
    
    +---+---------+
    |_c0|      _c1|
    +---+---------+
    |  1|    AA"BB|
    |  2| "AA\"BB"|
    |  3|   AA\"BB|
    |  4|   AA""BB|
    |  5|  AA\\"BB|
    |  6|AA\"\\"BB|
    +---+---------+
    ```
    In the output above, the first table is df1, the second table is df2, and the strings between them are the raw CSV file content.

    df1 and df2 are different. IMHO, this is not the result users would expect.
    
    
    ### Secondly, I set `escapeEscape` to `A` to test what happens when the escape-of-escape character is one that already appears in the data.
    ```
    +---+----------+
    | _1|        _2|
    +---+----------+
    |  1|     AA"BB|
    |  2|    AA\"BB|
    |  3|   AA\\"BB|
    |  4|    AA""BB|
    |  5|  AA\"\"BB|
    |  6|AA\\"\\"BB|
    +---+----------+
    
    "AA\"BB"
    "AA\\"BB"
    "AA\\\"BB"
    "AA\"\"BB"
    "AA\\"A\\"BB"
    "AA\\\"A\A\\"BB"
    
    +---+--------+
    |_c0|     _c1|
    +---+--------+
    |  1|  "\"BB"|
    |  2|    \"BB|
    |  3|   \\"BB|
    |  4|    \"BB|
    |  5|  \"\"BB|
    |  6|\\"\\"BB|
    +---+--------+
    ```
    
    Again, df1 and df2 are different; something has clearly gone wrong.
    
    ### Thirdly, I tested with the `escapeEscape = quote` setting.
    
    ```
    +---+----------+
    | _1|        _2|
    +---+----------+
    |  1|     AA"BB|
    |  2|    AA\"BB|
    |  3|   AA\\"BB|
    |  4|    AA""BB|
    |  5|  AA\"\"BB|
    |  6|AA\\"\\"BB|
    +---+----------+
    
    "AA\"BB"
    "AA\\"BB"
    "AA\\\"BB"
    "AA\"\"BB"
    "AA\\""\\"BB"
    "AA\\\""\"\\"BB"
    
    +---+----------+
    |_c0|       _c1|
    +---+----------+
    |  1|     AA"BB|
    |  2|    AA\"BB|
    |  3|   AA\\"BB|
    |  4|    AA""BB|
    |  5|  AA\"\"BB|
    |  6|AA\\"\\"BB|
    +---+----------+
    ```
    
    
    ### Finally, since `escapeEscape = quote` round-trips correctly above, I tested it with more examples.
    
    ```scala
          val df1 = spark.sqlContext.createDataFrame(List(
            (1, """AA\"\"\"\"\"BB"""),   // (1 escape char anc 1 quote char) * 5
            (2, """AA\\"\\"\\"\\"\\"BB"""),  // (2 escape char and 1 quote char) * 5
            (3, "You are \"beautiful\""),
            (4, "Yes, \\\"inside\"\\")
          ))
    ```
    
    And the result is correct:
    
    ```
    +---+-------------------+
    | _1|                 _2|
    +---+-------------------+
    |  1|     AA\"\"\"\"\"BB|
    |  2|AA\\"\\"\\"\\"\\"BB|
    |  3|You are "beautiful"|
    |  4|    Yes, \"inside"\|
    +---+-------------------+
    
    "AA\\""\\""\\""\\""\\"BB"
    "AA\\\""\"\\""\"\\""\"\\""\"\\"BB"
    "You are \"beautiful\""
    "Yes, "\\"inside\""\"
    
    +---+-------------------+
    |_c0|                _c1|
    +---+-------------------+
    |  1|     AA\"\"\"\"\"BB|
    |  2|AA\\"\\"\\"\\"\\"BB|
    |  3|You are "beautiful"|
    |  4|    Yes, \"inside"\|
    +---+-------------------+
    ```
    
    This is why I used the `quote` character for the `escapeEscape` parameter instead of the `escape` character.




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    Thank you so much both for your efforts and time.




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105307099
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -465,6 +465,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("save csv with quote escaping enabled, using char to escape quote-escape") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +
    +
    +      val df1 = spark.sqlContext.createDataFrame(List(
    +        (1, """AA\"BB"""),              // 1 escape char anc 1 quote char
    +        (2, """AA\\"BB"""),             // 2 escape char and 1 quote char
    +        (3, """AA\"\"\"\"\"BB"""),      // (1 escape char anc 1 quote char) * 5
    +        (4, """AA\\"\\"\\"\\"\\"BB"""), // (2 escape char and 1 quote char) * 5
    +        (5, """You are "beautiful"""),
    +        (6, """Yes, \"inside"\""")
    +      ))
    +
    +      // escapeQuotes should be true by default
    +      // escapeEscape should be equal to quote by default
    +      df1.coalesce(1).write
    +        .format("csv")
    +        .option("quote", "\"")
    +        .option("escape", "\\")
    +        .option("escapeEscape", "\"")
    +        .save(csvDir)
    +
    +      val df2 = spark.read
    +        .format("csv")
    +        .option("quote", "\"")
    +        .option("escape", "\\")
    +        .option("escapeEscape", "\"")
    +        .load(csvDir).orderBy($"_c0")
    +
    +      val df1Str = df1.collect().map(_.getString(1)).mkString("")
    +      val df2Str = df2.collect().map(_.getString(1)).mkString("")
    +
    +      assert(df1Str == df2Str)
    --- End diff --
    
    Maybe we could use `checkAnswer(df1, df2)`. Does this follow your intention?
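
    (i.e., roughly the following, assuming both DataFrames come back with comparable schemas:)

    ```scala
    // Instead of concatenating strings and asserting on them:
    //   assert(df1Str == df2Str)
    // compare the DataFrames directly:
    checkAnswer(df1, df2)
    ```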




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r104436338
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("save csv with quote escaping enabled, avoiding double backslash") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +
    +      val df1 = spark.sqlContext.createDataFrame(List(
    +        (1, "aa\\\"bb,cc"),         // aa\"bb,cc
    +        (2, "aa\\\\\"bb,cc"),       // aa\\\"bb,cc
    +        (3, "aa\\\"\\\"bb,cc"),     // aa\"\"bb,cc
    +        (4, "aa\\\\\"\\\\\"bb,cc")  // aa\\\"\\\"bb,cc
    +      ))
    +
    +      df1.coalesce(1).write
    +              .format("csv")
    +              .option("quote", "\"")
    +              .option("escape", "\\")
    +              .save(csvDir)
    +
    +      val df2 = spark.read.csv(csvDir).orderBy($"_c0")
    +
    +      val df1Str = df1.collect().map(_.getString(1)).mkString(" ")
    +
    +      val df2Str = df2.select("_c1").collect().map(_.getString(0)).mkString(" ")
    +
    +      val text = spark.read
    --- End diff --
    
    We should use two-space line indentation (please refer to https://github.com/databricks/scala-style-guide).
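
    (i.e., something like the following, with two-space continuation indentation:)

    ```scala
    df1.coalesce(1).write
      .format("csv")
      .option("quote", "\"")
      .option("escape", "\\")
      .save(csvDir)
    ```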




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105337093
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -693,8 +697,8 @@ def text(self, path, compression=None):
     
         @since(2.0)
         def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=None,
    -            header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None,
    -            timestampFormat=None, timeZone=None):
    +            escapeQuoteEscaping=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None,
    --- End diff --
    
    Oh, @ep1804, we should add the new option at the end of these arguments; otherwise it breaks existing code that uses this API in lower versions and passes these options as positional arguments (not keyword arguments).




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    @ep1804 @jbax Thank you. I will cc and inform you both when I see a PR bumping the uniVocity version up to 2.4.0 (or I will probably open one myself).




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by jbax <gi...@git.apache.org>.
Github user jbax commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    This doesn't seem correct to me. All test cases use broken CSV and trigger the parser's handling of unescaped quotes, where it tries to rescue the data and produce something sensible. See my test case here: https://github.com/uniVocity/univocity-parsers/issues/143




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105306617
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -465,6 +465,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("save csv with quote escaping enabled, using char to escape quote-escape") {
    +    withTempDir { dir =>
    --- End diff --
    
    Let's use `withTempPath { path =>` directly. 
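
    (i.e., roughly:)

    ```scala
    withTempPath { path =>
      val csvDir = path.getCanonicalPath
      // write the CSV under csvDir, read it back, and compare
    }
    ```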




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    As mentioned in https://github.com/uniVocity/univocity-parsers/issues/143, proper handling of escape characters also requires the uniVocity option `escapeUnquotedValues`.

    I added the option setter, a unit test, and documentation. For the documentation, I documented `escapeUnquotedValues` in the same methods where `escapeQuotes` is documented.
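
    (A sketch of the intended usage; `escapeUnquotedValues` is the option name added in this PR, mapping to uniVocity's `setEscapeUnquotedValues`, and is not available in released Spark versions:)

    ```scala
    val df = spark.read
      .format("csv")
      .option("quote", "\"")
      .option("escape", "\\")
      .option("escapeUnquotedValues", "true")  // also handle escape sequences in unquoted values
      .load(csvDir)
    ```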




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    I think it is good to add, as it is also described in the univocity library, but I am not sure it is worth exposing an option that currently has a small bug. Maybe we could close this for now and re-open it after we bump univocity to 2.4.0.




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    An issue has been raised against the uniVocity parser: https://github.com/uniVocity/univocity-parsers/issues/143




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by jbax <gi...@git.apache.org>.
Github user jbax commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    2.4.0 released, thank you guys!




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    Documentation for DataFrameReader, DataFrameWriter, DataStreamReader, readwriter.py, and streaming.py has been written. Please check.




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r104401595
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("save csv with quote escaping enabled, avoiding double backslash") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +
    +      val df1 = spark.sqlContext.createDataFrame(List(
    +        (1, "aa\\\"bb,cc"),         // aa\"bb,cc
    +        (2, "aa\\\\\"bb,cc"),       // aa\\\"bb,cc
    +        (3, "aa\\\"\\\"bb,cc"),     // aa\"\"bb,cc
    +        (4, "aa\\\\\"\\\\\"bb,cc")  // aa\\\"\\\"bb,cc
    +      ))
    +
    +      df1.coalesce(1).write
    +              .format("csv")
    +              .option("quote", "\"")
    +              .option("escape", "\\")
    +              .save(csvDir)
    +
    +      val df2 = spark.read.csv(csvDir).orderBy($"_c0")
    +
    +      val df1Str = df1.collect().map(_.getString(1)).mkString(" ")
    +
    +      val df2Str = df2.select("_c1").collect().map(_.getString(0)).mkString(" ")
    +
    +      val text = spark.read
    +              .format("text")
    +              .option("quote", "\"")
    +              .option("escape", "\\")
    +              .load(csvDir)
    +              .collect()
    +              .map(_.getString(0))
    +              .sortBy(_(0)).map(_.drop(3).init).mkString(" ")
    +
    +      val textExpected =
    +        "aa\\\\\"bb,cc aa\\\\\\\"bb,cc aa\\\\\"\"\\\\\"bb,cc aa\\\\\\\"\"\\\"\\\\\"bb,cc"
    --- End diff --
    
    I think this would be more readable if it were wrapped in triple quotes (`"""..."""`) in this case.
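
    For example, the same expected string re-expressed with triple quotes (worth double-checking against the raw file):

    ```scala
    val textExpected =
      """aa\\"bb,cc aa\\\"bb,cc aa\\""\\"bb,cc aa\\\""\"\\"bb,cc"""
    ```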




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105306195
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -88,6 +88,7 @@ private[csv] class CSVOptions(
     
       val quote = getChar("quote", '\"')
       val escape = getChar("escape", '\\')
    +  val escapeEscape = getChar("escapeEscape", quote)
    --- End diff --
    
    Also, I would make this an option of this CSV datasource. I think we should document it in `DataFrameReader`, `DataStreamReader`, `readwriter.py`, and `streaming.py`.

    Lastly, how about renaming `escapeEscape` to `escapeQuoteEscaping` so it matches the original uniVocity name?




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    The bug has been fixed in uniVocity: https://github.com/uniVocity/univocity-parsers/issues/143
    I will take this into account (on Monday).




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv encoding/decoding using escape of...

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    cc @HyukjinKwon




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    Thank you @HyukjinKwon .
    
    I made changes following your comments:
     * `escapeQuoteEscaping` instead of `escapeEscape`
     * default value of `\u0000` (unset)
     * `withTempPath`
     * no `orderBy`
     * `checkAnswer`
     * styles : )




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    @jbax 
    
    They are not CSV inputs. They are RAW TEXT.




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105376506
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -693,8 +697,8 @@ def text(self, path, compression=None):
     
         @since(2.0)
         def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=None,
    -            header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None,
    -            timestampFormat=None, timeZone=None):
    +            escapeQuoteEscaping=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None,
    --- End diff --
    
    @HyukjinKwon 
    
    It is fixed.




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r104401044
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("save csv with quote escaping enabled, avoiding double backslash") {
    --- End diff --
    
    Could we maybe narrow down what this tests and reduce the test code? It is a little confusing what it tests.




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv encoding/decoding using escape of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    Can one of the admins verify this patch?




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105304318
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -88,6 +88,7 @@ private[csv] class CSVOptions(
     
       val quote = getChar("quote", '\"')
       val escape = getChar("escape", '\\')
    +  val escapeEscape = getChar("escapeEscape", quote)
    --- End diff --
    
    @ep1804, let's use `'\0'` by default, as that is the default for `setCharToEscapeQuoteEscaping` when it is not set (see https://github.com/uniVocity/univocity-parsers/blob/e34a149077147de7cafcafbbaf8a4310e6297584/src/main/java/com/univocity/parsers/csv/CsvFormat.java#L160-L176), because I am worried about changing the existing behaviour for lower versions. It should be okay for now as long as it is configurable.
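
    (i.e., something along these lines:)

    ```scala
    // '\u0000' mirrors uniVocity's "not set" default for setCharToEscapeQuoteEscaping,
    // so existing behaviour is preserved unless the option is set explicitly.
    val escapeEscape = getChar("escapeEscape", '\u0000')
    ```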




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105306543
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -465,6 +465,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("save csv with quote escaping enabled, using char to escape quote-escape") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +
    +
    --- End diff --
    
    super minor nit: here too, a single newline if more commits are going to be pushed.




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 closed the pull request at:

    https://github.com/apache/spark/pull/17177




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105306510
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -465,6 +465,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("save csv with quote escaping enabled, using char to escape quote-escape") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +
    +
    +      val df1 = spark.sqlContext.createDataFrame(List(
    +        (1, """AA\"BB"""),              // 1 escape char anc 1 quote char
    +        (2, """AA\\"BB"""),             // 2 escape char and 1 quote char
    +        (3, """AA\"\"\"\"\"BB"""),      // (1 escape char anc 1 quote char) * 5
    +        (4, """AA\\"\\"\\"\\"\\"BB"""), // (2 escape char and 1 quote char) * 5
    +        (5, """You are "beautiful"""),
    +        (6, """Yes, \"inside"\""")
    +      ))
    +
    +      // escapeQuotes should be true by default
    +      // escapeEscape should be equal to quote by default
    +      df1.coalesce(1).write
    +        .format("csv")
    +        .option("quote", "\"")
    +        .option("escape", "\\")
    +        .option("escapeEscape", "\"")
    +        .save(csvDir)
    +
    +      val df2 = spark.read
    +        .format("csv")
    +        .option("quote", "\"")
    +        .option("escape", "\\")
    +        .option("escapeEscape", "\"")
    +        .load(csvDir).orderBy($"_c0")
    +
    +      val df1Str = df1.collect().map(_.getString(1)).mkString("")
    +      val df2Str = df2.collect().map(_.getString(1)).mkString("")
    +
    +      assert(df1Str == df2Str)
    +    }
    +  }
    +
    +
    --- End diff --
    
    super minor: single newline :).




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r105307145
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -465,6 +465,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("save csv with quote escaping enabled, using char to escape quote-escape") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +
    +
    +      val df1 = spark.sqlContext.createDataFrame(List(
    +        (1, """AA\"BB"""),              // 1 escape char anc 1 quote char
    +        (2, """AA\\"BB"""),             // 2 escape char and 1 quote char
    +        (3, """AA\"\"\"\"\"BB"""),      // (1 escape char anc 1 quote char) * 5
    +        (4, """AA\\"\\"\\"\\"\\"BB"""), // (2 escape char and 1 quote char) * 5
    +        (5, """You are "beautiful"""),
    +        (6, """Yes, \"inside"\""")
    +      ))
    +
    +      // escapeQuotes should be true by default
    +      // escapeEscape should be equal to quote by default
    +      df1.coalesce(1).write
    +        .format("csv")
    +        .option("quote", "\"")
    +        .option("escape", "\\")
    +        .option("escapeEscape", "\"")
    +        .save(csvDir)
    +
    +      val df2 = spark.read
    +        .format("csv")
    +        .option("quote", "\"")
    +        .option("escape", "\\")
    +        .option("escapeEscape", "\"")
    +        .load(csvDir).orderBy($"_c0")
    --- End diff --
    
    Per https://github.com/apache/spark/pull/17177/files#r105307099, I think we would not need this `orderBy`.




[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

Posted by ep1804 <gi...@git.apache.org>.
Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    I agree with you @HyukjinKwon; this PR will be closed for now and re-opened later.

    And thank you for the notice, @jbax!




[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17177#discussion_r104456057
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -167,6 +168,7 @@ private[csv] class CSVOptions(
         format.setDelimiter(delimiter)
         format.setQuote(quote)
         format.setQuoteEscape(escape)
    +    format.setCharToEscapeQuoteEscaping(quote)
    --- End diff --
    
    It seems that https://github.com/uniVocity/univocity-parsers/blob/master/README.md#escaping-quote-escape-characters sets the same `escape` character for both `setQuoteEscape` and `setCharToEscapeQuoteEscaping`, and that sounds correct to me. However, it is done differently here. Do you mind if I ask the reason?
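
    (For reference, a minimal sketch of the pairing that README section describes, with the same escape character passed to both setters:)

    ```scala
    import com.univocity.parsers.csv.CsvWriterSettings

    val settings = new CsvWriterSettings()
    settings.getFormat.setQuote('"')
    settings.getFormat.setQuoteEscape('\\')                // escapes quotes inside values
    settings.getFormat.setCharToEscapeQuoteEscaping('\\')  // the same character escapes the escape itself
    ```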

