You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by aa8y <gi...@git.apache.org> on 2017/12/23 21:35:54 UTC

[GitHub] spark pull request #20068: SPARK-17916: Fix empty string being parsed as nul...

GitHub user aa8y opened a pull request:

    https://github.com/apache/spark/pull/20068

    SPARK-17916: Fix empty string being parsed as null when nullValue is set.

    ## What changes were proposed in this pull request?
    
    When the option `nullValue` is set, the empty value is also set to the same value. Therefore empty strings get parsed as `null`, which should not happen. This PR explicitly changes this to be an empty string.
    
    ## How was this patch tested?
    
    Tests were added without the fix. It was tested that they failed. Then the fix was added and the tests have been ensured to pass.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aa8y/spark csvEmptyValue

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20068
    
----
commit 1c3d2216380c9cc89ea829588305b5f31c71d6d5
Author: Jeff Zhang <zj...@...>
Date:   2016-04-29T17:42:52Z

    Rebase with master.

commit b4eddd67234637feb1b255811d8d018b28894095
Author: Arun Allamsetty <ar...@...>
Date:   2017-10-14T19:46:53Z

    Merge remote-tracking branch 'upstream/master'

commit f406de9fe13f96b0ee615d496c283b21f415fd2b
Author: Arun Allamsetty <ar...@...>
Date:   2017-12-12T00:44:15Z

    Merge remote-tracking branch 'upstream/master'

commit 762c14487c762a193fd4f4359c51aaba71eca3f9
Author: Arun Allamsetty <ar...@...>
Date:   2017-12-21T21:49:50Z

    Merge remote-tracking branch 'upstream/master'

commit ebe2900aadd3af0114ed71506088c6a736dd5002
Author: Arun Allamsetty <ar...@...>
Date:   2017-12-21T22:52:15Z

    SPARK-17916: Fix empty string being parsed as null when nullValue is set.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158593580
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -152,7 +152,7 @@ class CSVOptions(
         writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
         writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
         writerSettings.setNullValue(nullValue)
    -    writerSettings.setEmptyValue(nullValue)
    +    writerSettings.setEmptyValue("")
    --- End diff --
    
    Can we simply expose this as an option and keep the previous behaviour if this option is not set explicitly by the user?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158616872
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
    +    val sparkSession = spark
    +
    +    val elems = Seq(("bar"), (""), (null: String))
    +
    +    // Checks for new behavior where an empty string is not coerced to null.
    +    withTempDir { dir =>
    --- End diff --
    
    We could do this `withTempPath { path =>`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by aa8y <gi...@git.apache.org>.
Github user aa8y commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158606107
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -152,7 +152,7 @@ class CSVOptions(
         writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
         writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
         writerSettings.setNullValue(nullValue)
    -    writerSettings.setEmptyValue(nullValue)
    +    writerSettings.setEmptyValue("")
    --- End diff --
    
    I disagree. I don't think the previous behavior should _not_ be exposed as an option as the previous behavior was a bug. All it did was that it _always_ coerced empty values to `null`s. If the `nullValue` was not set, then the it was set to `""` by default which coerced `""` to `null`. The empty value being set to `""` had no affect in this case. If it was set to something else, say `\N`, then the empty value was also set to `\N` which resulted in parsing both `\N` and `""` to `null`, as `""` was no longer considered as an empty value and the `""` being coerced to null is the Univocity parser's default.
    
    Setting empty value explicitly to the `""` literal would ensure that an empty string is always parsed as empty string, unless `nullValue` is not set or it is set to `""`, which is what people would do if they want `""` to be parsed as `null`, which would be the old behavior.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    **[Test build #85353 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85353/testReport)** for PR 20068 at commit [`ebe2900`](https://github.com/apache/spark/commit/ebe2900aadd3af0114ed71506088c6a736dd5002).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by aa8y <gi...@git.apache.org>.
Github user aa8y commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r159160508
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -152,7 +152,11 @@ class CSVOptions(
         writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
         writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
         writerSettings.setNullValue(nullValue)
    -    writerSettings.setEmptyValue(nullValue)
    +    // The Univocity parser parses empty strings as `null` by default. This is the default behavior
    +    // for Spark too, since `nullValue` defaults to an empty string and has a higher precedence to
    +    // setEmptyValue(). But when `nullValue` is set to a different value, that would mean that the
    +    // empty string should be parsed not as `null` but as an empty string.
    +    writerSettings.setEmptyValue("")
    --- End diff --
    
    I talked about this with Hyukjin Kwon before. I think the previous behavior should _not_ be exposed as an option as the previous behavior was a bug. All it did was that it _always_ coerced empty values to `null`s. If the `nullValue` was not set, then the it was set to `""` by default which coerced `""` to `null`. The empty value being set to `""` had no affect in this case. If it was set to something else, say `\N`, then the empty value was also set to `\N` which resulted in parsing both `\N` and `""` to `null`, as `""` was no longer considered as an empty value and the `""` being coerced to null is the Univocity parser's default.
    
    Setting empty value explicitly to the `""` literal would ensure that an empty string is always parsed as empty string, unless `nullValue` is not set or it is set to `""`, which is what people would do if they want `""` to be parsed as `null`, which would be the old behavior.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20068


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85368/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    ping @aa8y 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86739/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    **[Test build #85353 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85353/testReport)** for PR 20068 at commit [`ebe2900`](https://github.com/apache/spark/commit/ebe2900aadd3af0114ed71506088c6a736dd5002).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by aa8y <gi...@git.apache.org>.
Github user aa8y commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    I apologize I haven't had time to work on this. I can close this for now and reopen it when I have a working fix for it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    **[Test build #85368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85368/testReport)** for PR 20068 at commit [`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158616912
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
    +    val sparkSession = spark
    +
    +    val elems = Seq(("bar"), (""), (null: String))
    +
    +    // Checks for new behavior where an empty string is not coerced to null.
    +    withTempDir { dir =>
    +      val outDir = new File(dir, "out").getCanonicalPath
    +      val nullValue = "\\N"
    +
    +      import sparkSession.implicits._
    +      val dsIn = spark.createDataset(elems)
    +      dsIn.write
    +        .option("nullValue", nullValue)
    +        .csv(outDir)
    +      val dsOut = spark.read
    +        .option("nullValue", nullValue)
    +        .schema(dsIn.schema)
    +        .csv(outDir)
    +        .as[(String)]
    +      val computed = dsOut.collect.toSeq
    +      val expected = Seq(("bar"), (null: String))
    +
    +      assert(computed.size === 2)
    +      assert(computed.sameElements(expected))
    --- End diff --
    
    We can use `checkAnswer(..: DataFrame, .. : DataFrame)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158616812
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
    +    val sparkSession = spark
    +
    +    val elems = Seq(("bar"), (""), (null: String))
    +
    +    // Checks for new behavior where an empty string is not coerced to null.
    +    withTempDir { dir =>
    +      val outDir = new File(dir, "out").getCanonicalPath
    +      val nullValue = "\\N"
    +
    +      import sparkSession.implicits._
    --- End diff --
    
    I think we don't need this (by `import testImplicits._` above).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r159118238
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -152,7 +152,11 @@ class CSVOptions(
         writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
         writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
         writerSettings.setNullValue(nullValue)
    -    writerSettings.setEmptyValue(nullValue)
    +    // The Univocity parser parses empty strings as `null` by default. This is the default behavior
    +    // for Spark too, since `nullValue` defaults to an empty string and has a higher precedence to
    +    // setEmptyValue(). But when `nullValue` is set to a different value, that would mean that the
    +    // empty string should be parsed not as `null` but as an empty string.
    +    writerSettings.setEmptyValue("")
    --- End diff --
    
    Please make it as a conf like what we did for `nullValue`?
    
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L613


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    ping @aa8y @HyukjinKwon @MaxGekk @gengliangwang 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r161378950
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -152,7 +152,11 @@ class CSVOptions(
         writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
         writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
         writerSettings.setNullValue(nullValue)
    -    writerSettings.setEmptyValue(nullValue)
    +    // The Univocity parser parses empty strings as `null` by default. This is the default behavior
    +    // for Spark too, since `nullValue` defaults to an empty string and has a higher precedence to
    +    // setEmptyValue(). But when `nullValue` is set to a different value, that would mean that the
    +    // empty string should be parsed not as `null` but as an empty string.
    +    writerSettings.setEmptyValue("")
    --- End diff --
    
    This conf is needed for avoiding the behavior changes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158616834
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
    +    val sparkSession = spark
    +
    +    val elems = Seq(("bar"), (""), (null: String))
    +
    +    // Checks for new behavior where an empty string is not coerced to null.
    +    withTempDir { dir =>
    +      val outDir = new File(dir, "out").getCanonicalPath
    +      val nullValue = "\\N"
    +
    +      import sparkSession.implicits._
    +      val dsIn = spark.createDataset(elems)
    --- End diff --
    
    `Seq(("bar"), (""), (null: String)).toDS`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    **[Test build #85404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85404/testReport)** for PR 20068 at commit [`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85353/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158616740
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -152,7 +152,7 @@ class CSVOptions(
         writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
         writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
         writerSettings.setNullValue(nullValue)
    -    writerSettings.setEmptyValue(nullValue)
    +    writerSettings.setEmptyValue("")
    --- End diff --
    
    Could we leave some comments here and update the PR description too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    **[Test build #86739 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86739/testReport)** for PR 20068 at commit [`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by aa8y <gi...@git.apache.org>.
Github user aa8y commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    I'll work on it in the next week or two. That would involve a PR to the Univocity CSV parser.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    **[Test build #85368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85368/testReport)** for PR 20068 at commit [`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Can we make the tests pass BTW, @aa8y?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    **[Test build #86739 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86739/testReport)** for PR 20068 at commit [`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    **[Test build #85404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85404/testReport)** for PR 20068 at commit [`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by aa8y <gi...@git.apache.org>.
Github user aa8y commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    @gatorsmile I've created this PR since #12904 has not been updated in a while.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85404/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by aa8y <gi...@git.apache.org>.
Github user aa8y commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    @HyukjinKwon I made code changes based on your suggestions. I also changed the tests to use the data mentioned in the ticket. However, you're right, the tests no longer pass. But that is because the Univocity `CsvParser`, when it encounters an empty string while parsing the data, replaces it with the `nullValue` we set (see [setNullValue()](http://docs.univocity.com/parsers/2.5.9/com/univocity/parsers/common/CommonSettings.html#setNullValue(java.lang.String))). And the `emptyValue` is only effective when the _empty string_ being read has quotes around it (see [setEmptyValue()](http://docs.univocity.com/parsers/2.5.9/com/univocity/parsers/csv/CsvParserSettings.html#setEmptyValue(java.lang.String))). So I believe, at this point, the issue needs to be fixed in the underlying library being used.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158617095
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
    +    val sparkSession = spark
    +
    +    val elems = Seq(("bar"), (""), (null: String))
    +
    +    // Checks for new behavior where an empty string is not coerced to null.
    +    withTempDir { dir =>
    +      val outDir = new File(dir, "out").getCanonicalPath
    +      val nullValue = "\\N"
    +
    +      import sparkSession.implicits._
    +      val dsIn = spark.createDataset(elems)
    +      dsIn.write
    +        .option("nullValue", nullValue)
    +        .csv(outDir)
    +      val dsOut = spark.read
    +        .option("nullValue", nullValue)
    +        .schema(dsIn.schema)
    +        .csv(outDir)
    +        .as[(String)]
    +      val computed = dsOut.collect.toSeq
    +      val expected = Seq(("bar"), (null: String))
    --- End diff --
    
    I don't think this is quite the expected output? Could we use the examples provided in the JIRA rather than single row ones?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20068
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20068#discussion_r158616591
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
    +    val sparkSession = spark
    --- End diff --
    
    I think we can just use `spark`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org