You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/08/07 05:27:48 UTC

[GitHub] spark pull request #22019: [SPARK-25040][SQL] Empty string for double and fl...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/22019

    [SPARK-25040][SQL] Empty string for double and float types should be nulls in JSON

    ## What changes were proposed in this pull request?
    
    This PR proposes to treat empty strings for double and float types as `null` consistently. Looks we mistakenly missed this corner case, which I guess is not that serious since this looks happened betwen 1.x and 2.x, and pretty corner case.
    
    For an easy reproducer, in case of double, the code below raises an error:
    
    ```scala
    spark.read.option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 1.1, "b": 1.1}""").toDS).show()
    ```
    
    ```scala
    Caused by: java.lang.RuntimeException: Cannot parse  as double.
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7$$anonfun$apply$10.applyOrElse(JacksonParser.scala:163)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7$$anonfun$apply$10.applyOrElse(JacksonParser.scala:152)
      at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$parseJsonToken(JacksonParser.scala:277)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7.apply(JacksonParser.scala:152)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7.apply(JacksonParser.scala:152)
      at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$convertObject(JacksonParser.scala:312)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1$$anonfun$apply$2.applyOrElse(JacksonParser.scala:71)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1$$anonfun$apply$2.applyOrElse(JacksonParser.scala:70)
      at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$parseJsonToken(JacksonParser.scala:277)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1.apply(JacksonParser.scala:70)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1.apply(JacksonParser.scala:70)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:368)
      at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:363)
      at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2491)
      at org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:363)
      at org.apache.spark.sql.DataFrameReader$$anonfun$5$$anonfun$6.apply(DataFrameReader.scala:450)
      at org.apache.spark.sql.DataFrameReader$$anonfun$5$$anonfun$6.apply(DataFrameReader.scala:450)
      at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
      ... 24 more
    ```
    
    Unlike other types:
    
    ```scala
    spark.read.option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 1, "b": 1}""").toDS).show()
    ```
    
    ```
    +----+----+
    |   a|   b|
    +----+----+
    |null|null|
    |   1|   1|
    +----+----+
    ```
    
    ## How was this patch tested?
    
    Unit tests were added and manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark double-float-empty

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22019.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22019
    
----
commit ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b
Author: hyukjinkwon <gu...@...>
Date:   2018-08-07T05:23:43Z

    Empty string for double and float types should be nulls in JSON

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Empty string should be treated as null for all non string types?
    https://github.com/apache/spark/blob/ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L282-L285


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    I found this:
    
    https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L69
    
    in branch-1.6 and was thinking if we actually should just disallow. Let me take a closer look and make some fixes here with some details soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    @viirya and @MaxGekk, are you busy? Do you mind if I ask to take over this? we will completely disallow empty strings in other types and target it 3.0.0. The changes wouldn't be too much and it requires to update the migration guide.
    
    I will be busy for a couple of weeks so I would appreciate it if you find some time to take over this.
    
    Otherwise, I will start to work on this after a couple of weeks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    **[Test build #94342 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94342/testReport)** for PR 22019 at commit [`ef57fdd`](https://github.com/apache/spark/commit/ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    **[Test build #94342 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94342/testReport)** for PR 22019 at commit [`ef57fdd`](https://github.com/apache/spark/commit/ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22019: [WIP][SPARK-25040][SQL] Empty string for double a...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon closed the pull request at:

    https://github.com/apache/spark/pull/22019


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    What do you guys think about completely disallow empty strings in other types and target it 3.0.0? In theory, empty string is a string and I think strictly it's more correct to disallow them. Seems that's the original intention in the code.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Looks few other types could potentially have this issue too. Let me fix them all here while I am here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    @HyukjinKwon thanks for pinging me. Let's look at this and see if I can make a PR soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    SGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    I agree with this proposal @HyukjinKwon. I think it is wrong to consider as a null an empty string. An empty string is not a valid value for an int/double/... So in case we have, we should fail I think.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94342/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1890/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    SGTM too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Let me reopen a PR and proceed this after 2.4.0 or the code freeze


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Hm.. wait let me take a closer look.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    cc @cloud-fan, @viirya and @fuqiliang


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    > Empty string should be treated as null for all non string types?
    
    I would exclude complex types.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22019
  
    Probably I will not be able to look at this for the next a few weeks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org