You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/08/07 05:27:48 UTC
[GitHub] spark pull request #22019: [SPARK-25040][SQL] Empty string for double and fl...
GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/22019
[SPARK-25040][SQL] Empty string for double and float types should be nulls in JSON
## What changes were proposed in this pull request?
This PR proposes to treat empty strings for double and float types as `null` consistently. Looks we mistakenly missed this corner case, which I guess is not that serious since this looks happened betwen 1.x and 2.x, and pretty corner case.
For an easy reproducer, in case of double, the code below raises an error:
```scala
spark.read.option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 1.1, "b": 1.1}""").toDS).show()
```
```scala
Caused by: java.lang.RuntimeException: Cannot parse as double.
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7$$anonfun$apply$10.applyOrElse(JacksonParser.scala:163)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7$$anonfun$apply$10.applyOrElse(JacksonParser.scala:152)
at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$parseJsonToken(JacksonParser.scala:277)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7.apply(JacksonParser.scala:152)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7.apply(JacksonParser.scala:152)
at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$convertObject(JacksonParser.scala:312)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1$$anonfun$apply$2.applyOrElse(JacksonParser.scala:71)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1$$anonfun$apply$2.applyOrElse(JacksonParser.scala:70)
at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$parseJsonToken(JacksonParser.scala:277)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1.apply(JacksonParser.scala:70)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1.apply(JacksonParser.scala:70)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:368)
at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:363)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2491)
at org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:363)
at org.apache.spark.sql.DataFrameReader$$anonfun$5$$anonfun$6.apply(DataFrameReader.scala:450)
at org.apache.spark.sql.DataFrameReader$$anonfun$5$$anonfun$6.apply(DataFrameReader.scala:450)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
... 24 more
```
Unlike other types:
```scala
spark.read.option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 1, "b": 1}""").toDS).show()
```
```
+----+----+
| a| b|
+----+----+
|null|null|
| 1| 1|
+----+----+
```
## How was this patch tested?
Unit tests were added and manually tested.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark double-float-empty
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22019.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22019
----
commit ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b
Author: hyukjinkwon <gu...@...>
Date: 2018-08-07T05:23:43Z
Empty string for double and float types should be nulls in JSON
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/22019
Empty string should be treated as null for all non string types?
https://github.com/apache/spark/blob/ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L282-L285
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22019
I found this:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L69
in branch-1.6 and was thinking if we actually should just disallow. Let me take a closer look and make some fixes here with some details soon.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22019
@viirya and @MaxGekk, are you busy? Do you mind if I ask to take over this? we will completely disallow empty strings in other types and target it 3.0.0. The changes wouldn't be too much and it requires to update the migration guide.
I will be busy for a couple of weeks so I would appreciate it if you find some time to take over this.
Otherwise, I will start to work on this after a couple of weeks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22019
**[Test build #94342 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94342/testReport)** for PR 22019 at commit [`ef57fdd`](https://github.com/apache/spark/commit/ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22019
**[Test build #94342 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94342/testReport)** for PR 22019 at commit [`ef57fdd`](https://github.com/apache/spark/commit/ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22019
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22019: [WIP][SPARK-25040][SQL] Empty string for double a...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon closed the pull request at:
https://github.com/apache/spark/pull/22019
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22019
What do you guys think about completely disallow empty strings in other types and target it 3.0.0? In theory, empty string is a string and I think strictly it's more correct to disallow them. Seems that's the original intention in the code.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22019
Looks few other types could potentially have this issue too. Let me fix them all here while I am here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/22019
@HyukjinKwon thanks for pinging me. Let's look at this and see if I can make a PR soon.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22019
SGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22019
I agree with this proposal @HyukjinKwon. I think it is wrong to consider as a null an empty string. An empty string is not a valid value for an int/double/... So in case we have, we should fail I think.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22019
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94342/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22019
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1890/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/22019
SGTM too.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22019
Let me reopen a PR and proceed this after 2.4.0 or the code freeze
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22019
Hm.. wait let me take a closer look.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22019
cc @cloud-fan, @viirya and @fuqiliang
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22019
> Empty string should be treated as null for all non string types?
I would exclude complex types.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22019
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/22019
Probably I will not be able to look at this for the next a few weeks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org