Posted to commits@spark.apache.org by we...@apache.org on 2022/01/06 07:52:34 UTC
[spark] branch master updated: [SPARK-37822][SQL] StringSplit should return an array of non-null elements
This is an automated email from the ASF dual-hosted git repository.
wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new db15b82 [SPARK-37822][SQL] StringSplit should return an array of non-null elements
db15b82 is described below
commit db15b82be96cfd0f392b149b43b06148b639d9d7
Author: Shardul Mahadik <sm...@linkedin.com>
AuthorDate: Thu Jan 6 15:51:33 2022 +0800
[SPARK-37822][SQL] StringSplit should return an array of non-null elements
### What changes were proposed in this pull request?
Currently, `split` [returns the data type](https://github.com/apache/spark/blob/08dd010860cc176a33073928f4c0780d0ee98a08/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L532) `ArrayType(StringType)` which means the resultant array can contain nullable elements. However I do not see any case where the array can contain nulls.
In the case where either the provided string or the delimiter is `NULL`, the output is a `NULL` array. For an empty input string, or when there are no characters between delimiters, the output array contains empty strings but never `NULL`s. So I propose changing the return type of `split` to mark its elements as non-null.
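The behavior described above can be illustrated outside Spark with a small sketch in plain Python (the helper name `split_like_spark` is invented for this illustration; Spark itself delegates to Java's regex split):

```python
import re

def split_like_spark(s, pattern):
    """Illustrative sketch: splitting can yield empty strings for adjacent
    delimiters or an empty input, but never null/None elements -- which is
    why the result array can be marked containsNull = false."""
    if s is None or pattern is None:
        return None  # NULL string or NULL delimiter -> NULL array
    return re.split(pattern, s)

print(split_like_spark("a,,c", ","))  # ['a', '', 'c'] -- empty string, not None
print(split_like_spark("", ","))      # [''] -- still no None elements
print(split_like_spark(None, ","))    # None -- the whole array is NULL
```

In every non-`NULL` case the element list is free of `None` values, matching the claim that only the array itself, never its elements, can be null.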
### Why are the changes needed?
Provides a more accurate return type for `split`.
### Does this PR introduce _any_ user-facing change?
Yes, schema for queries using `split` will change to show an array of non-null elements.
### How was this patch tested?
Trivial change. Manually tested with the Spark shell:
```
scala> spark.sql("SELECT split('a,b,c', ',')").printSchema
root
|-- split(a,b,c, ,, -1): array (nullable = false)
| |-- element: string (containsNull = false)
```
I can't think of a better test than asserting `StringSplit().dataType == ArrayType(StringType, containsNull = false)`, which would just duplicate the actual definition in `StringSplit`.
Closes #35111 from shardulm94/spark-37822.
Authored-by: Shardul Mahadik <sm...@linkedin.com>
Signed-off-by: Wenchen Fan <we...@databricks.com>
---
.../org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index e14e9ab..889c53b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -529,7 +529,7 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
case class StringSplit(str: Expression, regex: Expression, limit: Expression)
extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
- override def dataType: DataType = ArrayType(StringType)
+ override def dataType: DataType = ArrayType(StringType, containsNull = false)
override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
override def first: Expression = str
override def second: Expression = regex