Posted to commits@spark.apache.org by we...@apache.org on 2022/01/06 07:52:34 UTC

[spark] branch master updated: [SPARK-37822][SQL] StringSplit should return an array of non-null elements

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new db15b82  [SPARK-37822][SQL] StringSplit should return an array of non-null elements
db15b82 is described below

commit db15b82be96cfd0f392b149b43b06148b639d9d7
Author: Shardul Mahadik <sm...@linkedin.com>
AuthorDate: Thu Jan 6 15:51:33 2022 +0800

    [SPARK-37822][SQL] StringSplit should return an array of non-null elements
    
    ### What changes were proposed in this pull request?
    Currently, `split` [returns the data type](https://github.com/apache/spark/blob/08dd010860cc176a33073928f4c0780d0ee98a08/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L532) `ArrayType(StringType)`, which means the resultant array is marked as possibly containing null elements. However, I do not see any case where the array can actually contain nulls.
    
    In the case where either the provided string or the delimiter is `NULL`, the output will be a `NULL` array. In the case of an empty string or no characters between delimiters, the output array will contain empty strings but never `NULL`s. So I propose we change the return type of `split` to mark elements as non-null.
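    
    As a quick illustration of the null behavior described above, here is a minimal sketch run against a local `SparkSession` (not part of this patch):
    ```
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    
    // NULL input string: the array itself is NULL, not its elements.
    spark.sql("SELECT split(CAST(NULL AS STRING), ',')").show()
    
    // Empty segments between delimiters yield empty strings, never NULL elements.
    spark.sql("SELECT split('a,,c', ',')").show()  // [a, , c]
    ```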
    
    ### Why are the changes needed?
    Provides a more accurate return type for `split`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the schema for queries using `split` will change to show an array of non-null elements.
    
    ### How was this patch tested?
    Trivial change. Manually tested with the Spark shell:
    ```
    scala> spark.sql("SELECT split('a,b,c', ',')").printSchema
    root
     |-- split(a,b,c, ,, -1): array (nullable = false)
     |    |-- element: string (containsNull = false)
    ```
    I can't think of a better test case than just testing `StringSplit().dataType == ArrayType(StringType, containsNull = false)`, at which point it is just duplicating the actual definition of `StringSplit`.
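    
    For reference, such an assertion would look like the following hypothetical sketch (the `Literal` arguments are arbitrary, and the check merely restates the definition changed by this patch):
    ```
    import org.apache.spark.sql.catalyst.expressions.{Literal, StringSplit}
    import org.apache.spark.sql.types.{ArrayType, StringType}
    
    // Restates the new definition: elements of the result are non-null.
    val expr = StringSplit(Literal("a,b,c"), Literal(","), Literal(-1))
    assert(expr.dataType == ArrayType(StringType, containsNull = false))
    ```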
    
    Closes #35111 from shardulm94/spark-37822.
    
    Authored-by: Shardul Mahadik <sm...@linkedin.com>
    Signed-off-by: Wenchen Fan <we...@databricks.com>
---
 .../org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index e14e9ab..889c53b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -529,7 +529,7 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType)
+  override def dataType: DataType = ArrayType(StringType, containsNull = false)
   override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
   override def first: Expression = str
   override def second: Expression = regex

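For context, `ArrayType`'s second parameter records whether array elements may be
null, and it defaults to true when omitted; the one-line change above tightens
exactly this flag. A minimal sketch, independent of this commit:

    import org.apache.spark.sql.types.{ArrayType, StringType}

    // ArrayType(StringType) is shorthand for ArrayType(StringType, containsNull = true),
    // so the patch marks the elements of split's result as non-null instead.
    assert(ArrayType(StringType) == ArrayType(StringType, containsNull = true))
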
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org