Posted to issues@spark.apache.org by "Aishwarya Srivastava (Jira)" <ji...@apache.org> on 2022/05/26 21:28:00 UTC

[jira] [Created] (SPARK-39307) Need a spark.sql.function to return array from a collection of arrays

Aishwarya Srivastava created SPARK-39307:
--------------------------------------------

             Summary:   Need a spark.sql.function to return array from a collection of arrays
                 Key: SPARK-39307
                 URL: https://issues.apache.org/jira/browse/SPARK-39307
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Aishwarya Srivastava


In many scenarios, a lot of UDFs end up centered around chunking collections into fixed-size* sub-arrays or arrays of strings.

The only alternative is to use complex and expensive regex that does not work for non-String use cases.

*The final sub-array holds any leftover elements, so it may be shorter than the requested size.
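For context, the kind of UDF workaround described above might look like the following. This is only a sketch under stated assumptions: the names ChunkUdfSketch, chunkInts, and chunkString are hypothetical and not part of Spark, and a separate helper is needed for each element type, which is part of the motivation for a built-in function.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

object ChunkUdfSketch {
  // Hypothetical helper: split a sequence of Ints into chunks of `size`.
  // The last chunk may be shorter. Null input yields null.
  def chunkInts(xs: Seq[Int], size: Int): Seq[Seq[Int]] =
    if (xs == null) null else xs.grouped(size).map(_.toList).toList

  // Hypothetical helper: split a String into substrings of length `size`.
  def chunkString(s: String, size: Int): Seq[String] =
    if (s == null) null else s.grouped(size).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("chunk-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Wrap the plain Scala helpers as Spark UDFs.
    val chunkIntsUdf = udf(chunkInts _)
    val chunkStringUdf = udf(chunkString _)

    val df = Seq((Seq(0, 1, 2, 3, 4), "Samuel Shepard")).toDF("nums", "name")

    df.select(
      chunkIntsUdf(col("nums"), lit(2)).as("num_chunks"),
      chunkStringUdf(col("name"), lit(3)).as("name_chunks")
    ).show(truncate = false)

    spark.stop()
  }
}
```

UDFs like these are opaque to the Catalyst optimizer, so a native function covering both Strings and Arrays would likely be cheaper as well as simpler to use.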

Example

===

scala> def grouped_by_size[C](size: Int, collection: Iterable[C]) = collection.grouped(size).toArray
grouped_by_size: [C](size: Int, collection: Iterable[C])Array[Iterable[C]]

scala> grouped_by_size(2, "Samuel Shepard")
res0: Array[Iterable[Char]] = Array(Sa, mu, el,  S, he, pa, rd)

scala> grouped_by_size(2, (0 to 4).toSeq )
res1: Array[Iterable[Int]] = Array(Vector(0, 1), Vector(2, 3), Vector(4))

scala> grouped_by_size(3, "Samuel Shepard")
res2: Array[Iterable[Char]] = Array(Sam, uel,  Sh, epa, rd)

scala> grouped_by_size(3, Array("This","is","my","last","example"))
res3: Array[Iterable[String]] = Array(WrappedArray(This, is, my), WrappedArray(last, example))
 * Depending on the data domain, the elements of an array or string may have a natural periodicity that carries semantic meaning.
 * Being able to easily divide a large collection into an Array of Arrays or an Array of Strings helps in applying transforms, explodes, and other array functions (zip, etc.) at the appropriate level of periodicity for the data domain.
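
To illustrate the two points above in plain Scala (no Spark involved), here is a minimal sketch of chunking a collection and then working at the chunk level. The object and method names (PeriodicitySketch, chunk, chunkSums, explodeWithPosition) are hypothetical and chosen only for this example.

```scala
object PeriodicitySketch {
  // Split a sequence into fixed-size chunks; the last chunk holds the remainder.
  def chunk[A](xs: Seq[A], size: Int): Vector[Vector[A]] =
    xs.grouped(size).map(_.toVector).toVector

  // Transform at the chunk level: one sum per chunk.
  def chunkSums(xs: Seq[Int], size: Int): Vector[Int] =
    chunk(xs, size).map(_.sum)

  // Explode-like flattening: tag each element with its chunk index
  // and its position inside that chunk.
  def explodeWithPosition[A](xs: Seq[A], size: Int): Vector[(Int, Int, A)] =
    chunk(xs, size).zipWithIndex.flatMap { case (c, i) =>
      c.zipWithIndex.map { case (v, j) => (i, j, v) }
    }

  def main(args: Array[String]): Unit = {
    val readings = Seq(1, 2, 3, 4, 5, 6, 7)
    println(chunk(readings, 3))      // Vector(Vector(1, 2, 3), Vector(4, 5, 6), Vector(7))
    println(chunkSums(readings, 3))  // Vector(6, 15, 7)
    println(explodeWithPosition(Seq("a", "b", "c"), 2)) // Vector((0,0,a), (0,1,b), (1,0,c))
  }
}
```

A built-in Spark function with the same chunking semantics would let the per-chunk steps be expressed with existing array functions rather than custom code.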

Pros:
 * Current methods for creating periodicity work only on Strings and require complex, inefficient regular expressions instead of a simpler and more direct solution. The proposed solution would work for both Strings and Arrays.


