You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:23:41 UTC

[jira] [Updated] (SPARK-16954) UDFs should allow output type to be specified in terms of the input type

     [ https://issues.apache.org/jira/browse/SPARK-16954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16954:
---------------------------------
    Labels: SQL UDF bulk-closed types  (was: SQL UDF types)

> UDFs should allow output type to be specified in terms of the input type
> ------------------------------------------------------------------------
>
>                 Key: SPARK-16954
>                 URL: https://issues.apache.org/jira/browse/SPARK-16954
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Simeon Simeonov
>            Priority: Major
>              Labels: SQL, UDF, bulk-closed, types
>
> Consider an {{array_compact}} UDF that removes {{null}} values from an array. There is no easy way to implement this UDF because an explicit return type with a {{TypeTag}} is required by code generation.
> The interesting observation here is that the output type of `array_compact` is the same as its input type. In general, there is a broad class of UDFs, especially collection-oriented ones, whose output types are functions of the input types. In our Spark work we have found collection manipulation UDFs to be very powerful for cleaning up data and substantially improving performance, in particular, avoiding {{explode}} followed by {{groupBy}}. It would be nice if Spark made adding these types of UDFs very easy.
> I won't go into possible ways to implement this under the covers as there are many options but I do want to point out that it is possible to communicate the right type information to Spark without changing the signature for UDF registration using placeholder types, e.g.,
> {code}
> sealed trait UDFArgumentAtPosition
> case class ArgPos1 extends UDFArgumentAtPosition
> case class ArgPos2 extends UDFArgumentAtPosition
> // ...
> case class Struct[ArgPos <: UDFArgumentAtPosition](value: Row)
> case class ArrayElement[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
> case class MapKey[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
> case class MapValue[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
> // Functions are stubbed just to show compilation succeeds
> def arrayCompact[A : TypeTag](xs: Seq[A]): ArgPos1 = null
> def arraySum[A : Numeric : TypeTag](xs: Seq[A]): ArrayElement[ArgPos1, A] = ArrayElement(implicitly[Numeric[A]].zero) 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org