You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Simeon Simeonov (JIRA)" <ji...@apache.org> on 2016/08/08 17:31:20 UTC

[jira] [Created] (SPARK-16954) UDFs should allow output type to be specified in terms of the input type

Simeon Simeonov created SPARK-16954:
---------------------------------------

             Summary: UDFs should allow output type to be specified in terms of the input type
                 Key: SPARK-16954
                 URL: https://issues.apache.org/jira/browse/SPARK-16954
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Simeon Simeonov


Consider an {{array_compact}} UDF that removes {{null}} values from an array. There is no easy way to implement this UDF because an explicit return type with a {{TypeTag}} is required by code generation.

The interesting observation here is that the output type of `array_compact` is the same as its input type. In general, there is a broad class of UDFs, especially collection-oriented ones, whose output types are functions of the input types. In our Spark work we have found collection manipulation UDFs to be very powerful for cleaning up data and substantially improving performance, in particular, avoiding {{explode}} followed by {{groupBy}}. It would be nice if Spark made adding these types of UDFs very easy.

I won't go into possible ways to implement this under the covers as there are many options but I do want to point out that it is possible to communicate the right type information to Spark without changing the signature for UDF registration using placeholder types, e.g.,

{code}
sealed trait UDFArgumentAtPosition
case class ArgPos1 extends UDFArgumentAtPosition
case class ArgPos2 extends UDFArgumentAtPosition
// ...

case class Struct[ArgPos <: UDFArgumentAtPosition](value: Row)
case class ArrayElement[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
case class MapKey[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
case class MapValue[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)

// Functions are stubbed just to show compilation succeeds
def arrayCompact[A : TypeTag](xs: Seq[A]): ArgPos1 = null
def arraySum[A : Numeric : TypeTag](xs: Seq[A]): ArrayElement[ArgPos1, A] = ArrayElement(implicitly[Numeric[A]].zero) 
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org