Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2019/06/09 20:15:00 UTC

[jira] [Commented] (SPARK-27775) Support multiple return values for udf

    [ https://issues.apache.org/jira/browse/SPARK-27775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859553#comment-16859553 ] 

Dongjoon Hyun commented on SPARK-27775:
---------------------------------------

Hi, [~advancedxy]. Thank you for filing this JIRA issue. I updated the Affects Version to 3.0.0 because new improvements are not allowed to be backported (to 2.4.x).

> Support multiple return values for udf
> --------------------------------------
>
>                 Key: SPARK-27775
>                 URL: https://issues.apache.org/jira/browse/SPARK-27775
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Xianjin YE
>            Priority: Major
>
> Hi, I'd like to propose an improvement to Spark SQL, namely multi-alias support for UDFs, which is inspired by one of our internal SQL systems.
>  
> Currently, Spark SQL and Hive don't support multiple return values for a single UDF. One alternative is to return a StructType from the UDF and then select the corresponding fields. That approach has two downsides:
>  * The SQL is more complex than with multi alias and becomes hard to read when several similar UDFs are involved.
>  * The UDF code is evaluated multiple times, once per projected field.
> For example, suppose a UDF is defined as below:
> {code:java}
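> // Scala
> import org.apache.spark.sql.functions.udf
> def myFunc: (String => (String, String)) = { s => println("xx"); (s.toLowerCase, s.toUpperCase)}
> val myUDF = udf(myFunc)
> // Assumed step, not in the original snippet: register the UDF so that spark.sql(...) can resolve it by name
> spark.udf.register("myUDF", myUDF)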
> {code}
> To select multiple fields from the result of myUDF, I have to write:
> {code:java}
> // Scala
> spark.sql("select id, myUDF(id)._1, myUDF(id)._2 from t1").explain()
> == Physical Plan ==
> *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 AS UDF(id)._1#163, UDF(cast(id#12L as string))._2 AS UDF(id)._2#164]
> +- *(1) Range (0, 10, step=1, splits=48)
> {code}
> or 
> {code:java}
> // Scala
> spark.sql("select id, id1._1, id1._2 from (select id, myUDF(id) as id1 from t1) t2").explain()
> == Physical Plan ==
> *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 AS _1#155, UDF(cast(id#12L as string))._2 AS _2#156]
> +- *(1) Range (0, 10, step=1, splits=48)
> {code}
> In both cases, the UDF `myUDF` is evaluated twice, once for each projected field.
> If multi alias were supported for UDFs that return a struct, we could simply write the following and extract multiple return values with only one evaluation of the UDF:
> {code:java}
> // Scala
> spark.sql("select id, myUDF(id) as (x, y) from t1")
> {code}
>  
> [SPARK-5383|https://issues.apache.org/jira/browse/SPARK-5383] added multi-alias support for UDTFs, but UDFs are not supported yet. cc [~scwf] and [~cloud_fan]
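
Editorial note, not part of the original issue: until something like the proposed multi alias exists, one workaround that is sometimes used for the double evaluation described above is to mark the UDF as nondeterministic, which should stop the optimizer from collapsing the two projections and duplicating the UDF call. The sketch below is a minimal illustration under that assumption. The object name, the local SparkSession settings, and the column aliases x and y are hypothetical, and the resulting plan is worth checking with explain() on the Spark version in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical driver object, used only for this illustration.
object SingleEvalSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; these settings are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("udf-single-eval-sketch")
      .getOrCreate()

    // Same function as in the issue: returns a tuple, which Spark maps to a struct with fields _1 and _2.
    val myFunc: String => (String, String) = s => (s.toLowerCase, s.toUpperCase)

    // asNondeterministic() marks the UDF so the optimizer is not free to duplicate the call.
    val myUDF = udf(myFunc).asNondeterministic()

    // Stand-in for table t1 from the issue: ids 0 to 9 cast to string.
    val t1 = spark.range(10).select(col("id").cast("string").as("id"))

    // Compute the struct once, then pull out both fields under the desired names.
    val result = t1
      .withColumn("id1", myUDF(col("id")))
      .select(col("id"), col("id1._1").as("x"), col("id1._2").as("y"))

    // Inspect the physical plan to see whether the UDF appears once or twice.
    result.explain()
    result.show()

    spark.stop()
  }
}
```

This keeps the struct-returning UDF from the issue unchanged; the trade-off is that a nondeterministic UDF also opts out of other optimizations, so it is a stopgap rather than a replacement for the proposed syntax.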


