You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Xuedong Luan (Jira)" <ji...@apache.org> on 2019/11/01 03:26:00 UTC

[jira] [Commented] (SPARK-29694) Execute UDF only once when there are multiple identical UDF usages

    [ https://issues.apache.org/jira/browse/SPARK-29694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964571#comment-16964571 ] 

Xuedong Luan commented on SPARK-29694:
--------------------------------------

hi [~yumwang]

 I will work on this Jira

> Execute UDF only once when there are multiple identical UDF usages
> ------------------------------------------------------------------
>
>                 Key: SPARK-29694
>                 URL: https://issues.apache.org/jira/browse/SPARK-29694
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Example:
> {code:sql}
> SELECT
> 	CASE
> 		WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
> 		WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
> 		WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
> 		WHEN col2 IN (28,29) THEN
> 		WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
> 		WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
> 		WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 'Google'
> 		WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE '%gmail%' THEN 'Google'
> 		WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo'
> 		WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing'
> 		WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook'
> 		WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest'
> 		WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE '%t.co' THEN 'Twitter'
> 		WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu'
> 		WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex'
> 		WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL'
> 		WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask'
> 		WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
> 		WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online'
> 		WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
> 		WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE '%comdesc.%' THEN 'com'
> 		WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal'
> 		WHEN udf1(col1, 'rd') IS NULL THEN 'None'
> 		ELSE 'Other' END AS source_domain,
> 	COUNT(*) AS cnt
> FROM
> 	tbl s
> GROUP BY
> 	1
> {code}
> We can rewrite it to:
> {code:sql}
> SELECT
> 	CASE
> 		WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
> 		WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
> 		WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
> 		WHEN col2 IN (28,29) THEN 'Google'
> 		WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
> 		WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
> 		WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google'
> 		WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google'
> 		WHEN col1 LIKE '%yahoo%' THEN 'Yahoo'
> 		WHEN col1 LIKE '%bing%' THEN 'Bing'
> 		WHEN col1 LIKE '%facebook%' THEN 'Facebook'
> 		WHEN col1 LIKE '%pinterest%' THEN 'Pinterest'
> 		WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter'
> 		WHEN col1 LIKE '%baidu%' THEN 'Baidu'
> 		WHEN col1 LIKE '%yandex%' THEN 'Yandex'
> 		WHEN col1 LIKE '%aol.%' THEN 'AOL'
> 		WHEN col1 LIKE '%ask.%' THEN 'Ask'
> 		WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
> 		WHEN col1 LIKE '%t-online.de' THEN 'T-Online'
> 		WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
> 		WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com'
> 		WHEN col1 LIKE '%paypal.%' THEN 'PayPal'
> 		WHEN col1 IS NULL THEN 'None'
> 		ELSE 'Other' END AS source_domain,
> 	COUNT(*) AS cnt
> FROM
> 	(SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s
> GROUP BY
> 	1
> {code}
> It would be great if we could optimize it by the framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org