You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yuming Wang (Jira)" <ji...@apache.org> on 2019/11/01 02:09:00 UTC
[jira] [Created] (SPARK-29694) Execute UDF only once when there are
multiple identical UDF usages
Yuming Wang created SPARK-29694:
-----------------------------------
Summary: Execute UDF only once when there are multiple identical UDF usages
Key: SPARK-29694
URL: https://issues.apache.org/jira/browse/SPARK-29694
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang
Example:
{code:sql}
SELECT
CASE
WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
WHEN col2 IN (28,29) THEN
WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 'Google'
WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE '%gmail%' THEN 'Google'
WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo'
WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing'
WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook'
WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest'
WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE '%t.co' THEN 'Twitter'
WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu'
WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex'
WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL'
WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask'
WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online'
WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE '%comdesc.%' THEN 'com'
WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal'
WHEN udf1(col1, 'rd') IS NULL THEN 'None'
ELSE 'Other' END AS source_domain,
COUNT(*) AS cnt
FROM
tbl s
GROUP BY
1
{code}
We can rewrite it to:
{code:sql}
SELECT
CASE
WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
WHEN col2 IN (28,29) THEN 'Google'
WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google'
WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google'
WHEN col1 LIKE '%yahoo%' THEN 'Yahoo'
WHEN col1 LIKE '%bing%' THEN 'Bing'
WHEN col1 LIKE '%facebook%' THEN 'Facebook'
WHEN col1 LIKE '%pinterest%' THEN 'Pinterest'
WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter'
WHEN col1 LIKE '%baidu%' THEN 'Baidu'
WHEN col1 LIKE '%yandex%' THEN 'Yandex'
WHEN col1 LIKE '%aol.%' THEN 'AOL'
WHEN col1 LIKE '%ask.%' THEN 'Ask'
WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
WHEN col1 LIKE '%t-online.de' THEN 'T-Online'
WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com'
WHEN col1 LIKE '%paypal.%' THEN 'PayPal'
WHEN col1 IS NULL THEN 'None'
ELSE 'Other' END AS source_domain,
COUNT(*) AS cnt
FROM
(SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s
GROUP BY
1
{code}
It would be great if we could optimize it by the framework.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org