You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/05/31 02:24:00 UTC

[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

    [ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354181#comment-17354181 ] 

Hyukjin Kwon commented on SPARK-35564:
--------------------------------------

cc [~viirya] FYI

> Support subexpression elimination for non-common branches of conditional expressions
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-35564
>                 URL: https://issues.apache.org/jira/browse/SPARK-35564
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Adam Binford
>            Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-33337 added support for pulling subexpressions out of branches of conditional expressions for expressions present in all branches. We should be able to take this a step further and pull out subexpressions for any branch, as long as that expression will definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for subexpression elimination (you can see two regular expression replacements in the codegen). But whenever the length is greater than 0, we will have to execute the regular expression replacement twice. Since we know we will always calculate `col` at least once, it makes sense to consider that as a subexpression since we might need it again in the branch value. So we can update the logic from:
> Create a subexpression if an expression will always be evaluated at least twice
> To:
> Create a subexpression if an expression will always be evaluated at least once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split subexpressions) if the second evaluation doesn't happen, but this seems like it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org