You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Asif (Jira)" <ji...@apache.org> on 2023/11/16 21:01:00 UTC

[jira] [Commented] (SPARK-45959) Abusing DataSet.withColumn can cause huge tree with severe perf degradation

    [ https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786941#comment-17786941 ] 

Asif commented on SPARK-45959:
------------------------------

will create a PR for the same..

> Abusing DataSet.withColumn can cause huge tree with severe perf degradation
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-45959
>                 URL: https://issues.apache.org/jira/browse/SPARK-45959
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.1
>            Reporter: Asif
>            Priority: Major
>
> Though documentation clearly recommends to add all columns in a single shot, but in reality is difficult to expect customer to modify their code, as in spark2  the rules in analyzer were such that  they did not do deep tree traversal.  Moreover in Spark3 , the plans are cloned before giving to analyzer , optimizer etc which was not the case in Spark2.
> All these things have resulted in query time being increased from 5 min to 2 - 3 hrs.
> Many times the columns are added to plan via some for loop logic which just keeps adding new computation based on some rule.
> So,  my suggestion is to do some intial check in the withColumn api, before creating a new projection, like if all the existing columns are still being projected, and the new column being added has an expression which is not depending on the output of the top node , but its child,  then instead of adding a new project, the column can be added to the existing node.
> For starts, may be we can just handle Project node ..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org