Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2019/01/15 19:25:00 UTC

[jira] [Assigned] (SPARK-26626) Repeated use of aliases can cause driver OOMs

     [ https://issues.apache.org/jira/browse/SPARK-26626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26626:
------------------------------------

    Assignee:     (was: Apache Spark)

> Repeated use of aliases can cause driver OOMs
> ---------------------------------------------
>
>                 Key: SPARK-26626
>                 URL: https://issues.apache.org/jira/browse/SPARK-26626
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.4.0
>            Reporter: Jesse Rickard
>            Priority: Major
>
> If a query contains many aliases that are each referenced multiple times, the driver can run out of memory (OOM): Catalyst recursively substitutes the aliases, so the size of the expression tree grows exponentially.
> For example:
>  
> {noformat}
> scala> var df = Seq(1, 2, 3).toDF("a").withColumn("b", lit(10)).cache()
> df: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> for( i <- 1 to 5 ) {
>  | df = df.select(('a + 'b).as('a), ('a - 'b).as('b))
>  | }
> scala> df.queryExecution.analyzed
> res11: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [(a#526 + b#527) AS a#530, (a#526 - b#527) AS b#531]
> +- Project [(a#522 + b#523) AS a#526, (a#522 - b#523) AS b#527]
>    +- Project [(a#518 + b#519) AS a#522, (a#518 - b#519) AS b#523]
>       +- Project [(a#514 + b#515) AS a#518, (a#514 - b#515) AS b#519]
>          +- Project [(a#509 + b#511) AS a#514, (a#509 - b#511) AS b#515]
>             +- Project [a#509, 10 AS b#511]
>                +- Project [value#507 AS a#509]
>                   +- LocalRelation [value#507]
> scala> df.queryExecution.optimizedPlan
> res12: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [(((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) + ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS a#530, (((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) - ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS b#531]
> +- InMemoryRelation [a#509, b#511], StorageLevel(disk, memory, deserial...
> {noformat}
>  
>  
> In larger real-world instances of this, the expression tree can grow large enough to OOM the driver.
>  
> This is caused by CollapseProject and PhysicalOperation recursively substituting all aliases, without consideration for the effect on the size of the expression tree.
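
The growth described above can be illustrated with a toy model. The sketch below is not Catalyst code: Expr, Attr, Add, Sub, and the AliasGrowth object are hypothetical names introduced only for illustration. It assumes each projection layer defines a as (a + b) and b as (a - b), as in the report, and counts the nodes that result when every layer's aliases are substituted into the layer above.

```scala
// Toy expression tree. These are illustrative types, not Spark's Catalyst classes.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class Sub(left: Expr, right: Expr) extends Expr

object AliasGrowth {
  // Number of nodes in an expression tree.
  def size(e: Expr): Int = e match {
    case Attr(_)   => 1
    case Add(l, r) => 1 + size(l) + size(r)
    case Sub(l, r) => 1 + size(l) + size(r)
  }

  // Replace each attribute reference with the expression it aliases, if any.
  def substitute(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Attr(n)   => aliases.getOrElse(n, e)
    case Add(l, r) => Add(substitute(l, aliases), substitute(r, aliases))
    case Sub(l, r) => Sub(substitute(l, aliases), substitute(r, aliases))
  }

  // Node counts of the fully substituted `a` and `b` expressions after
  // stacking `layers` projections of the form select(a + b as a, a - b as b).
  def collapsedSizes(layers: Int): (Int, Int) = {
    val start: Map[String, Expr] = Map("a" -> Attr("a"), "b" -> Attr("b"))
    val collapsed = (1 to layers).foldLeft(start) { (aliases, _) =>
      Map(
        "a" -> substitute(Add(Attr("a"), Attr("b")), aliases),
        "b" -> substitute(Sub(Attr("a"), Attr("b")), aliases)
      )
    }
    (size(collapsed("a")), size(collapsed("b")))
  }
}
```

In this model each layer references both aliases once, so the node count follows s(n) = 2 * s(n-1) + 1 with s(0) = 1, that is, s(n) = 2^(n+1) - 1. Five layers give 63 nodes per output column, which matches the 32 attribute references and 31 operators in each expression of the optimized plan above. The toy trees share subtrees by reference, so the model only counts nodes and does not itself reproduce the memory pressure.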



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org