Posted to issues@spark.apache.org by "Jesse Rickard (JIRA)" <ji...@apache.org> on 2019/01/15 19:16:00 UTC

[jira] [Created] (SPARK-26626) Repeated use of aliases can cause driver OOMs

Jesse Rickard created SPARK-26626:
-------------------------------------

             Summary: Repeated use of aliases can cause driver OOMs
                 Key: SPARK-26626
                 URL: https://issues.apache.org/jira/browse/SPARK-26626
             Project: Spark
          Issue Type: Bug
          Components: Optimizer, SQL
    Affects Versions: 2.4.0
            Reporter: Jesse Rickard


If a query contains many aliases that are each referenced more than once, the driver can run out of memory: Catalyst recursively substitutes every alias with its definition, so the expression tree grows exponentially with the number of stacked projections.

For example:

 
{noformat}
scala> var df = Seq(1, 2, 3).toDF("a").withColumn("b", lit(10)).cache()
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> for( i <- 1 to 5 ) {
 | df = df.select(('a + 'b).as('a), ('a - 'b).as('b))
 | }
scala> df.queryExecution.analyzed
res11: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [(a#526 + b#527) AS a#530, (a#526 - b#527) AS b#531]
+- Project [(a#522 + b#523) AS a#526, (a#522 - b#523) AS b#527]
   +- Project [(a#518 + b#519) AS a#522, (a#518 - b#519) AS b#523]
      +- Project [(a#514 + b#515) AS a#518, (a#514 - b#515) AS b#519]
         +- Project [(a#509 + b#511) AS a#514, (a#509 - b#511) AS b#515]
            +- Project [a#509, 10 AS b#511]
               +- Project [value#507 AS a#509]
                  +- LocalRelation [value#507]
scala> df.queryExecution.optimizedPlan
res12: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [(((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) + ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS a#530, (((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) - ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS b#531]
+- InMemoryRelation [a#509, b#511], StorageLevel(disk, memory, deserial...
{noformat}
 

 

In larger real-world queries, the expression tree can grow large enough to exhaust the driver's memory.

 

This is caused by CollapseProject and PhysicalOperation, which recursively substitute every alias without considering the effect on the size of the expression tree.
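
The growth rate can be illustrated without Spark. The following is a minimal sketch of alias substitution on a toy expression tree, not Catalyst's actual implementation: the names Ref, Add, Sub, AliasBlowup and collapsed are hypothetical and exist only for this illustration. It models the select(('a + 'b).as('a), ('a - 'b).as('b)) step from the example above and counts the nodes in each output column once the stacked projections have been collapsed into one.

```scala
// Toy model of alias substitution. These are NOT Spark's Catalyst classes;
// every name below is hypothetical and used for illustration only.
sealed trait Expr
final case class Ref(name: String) extends Expr             // column reference
final case class Add(left: Expr, right: Expr) extends Expr
final case class Sub(left: Expr, right: Expr) extends Expr

object AliasBlowup {
  // Number of nodes in the expression tree, as seen by a full traversal.
  def size(e: Expr): Long = e match {
    case Ref(_)    => 1L
    case Add(l, r) => 1L + size(l) + size(r)
    case Sub(l, r) => 1L + size(l) + size(r)
  }

  // Replace every column reference with the expression its alias stands for.
  def substitute(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Ref(n)    => aliases.getOrElse(n, e)
    case Add(l, r) => Add(substitute(l, aliases), substitute(r, aliases))
    case Sub(l, r) => Sub(substitute(l, aliases), substitute(r, aliases))
  }

  // Collapse `depth` stacked projections of the form
  //   select((a + b) as a, (a - b) as b)
  // into a single projection over the base columns a and b.
  def collapsed(depth: Int): Map[String, Expr] = {
    val step: Map[String, Expr] = Map(
      "a" -> Add(Ref("a"), Ref("b")),
      "b" -> Sub(Ref("a"), Ref("b")))
    val base: Map[String, Expr] = Map("a" -> Ref("a"), "b" -> Ref("b"))
    (1 to depth).foldLeft(base) { (lower, _) =>
      step.map { case (name, e) => name -> substitute(e, lower) }
    }
  }

  def main(args: Array[String]): Unit =
    for (depth <- Seq(1, 5, 10))
      println(s"depth=$depth nodes per column=${size(collapsed(depth)("a"))}")
}
```

In this model each additional projection doubles the tree and adds one operator, so a column has 2^(n+1) - 1 nodes after n projections. For n = 5 that is 63 nodes (32 column references and 31 operators), which matches the number of a#509 and b#511 references in each output column of the optimized plan above.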


