You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Jesse Rickard (JIRA)" <ji...@apache.org> on 2019/01/15 19:16:00 UTC
[jira] [Created] (SPARK-26626) Repeated use of aliases can cause
driver OOMs
Jesse Rickard created SPARK-26626:
-------------------------------------
Summary: Repeated use of aliases can cause driver OOMs
Key: SPARK-26626
URL: https://issues.apache.org/jira/browse/SPARK-26626
Project: Spark
Issue Type: Bug
Components: Optimizer, SQL
Affects Versions: 2.4.0
Reporter: Jesse Rickard
If a query contains many aliases that are used multiple times, it can cause the driver to OOM, as catalyst will recursively substitute the aliases, making the expression tree size grow exponentially.
For example:
{noformat}
scala> var df = Seq(1, 2, 3).toDF("a").withColumn("b", lit(10)).cache()
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> for( i <- 1 to 5 ) {
| df = df.select(('a + 'b).as('a), ('a - 'b).as('b))
| }
scala> df.queryExecution.analyzed
res11: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [(a#526 + b#527) AS a#530, (a#526 - b#527) AS b#531]
+- Project [(a#522 + b#523) AS a#526, (a#522 - b#523) AS b#527]
+- Project [(a#518 + b#519) AS a#522, (a#518 - b#519) AS b#523]
+- Project [(a#514 + b#515) AS a#518, (a#514 - b#515) AS b#519]
+- Project [(a#509 + b#511) AS a#514, (a#509 - b#511) AS b#515]
+- Project [a#509, 10 AS b#511]
+- Project [value#507 AS a#509]
+- LocalRelation [value#507]
scala> df.queryExecution.optimizedPlan
res12: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [(((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) + ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS a#530, (((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) + (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511)))) - ((((a#509 + b#511) + (a#509 - b#511)) + ((a#509 + b#511) - (a#509 - b#511))) - (((a#509 + b#511) + (a#509 - b#511)) - ((a#509 + b#511) - (a#509 - b#511))))) AS b#531]
+- InMemoryRelation [a#509, b#511], StorageLevel(disk, memory, deserial...
{noformat}
In larger real-world instances of this, the expression tree size can explode so large as to OOM the driver.
This is caused by CollapseProject and PhysicalOperation recursively substituting all aliases, without consideration for the effect on the size of the expression tree.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org