Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/01/17 19:28:00 UTC
[jira] [Assigned] (SPARK-37290) Exponential planning time in case of non-deterministic function
[ https://issues.apache.org/jira/browse/SPARK-37290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-37290:
------------------------------------
Assignee: Apache Spark
> Exponential planning time in case of non-deterministic function
> ---------------------------------------------------------------
>
> Key: SPARK-37290
> URL: https://issues.apache.org/jira/browse/SPARK-37290
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: Kaya Kupferschmidt
> Assignee: Apache Spark
> Priority: Major
>
> We are experiencing exponential growth of processing time for some DataFrame queries that include non-deterministic functions. I was able to create a small example program that can be pasted into the Spark shell to reproduce the issue:
> {code:scala}
> val adselect_raw = spark.createDataFrame(Seq(("imp-1",1),("imp-2",2)))
>   .cache()
> val adselect = adselect_raw.select(
>   expr("uuid()").alias("userUuid"),
>   expr("_1").alias("impressionUuid"),
>   expr("_1").alias("accessDateTime"),
>   expr("_1").alias("publisher"),
>   expr("_1").alias("site"),
>   expr("_1").alias("placement"),
>   expr("_1").alias("advertiser"),
>   expr("_1").alias("campaign"),
>   expr("_1").alias("lineItem"),
>   expr("_1").alias("creative"),
>   expr("_1").alias("browserLanguage"),
>   expr("_1").alias("geoLocode"),
>   expr("_1").alias("osFamily"),
>   expr("_1").alias("osName"),
>   expr("_1").alias("browserName"),
>   expr("_1").alias("referrerDomain"),
>   expr("_1").alias("placementIabCategory"),
>   expr("_1").alias("placementDeviceGroup"),
>   expr("_1").alias("placementDevice"),
>   expr("_1").alias("placementVideoType"),
>   expr("_1").alias("placementSection"),
>   expr("_1").alias("placementPlayer"),
>   expr("_1").alias("demandType"),
>   expr("_1").alias("techCosts"),
>   expr("_1").alias("mediaCosts"),
>   expr("_1").alias("directSPrice"),
>   expr("_1").alias("network"),
>   expr("_1").alias("deviceSetting"),
>   expr("_1").alias("placementGroup"),
>   expr("_1").alias("postalCode"),
>   expr("_1").alias("householdId")
> )
> val adcount_raw = spark.createDataFrame(Seq(("imp-1", 1), ("imp-2", 2)))
> val adcount = adcount_raw.select(
>   expr("_1").alias("impressionUuid"),
>   expr("_2").alias("accessDateTime")
> )
> val result = adselect.join(adcount, Seq("impressionUuid"))
> result.explain()
> {code}
> Further reducing the program (for example by removing the join or the cache) made the problem no longer reproducible.
> The problem occurs at planning time. Debugging led me to the function {{UnaryNode.getAllValidConstraints}}, where the local variable {{allConstraints}} grew with an apparently exponential number of entries for the non-deterministic function {{uuid()}} in the code example above. Every time a new column from the large select is processed in the {{foreach}} loop of {{UnaryNode.getAllValidConstraints}}, the number of entries for the {{uuid()}} column in the ExpressionSet seems to double:
> {code:scala}
> trait UnaryNode extends LogicalPlan with UnaryLike[LogicalPlan] {
>   override def getAllValidConstraints(projectList: Seq[NamedExpression]): ExpressionSet = {
>     var allConstraints = child.constraints
>     projectList.foreach {
>       case a @ Alias(l: Literal, _) =>
>         allConstraints += EqualNullSafe(a.toAttribute, l)
>       case a @ Alias(e, _) =>
>         // KK: Since the ExpressionSet handles each non-deterministic function as a
>         // separate entry, each "uuid()" entry in allConstraints is re-added over and
>         // over again in every iteration, thereby doubling the list every time
>         allConstraints ++= allConstraints.map(_ transform {
>           case expr: Expression if expr.semanticEquals(e) =>
>             a.toAttribute
>         })
>         allConstraints += EqualNullSafe(e, a.toAttribute)
>       case _ => // Don't change.
>     }
>     allConstraints
>   }
> }
> {code}
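The doubling can be modeled outside Spark with a small, self-contained sketch (the object and class names below are made up for illustration; this is not Spark code): if every non-deterministic entry compares as distinct, re-adding a transformed copy of the whole set once per projected column doubles the entry count each iteration.

```scala
// Standalone model of the blow-up described above (illustrative names,
// not Spark's API). Each NonDet instance uses default reference equality,
// mimicking how ExpressionSet keeps every non-deterministic uuid()
// expression as a separate entry that is never deduplicated.
object ConstraintDoubling {
  final class NonDet

  // Number of entries after processing `columns` further projection columns,
  // starting from a single non-deterministic constraint.
  def entriesAfter(columns: Int): Int = {
    var allConstraints: Set[AnyRef] = Set(new NonDet)
    for (_ <- 1 to columns) {
      // Mirrors "allConstraints ++= allConstraints.map(_ transform ...)":
      // every mapped copy is a new distinct object, so the set doubles.
      allConstraints = allConstraints ++ allConstraints.map(_ => new NonDet)
    }
    allConstraints.size
  }

  def main(args: Array[String]): Unit =
    (0 to 10).foreach(c => println(s"columns=$c entries=${entriesAfter(c)}"))
}
```

With 30 columns after the {{uuid()}} alias, this model predicts on the order of 2^30 entries, which matches the observed exponential planning time.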
> As a workaround, we moved the {{uuid()}} column to the end of the select list in our code, which resolved the issue, since by the time the {{uuid()}} alias is reached in the {{foreach}} loop, all other columns have already been processed.
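The effect of the workaround can be sketched with a simple model (illustrative names, not Spark code): under the mechanism described above, the non-deterministic entries double once for every column processed after the {{uuid()}} column, so placing it last leaves no later column to trigger the doubling.

```scala
// Illustrative model (not Spark code) of why moving uuid() to the end helps:
// in getAllValidConstraints, the uuid() entries double once per column that
// is processed AFTER the uuid() column in the projection list.
object WorkaroundEffect {
  def nonDetEntries(uuidIndex: Int, totalColumns: Int): Long = {
    require(uuidIndex >= 0 && uuidIndex < totalColumns)
    var entries = 1L
    for (col <- 0 until totalColumns) {
      if (col > uuidIndex) entries *= 2 // each later column doubles the set
    }
    entries
  }

  def main(args: Array[String]): Unit = {
    // 31 columns as in the reproduction above (userUuid + 30 others)
    println(nonDetEntries(0, 31))  // uuid() first, as reported: 2^30 entries
    println(nonDetEntries(30, 31)) // uuid() last, the workaround: 1 entry
  }
}
```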
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org