Posted to issues@spark.apache.org by "Michel Davit (JIRA)" <ji...@apache.org> on 2018/04/15 08:43:00 UTC
[jira] [Created] (SPARK-23986) CompileException when using too many avg aggregation after joining
Michel Davit created SPARK-23986:
------------------------------------
Summary: CompileException when using too many avg aggregation after joining
Key: SPARK-23986
URL: https://issues.apache.org/jira/browse/SPARK-23986
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Michel Davit
Consider the following code:
{code:java}
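import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg
// Assumes an existing SparkSession named sparkSession; its implicits provide toDF.
import sparkSession.implicits._

val df1: DataFrame = sparkSession.sparkContext
  .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
  .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
val df2: DataFrame = sparkSession.sparkContext
  .makeRDD(Seq((0, "val1", "val2")))
  .toDF("key", "dummy1", "dummy2")
// Left outer join on key, then six avg aggregations grouped by key.
val agg = df1
  .join(df2, df1("key") === df2("key"), "leftouter")
  .groupBy(df1("key"))
  .agg(
    avg("col2").as("avg2"),
    avg("col3").as("avg3"),
    avg("col4").as("avg4"),
    avg("col1").as("avg1"),
    avg("col5").as("avg5"),
    avg("col6").as("avg6")
  )
// Triggers code generation and the compile error.
val head = agg.take(1)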
{code}
This logs the following exception:
{code:java}
ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 467, Column 28: Redefinition of parameter "agg_expr_11"
{code}
I am not a Spark expert, but after investigating I realized that the generated {{doConsume}} method is responsible for the exception.
Indeed, {{avg}} calls {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} several times: the first time with the 'avg' Expr and a second time for the base aggregation Expr (count and sum).
The problem comes from the generation of parameters in CodeGenerator:
{code:java}
/**
 * Returns a term name that is unique within this instance of a `CodegenContext`.
 */
def freshName(name: String): String = synchronized {
  val fullName = if (freshNamePrefix == "") {
    name
  } else {
    s"${freshNamePrefix}_$name"
  }
  if (freshNameIds.contains(fullName)) {
    val id = freshNameIds(fullName)
    freshNameIds(fullName) = id + 1
    s"$fullName$id"
  } else {
    freshNameIds += fullName -> 1
    fullName
  }
}
{code}
The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the first call.
The second call is made with {{agg_expr_[1..12]}} and generates the following names:
{{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have 2 parameter name conflicts in the generated code: {{agg_expr_11}} and {{agg_expr_12}}.
Appending the id directly, as in {{s"$fullName$id"}}, to generate a unique term name is the source of the conflict. Maybe simply using an underscore as a separator would solve this issue: {{s"${fullName}_$id"}}
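The naming scheme described above can be reproduced outside Spark with a small standalone sketch. {{NameGen}}, {{FreshNameDemo}}, {{secondRound}} and {{duplicates}} are hypothetical names made up for this illustration and are not part of Spark. The prefix handling is left out, and the sequence of calls (names 1 to 6, then names 1 to 12) is only an assumption based on the description above, not a model of what Spark's code generation really does.

{code:scala}
import scala.collection.mutable

// Simplified stand-in for the freshName logic quoted above.
// Hypothetical class for illustration only, not Spark code.
class NameGen(separator: String) {
  private val freshNameIds = mutable.HashMap.empty[String, Int]

  def freshName(name: String): String = synchronized {
    if (freshNameIds.contains(name)) {
      val id = freshNameIds(name)
      freshNameIds(name) = id + 1
      // separator == "" mimics the current s"$fullName$id" behaviour
      s"$name$separator$id"
    } else {
      freshNameIds += name -> 1
      name
    }
  }
}

object FreshNameDemo {
  // Request agg_expr_1..6 once, then agg_expr_1..12, and return
  // the names produced by the second round.
  def secondRound(gen: NameGen): Seq[String] = {
    (1 to 6).foreach(i => gen.freshName(s"agg_expr_$i"))
    (1 to 12).map(i => gen.freshName(s"agg_expr_$i"))
  }

  // Names that occur more than once in the given sequence.
  def duplicates(names: Seq[String]): Set[String] =
    names.groupBy(identity)
      .collect { case (n, occurrences) if occurrences.size > 1 => n }
      .toSet

  def main(args: Array[String]): Unit = {
    println(duplicates(secondRound(new NameGen(""))))  // prints Set(agg_expr_11)
    println(duplicates(secondRound(new NameGen("_")))) // prints Set()
  }
}
{code}

With the empty separator, the re-requested {{agg_expr_1}} becomes {{agg_expr_11}}, the same string as the first-time name {{agg_expr_11}}. This simplified sequence only reproduces the {{agg_expr_11}} clash. With the underscore separator the second round has no duplicates, although a caller-supplied name that already ends in {{_<digits>}} could presumably still collide.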