Posted to issues@spark.apache.org by "Ohad Raviv (JIRA)" <ji...@apache.org> on 2018/11/06 14:54:00 UTC
[jira] [Created] (SPARK-25951) Redundant shuffle if column is renamed
Ohad Raviv created SPARK-25951:
----------------------------------
Summary: Redundant shuffle if column is renamed
Key: SPARK-25951
URL: https://issues.apache.org/jira/browse/SPARK-25951
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Ohad Raviv
We've noticed that sometimes a column rename causes an extra shuffle:
{code}
import org.apache.spark.sql.functions.{col, count, lit}

val N = 1 << 12
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")
t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
  .join(
    t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", "key3"),
    col("key1") === col("key3"))
  .explain(true)
{code}
results in:
{code}
== Physical Plan ==
*(6) SortMergeJoin [key1#6L], [key3#22L], Inner
:- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
:  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, cnt1#14L])
:     +- Exchange hashpartitioning(key1#6L, 2)
:        +- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], output=[key1#6L, count#39L])
:           +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
:              +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
:                 +- *(1) Range (0, 4096, step=1, splits=1)
+- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(key3#22L, 2)
      +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, cnt2#19L])
         +- Exchange hashpartitioning(key2#10L, 2)
            +- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], output=[key2#10L, count#41L])
               +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L]
                  +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0)))
                     +- *(3) Range (0, 4096, step=1, splits=1)
{code}
I was able to track it down to this code in class HashPartitioning:
{code}
case h: HashClusteredDistribution =>
  expressions.length == h.expressions.length && expressions.zip(h.expressions).forall {
    case (l, r) => l.semanticEquals(r)
  }
{code}
The semanticEquals check returns false because it compares key2 with key3, even though key3 is just a rename of key2 - so the aggregation's hash partitioning on key2 is not recognized as satisfying the join's required distribution on key3, and a second Exchange is planned.
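Until the planner handles such aliases, a possible workaround (a sketch against the same repro; a spark session is assumed in scope, as in spark-shell) is to apply the rename before the groupBy, so the aggregate's grouping attribute and the join key are the same attribute, semanticEquals succeeds, and the aggregation's Exchange is reused:

```scala
import org.apache.spark.sql.functions.{col, count, lit}

val N = 1 << 12
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")

// Rename BEFORE aggregating: the right side now groups on key3 directly,
// so the Exchange planned for the aggregate already satisfies the join's
// required distribution and no extra shuffle is inserted.
val joined = t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
  .join(
    t2.withColumnRenamed("key2", "key3")
      .groupBy("key3").agg(count(lit("1")).as("cnt2")),
    col("key1") === col("key3"))

joined.explain(true)
```

With this ordering the plan should contain only the two aggregation Exchanges, one per side, and no third Exchange under the right-hand Sort.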
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org