Posted to issues@spark.apache.org by "Ohad Raviv (JIRA)" <ji...@apache.org> on 2018/11/06 14:54:00 UTC
[jira] [Created] (SPARK-25951) Redundant shuffle if column is renamed
Ohad Raviv created SPARK-25951:
----------------------------------
Summary: Redundant shuffle if column is renamed
Key: SPARK-25951
URL: https://issues.apache.org/jira/browse/SPARK-25951
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Ohad Raviv
We've noticed that sometimes a column rename causes an extra shuffle:
{code}
import org.apache.spark.sql.functions.{col, count, lit}

val N = 1 << 12
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")
t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
  .join(
    t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", "key3"),
    col("key1") === col("key3"))
  .explain(true)
{code}
results in:
{code}
== Physical Plan ==
*(6) SortMergeJoin [key1#6L], [key3#22L], Inner
:- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
:  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, cnt1#14L])
:     +- Exchange hashpartitioning(key1#6L, 2)
:        +- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], output=[key1#6L, count#39L])
:           +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
:              +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
:                 +- *(1) Range (0, 4096, step=1, splits=1)
+- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(key3#22L, 2)
      +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, cnt2#19L])
         +- Exchange hashpartitioning(key2#10L, 2)
            +- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], output=[key2#10L, count#41L])
               +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L]
                  +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0)))
                     +- *(3) Range (0, 4096, step=1, splits=1)
{code}
I was able to track it down to this code in class HashPartitioning:
{code}
case h: HashClusteredDistribution =>
  expressions.length == h.expressions.length && expressions.zip(h.expressions).forall {
    case (l, r) => l.semanticEquals(r)
  }
{code}
The semanticEquals check returns false because it compares key2 with key3, even though key3 is just a rename of key2 - so the aggregation's hash partitioning on key2 is not recognized as satisfying the join's required distribution on key3, and a second Exchange is planned.
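Until the planner handles such aliases, a possible workaround (a sketch against the same repro; a spark session is assumed in scope, as in spark-shell) is to apply the rename before the groupBy, so the aggregate's grouping attribute and the join key are the same attribute, semanticEquals succeeds, and the aggregation's Exchange is reused:

```scala
import org.apache.spark.sql.functions.{col, count, lit}

val N = 1 << 12
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")

// Rename BEFORE aggregating: the right side now groups on key3 directly,
// so the Exchange planned for the aggregate already satisfies the join's
// required distribution and no extra shuffle is inserted.
val joined = t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
  .join(
    t2.withColumnRenamed("key2", "key3")
      .groupBy("key3").agg(count(lit("1")).as("cnt2")),
    col("key1") === col("key3"))

joined.explain(true)
```

With this ordering the plan should contain only the two aggregation Exchanges, one per side, and no third Exchange under the right-hand Sort.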
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org