Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2016/01/08 04:12:39 UTC
[jira] [Created] (SPARK-12704) we may repartition a relation even if it's not needed
Wenchen Fan created SPARK-12704:
-----------------------------------
Summary: we may repartition a relation even if it's not needed
Key: SPARK-12704
URL: https://issues.apache.org/jira/browse/SPARK-12704
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Wenchen Fan
The implementation of {{HashPartitioning.compatibleWith}} has been wrong for a while. Think of the following case:
If {{table_a}} is hash partitioned by int column `i`, and {{table_b}} is also hash partitioned by int column `i`, these 2 partitionings are logically compatible. However, {{HashPartitioning.compatibleWith}} returns false for this case, because the {{AttributeReference}}s of column `i` in the 2 tables have different expr ids.
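The failure mode can be sketched with a toy model (these are simplified stand-in classes, not Spark's actual {{HashPartitioning}} or {{AttributeReference}}): case-class equality includes the expr id, so an equality-based compatibility check rejects two partitionings that are semantically the same.

```scala
// Hypothetical minimal model illustrating the bug: the same logical column
// in two different tables carries different expression IDs, so a check based
// on plain equality of the expressions reports "incompatible".
case class AttributeReference(name: String, exprId: Long)

case class HashPartitioning(expressions: Seq[AttributeReference], numPartitions: Int) {
  // Buggy check: case-class equality compares exprId too.
  def compatibleWith(other: HashPartitioning): Boolean =
    numPartitions == other.numPartitions && expressions == other.expressions

  // Semantic check (using column names as a stand-in for semantic equality):
  // ignores expression IDs.
  def semanticallyCompatibleWith(other: HashPartitioning): Boolean =
    numPartitions == other.numPartitions &&
      expressions.map(_.name) == other.expressions.map(_.name)
}

object CompatDemo {
  def main(args: Array[String]): Unit = {
    val aI = AttributeReference("i", exprId = 1L) // `i` from table_a
    val bI = AttributeReference("i", exprId = 2L) // `i` from table_b
    val pa = HashPartitioning(Seq(aI), numPartitions = 8)
    val pb = HashPartitioning(Seq(bI), numPartitions = 8)
    println(pa.compatibleWith(pb))             // false: expr ids differ
    println(pa.semanticallyCompatibleWith(pb)) // true: same column, same layout
  }
}
```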
Because of this wrong result from {{HashPartitioning.compatibleWith}}, we go into [this branch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L390] and may add an unnecessary shuffle.
This won't impact correctness if the join keys are exactly the same as the hash partitioning keys, as there's still an opportunity to not repartition that child in that branch: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L428
However, if the join keys are a super-set of the hash partitioning keys, for example, {{table_a}} and {{table_b}} are both hash partitioned by column `i` and we want to join them on columns `i, j`, logically we don't need a shuffle, but in fact the 2 tables, which start out partitioned only by `i`, are redundantly repartitioned by `i, j`.
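Why the super-set case needs no shuffle can be checked with a small sketch (a toy hash, not Spark's partitioner): any two rows that agree on the full join key `(i, j)` necessarily agree on `i`, so under hash partitioning by `i` alone they already land in the same partition.

```scala
// Toy demonstration: rows that match on the full join key (i, j) are already
// co-located when both sides are hash partitioned by i only, so repartitioning
// by (i, j) gains nothing.
object SupersetKeyDemo {
  // Simple non-negative hash-based partition assignment (illustrative only).
  def partitionOf(i: Int, numPartitions: Int): Int =
    ((i.hashCode % numPartitions) + numPartitions) % numPartitions

  def main(args: Array[String]): Unit = {
    val left  = Seq((1, 10), (2, 20), (3, 30)) // (i, j) rows of table_a
    val right = Seq((1, 10), (2, 99), (3, 30)) // (i, j) rows of table_b
    val n = 4
    // For every pair of rows matching on both i and j, check co-location.
    val coLocated = for {
      (li, lj) <- left
      (ri, rj) <- right
      if li == ri && lj == rj
    } yield partitionOf(li, n) == partitionOf(ri, n)
    println(coLocated.forall(identity)) // true: no shuffle needed for the join
  }
}
```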