You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2016/01/08 04:15:39 UTC
[jira] [Commented] (SPARK-12704) we may repartition a relation even
it's not needed
[ https://issues.apache.org/jira/browse/SPARK-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088635#comment-15088635 ]
Wenchen Fan commented on SPARK-12704:
-------------------------------------
cc [~joshrosen] [~nongli] [~marmbrus] [~yhuai] [~rxin]
> we may repartition a relation even it's not needed
> --------------------------------------------------
>
> Key: SPARK-12704
> URL: https://issues.apache.org/jira/browse/SPARK-12704
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
>
> The implementation of {{HashPartitioning.compatibleWith}} has been wrong for a while. Think of the following case:
> if {{table_a}} is hash partitioned by int column `i`, and {{table_b}} is also partitioned by int column `i`, logically these 2 partitionings are compatible. However, {{HashPartitioning.compatibleWith}} will return false for this case as the {{AttributeReference}} of column `i` between these 2 tables have different expr ids.
> With this wrong result of {{HashPartitioning.compatibleWith}}, we will go into [this branch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L390] and may add unnecessary shuffle.
> This won't impact correctness if the join keys are exactly the same with hash partitioning keys, as there’s still an opportunity to not partition that child in that branch: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L428
> However, if the join keys are a super-set of hash partitioning keys, for example, {{table_a}} and {{table_b}} are both hash partitioned by column `i`, and we wanna join them using column `i, j`, logically we don't need shuffle but in fact the 2 tables start out as partitioned only by `i` and is redundantly repartitioned by `i, j`.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org