You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2016/09/16 23:02:20 UTC
[jira] [Created] (SPARK-17570) Avoid Hash and Exchange in Sort
Merge join if bucketing factor is multiple for tables
Tejas Patil created SPARK-17570:
-----------------------------------
Summary: Avoid Hash and Exchange in Sort Merge join if bucketing factor is multiple for tables
Key: SPARK-17570
URL: https://issues.apache.org/jira/browse/SPARK-17570
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Tejas Patil
Priority: Minor
In case of bucketed tables, Spark will avoid doing `Sort` and `Exchange` if the input tables and output table has same number of buckets. However, unequal bucketing will always lead to `Sort` and `Exchange`. If the number of buckets in the output table is a factor of the buckets in the input table, we should be able to avoid `Sort` and `Exchange` and directly join those.
eg.
Assume Input1, Input2 and Output be bucketed + sorted tables over the same columns but with different number of buckets. Input1 has 8 buckets, Input1 has 4 buckets and Output has 4 buckets. Since hash-partitioning is done using Modulus, if we JOIN buckets (0, 4) of Input1 and buckets (0, 4, 8) of Input2 in the same task, it would give the bucket 0 of output table.
{noformat}
Input1 (0, 4) (1, 3) (2, 5) (3, 7)
Input2 (0, 4, 8) (1, 3, 9) (2, 5, 10) (3, 7, 11)
Output (0) (1) (2) (3)
{noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org