You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Herman van Hövell (Jira)" <ji...@apache.org> on 2020/02/13 19:11:00 UTC

[jira] [Assigned] (SPARK-30751) Combine the skewed readers into one in AQE skew join optimizations

     [ https://issues.apache.org/jira/browse/SPARK-30751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hövell reassigned SPARK-30751:
-----------------------------------------

    Assignee: Wenchen Fan

> Combine the skewed readers into one in AQE skew join optimizations
> ------------------------------------------------------------------
>
>                 Key: SPARK-30751
>                 URL: https://issues.apache.org/jira/browse/SPARK-30751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wei Xue
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Assume we have N partitions based on the original join keys, and for a specific partition id {{Pi}} (i = 1 to N), we slice the left partition into {{Li}} sub-partitions (L = 1 if no skew; L > 1 if skewed), the right partition into {{Mi}} sub-partitions (M = 1 if no skew; M > 1 if skewed). With the current approach, we’ll end up with a sum of {{Li}} * {{Mi}} (i = 1 to N where Li > 1 or Mi > 1) plus one (for the rest of the partitions without skew) joins. *This can be a serious performance concern as the size of the query plan now depends on the number and size of skewed partitions.*
> Now instead of generating so many joins we can create a “repeated” reader for either side of the join so that:
>  # for the left side, with each partition id Pi and any given slice {{Sj}} in {{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with respective join keys as {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTm}}
>  # for the right side, with each partition id Pi and any given slice {{Tk}} in {{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with respective join keys as {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSlTk}}
> That way, we can have one SMJ for all the partitions and only one type of special reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org