Posted to issues@spark.apache.org by "Andrew Ash (JIRA)" <ji...@apache.org> on 2017/01/24 20:02:26 UTC

[jira] [Commented] (SPARK-11471) Improve the way that we plan shuffled join

    [ https://issues.apache.org/jira/browse/SPARK-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836547#comment-15836547 ] 

Andrew Ash commented on SPARK-11471:
------------------------------------

[~yhuai] I'm interested in helping make progress on this -- it's causing pretty significant slowness on a SQL query I have.  Have you had any more thoughts on this since you first filed?

For my slow query, the same SQL completes in a few minutes on another system (Teradata), whereas in Spark SQL it eventually fails after 20+ hours of computation.  By breaking the query up into subqueries (forcing the joins to happen in a different order) we can get the Spark SQL execution down to about 15 minutes.
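
To make "breaking up the query" concrete, here is a minimal sketch with hypothetical table and column names (a, b, c, k1, k2, x) standing in for the real 8-table query: join the first two tables on their own, materialize that intermediate, and then run the remaining joins against it.

{code}
// Assumes an existing SparkSession named `spark`; table/column names are
// placeholders, not the real schema.
val a = spark.table("a")
val b = spark.table("b")
val c = spark.table("c")

// Join A and B first and cache the result so the later joins start from the
// already-computed intermediate rather than being re-planned as one big join.
val ab = a.join(b, Seq("k1")).where("x > 0")
ab.cache()

// The remaining joins run against the materialized (A+B).
val abc = ab.join(c, Seq("k2"))
{code}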

The query joins 8 tables together with a number of where clauses, and the join tree ends up doing a large number of seemingly "extra" shuffles.  At the bottom of the join tree, tables A and B are both shuffled to the same distribution and then joined.  Then, when table C (the next table) comes in, it's shuffled to a new distribution and the result of (A+B) is reshuffled to match table C's distribution.  Potentially, table C could instead be shuffled to match the distribution of (A+B) so that (A+B) doesn't have to be reshuffled.
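
As a sketch of that shuffle pattern (again with hypothetical names, and assuming the tables are large enough that broadcast joins don't kick in), the extra exchange is visible directly in the physical plan:

{code}
// A and B are shuffled on k1 and joined; C is shuffled on k2, and the (A+B)
// result is shuffled again on k2 to match it.
val ab  = spark.table("a").join(spark.table("b"), Seq("k1"))
val abc = ab.join(spark.table("c"), Seq("k2"))

// Each "Exchange hashpartitioning(...)" operator in the printed plan is a
// shuffle; the reshuffle of (A+B) shows up as an extra Exchange between the
// two joins.
abc.explain()
{code}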

The general pattern here seems to be that N-way joins require N + (N-1) = 2N-1 shuffles (15 for the 8-table query above).  Some of those shuffles could potentially be eliminated with more intelligent join ordering and/or distribution selection.
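
Continuing the sketch above, a crude way to sanity-check the shuffle count for a given plan is to count the Exchange operators in the physical-plan text (this just greps the plan's string rendering, so treat it as a rough check rather than an API):

{code}
// Not a Spark API -- just counts shuffle Exchange lines in the plan text.
def countShuffles(df: org.apache.spark.sql.DataFrame): Int =
  df.queryExecution.executedPlan.toString
    .split("\n")
    .count(_.contains("Exchange hashpartitioning"))

// e.g. how many shuffles the 3-table sketch above actually plans:
println(countShuffles(abc))
{code}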

What do you think?

> Improve the way that we plan shuffled join
> ------------------------------------------
>
>                 Key: SPARK-11471
>                 URL: https://issues.apache.org/jira/browse/SPARK-11471
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>
> Right now, when adaptive query execution is enabled, in most cases we will shuffle input tables for every join. However, once we finish our work on https://issues.apache.org/jira/browse/SPARK-10665, we will be able to have a global view of the input datasets of a stage. Then, we should be able to add exchange coordinators after we get the entire physical plan (after the phase in which we add Exchanges).
> I will try to fill in more information later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org