You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2019/05/29 08:15:00 UTC

[jira] [Resolved] (SPARK-27829) In Dataset.joinWith inner joins, don't nest data before shuffling

     [ https://issues.apache.org/jira/browse/SPARK-27829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-27829.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24693
[https://github.com/apache/spark/pull/24693]

> In Dataset.joinWith inner joins, don't nest data before shuffling
> -----------------------------------------------------------------
>
>                 Key: SPARK-27829
>                 URL: https://issues.apache.org/jira/browse/SPARK-27829
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In order to support outer joins with null top-level objects, SPARK-15441 modified Dataset.joinWith to project both inputs into single-column structs prior to the join.
> For inner joins, however, this step is unnecessary and actually harms performance: performing the nesting before the join increases the shuffled data size. As an optimization for inner joins only, we can move this nesting to occur after the join (effectively switching back to the pre- SPARK-15441 behavior). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org