You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Wensheng Wang (Jira)" <ji...@apache.org> on 2021/03/09 23:10:00 UTC
[jira] [Commented] (SPARK-32399) Support full outer join in shuffled hash join

    [ https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298414#comment-17298414 ] 

Wensheng Wang commented on SPARK-32399:
---------------------------------------

[~chengsu] [~cloud_fan] 

It seems there is a potential bug in the current implementation for FullOuterJoin in ShuffledHashJoinExec:

The implementation of FullOuterJoin always assumes the left of "JoinedRow" is the left input of SHJ, vice versa. This is different from other Joins' implementation which assumes the left of "JoinedRow" is the streaming side and the right is the build side. However all these join implementations share the same "boundCondition“ which assumes the left of "JoinedRow" is the streaming side and the right is the build side. This could cause the FullOuterJoin evaluates non-equi join conditions in the wrong way. 

To verify this, simply remove the "if (joinType != FullOuter)" condition in the OuterJoinSuite which tests FullOuter join implementation for SHJ, it fails with the output. Please help to confirm. Thanks.

 

!Screen Shot 2021-03-09 at 3.06.30 PM.png!

> Support full outer join in shuffled hash join
> ---------------------------------------------
>
>                 Key: SPARK-32399
>                 URL: https://issues.apache.org/jira/browse/SPARK-32399
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Cheng Su
>            Assignee: Cheng Su
>            Priority: Minor
>             Fix For: 3.1.0
>
>         Attachments: Screen Shot 2020-10-14 at 11.08.37 PM.png, Screen Shot 2020-10-14 at 12.30.07 PM.png, Screen Shot 2021-03-09 at 3.06.30 PM.png
>
>
> Currently for SQL full outer join, spark always does a sort merge join no matter of how large the join children size are. Inspired by recent discussion in [https://github.com/apache/spark/pull/29130#discussion_r456502678] and [https://github.com/apache/spark/pull/29181], I think we can support full outer join in shuffled hash join in a way that - when looking up stream side keys from build side {{HashedRelation}}. Mark this info inside build side {{HashedRelation}}, and after reading all rows from stream side, output all non-matching rows from build side based on modified {{HashedRelation}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org