Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/30 00:12:37 UTC

[GitHub] [spark] c21 opened a new pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

c21 opened a new pull request #34444:
URL: https://github.com/apache/spark/pull/34444


   
   ### What changes were proposed in this pull request?
    This PR adds code-gen support for FULL OUTER shuffled hash join.
    
    The main change is in `ShuffledHashJoinExec.scala:doProduce()`, which now generates code for the FULL OUTER join case.
    * `ShuffledHashJoinExec.scala:codegenFullOuterJoinWithUniqueKey()` generates the join code for the case where the build-side join key is unique.
    * `ShuffledHashJoinExec.scala:codegenFullOuterJoinWithNonUniqueKey()` generates the join code for the case where the build-side join key is non-unique.
   
   Example query:
   
   ```
   val df1 = spark.range(5).select($"id".as("k1"))
   val df2 = spark.range(10).select($"id".as("k2"))
   df1.join(df2.hint("SHUFFLE_HASH"), $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2", "full_outer")
   ```
   
   Generated code for example query: https://gist.github.com/c21/828b782ee81827f4148939cb50314a7b
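    
    To inspect the generated code locally, here is a minimal sketch (assuming a `spark-shell` session; `debugCodegen()` comes from `org.apache.spark.sql.execution.debug`):
    
    ```
    // Sketch: print the whole-stage generated code for the example FULL OUTER
    // shuffled hash join query above.
    import org.apache.spark.sql.execution.debug._
    import spark.implicits._
    
    val df1 = spark.range(5).select($"id".as("k1"))
    val df2 = spark.range(10).select($"id".as("k2"))
    val joined = df1.join(df2.hint("SHUFFLE_HASH"),
      $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2", "full_outer")
    
    // Prints each whole-stage codegen subtree together with its generated Java
    // source; the shuffled hash join shows up in one of the subtrees.
    joined.debugCodegen()
    ```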
   
   ### Why are the changes needed?
   Improve query performance for FULL OUTER shuffled hash join.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
    * Added unit test in `WholeStageCodegenSuite` (a hedged sketch of this kind of check is shown below the list).
   * Existing unit test in `OuterJoinSuite`.
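    
    A hedged sketch of the kind of check such a whole-stage-codegen test typically makes (hypothetical test body reusing the `joined` DataFrame from the sketch above; the actual test added in `WholeStageCodegenSuite` may differ):
    
    ```
    // Hypothetical sketch, not the test added in this PR: assert that the full
    // outer shuffled hash join is planned inside a WholeStageCodegenExec node.
    import org.apache.spark.sql.execution.WholeStageCodegenExec
    import org.apache.spark.sql.execution.joins.ShuffledHashJoinExec
    
    val plan = joined.queryExecution.executedPlan
    val codegenJoin = plan.find {
      case w: WholeStageCodegenExec =>
        w.collectFirst { case j: ShuffledHashJoinExec => j }.isDefined
      case _ => false
    }
    assert(codegenJoin.isDefined)
    ```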


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955111466


   **[Test build #144771 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144771/testReport)** for PR 34444 at commit [`e787a36`](https://github.com/apache/spark/commit/e787a362bf82c6ed464a858586880f0a2cca6e62).




[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955156454


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49245/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956882474








[GitHub] [spark] AmplabJenkins removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955123457


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49240/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955157743


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49245/
   




[GitHub] [spark] c21 commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r741525180



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       @cloud-fan - sorry if I am missing anything. `anyNull` returns whether the current input row has any null field (`UnsafeRow.anyNull()`), so the result varies for each input row in the while loop.






[GitHub] [spark] SparkQA removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955144498


   **[Test build #144776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144776/testReport)** for PR 34444 at commit [`e787a36`](https://github.com/apache/spark/commit/e787a362bf82c6ed464a858586880f0a2cca6e62).




[GitHub] [spark] c21 commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r741612543



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       @cloud-fan - no worries, thanks for the careful review.






[GitHub] [spark] c21 commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955825020


   @cloud-fan could you help take a look when you have time? Thanks.




[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740786516



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       Isn't `anyNull` called more frequently than `keyIsUnique`? We call `anyNull` in the while loop.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956882474








[GitHub] [spark] cloud-fan commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-958626996


   thanks, merging to master!





[GitHub] [spark] AmplabJenkins commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956882474


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49289/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955143080


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144771/
   




[GitHub] [spark] cloud-fan closed pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #34444:
URL: https://github.com/apache/spark/pull/34444


   





[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740045221



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"

Review comment:
       This is so short. I think we can just inline it, instead of creating a variable and passing it to the methods.
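       
       A minimal sketch of the suggested inlining (reusing the variables already in scope in the codegen helper; a sketch, not necessarily the final PR code):
       
       ```
       // Hedged sketch of the suggestion: drop the `stopCheck` parameter and emit
       // the short snippet literally where the generated loop is built.
       val joinStreamSide =
         s"""
            |while ($streamedInput.hasNext()) {
            |  $streamedRow = (InternalRow) $streamedInput.next();
            |  // ... key lookup and condition check as in the diff above ...
            |  $consumeFullOuterJoinRow();
            |  if (shouldStop()) return;
            |}
          """.stripMargin
       ```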






[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740038668



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))

Review comment:
       `rewriteKeyExpr` does not change the expression references. I think we can just do `streamedKeys.map(_.references)` here.
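       
       A minimal sketch of the suggested simplification (same names as in `doProduce` above; a sketch, not necessarily the final PR code):
       
       ```
       // Hedged sketch of the reviewer's suggestion: rewriteKeyExpr preserves
       // attribute references, so the AttributeSet can be built from streamedKeys
       // directly instead of going through HashJoin.rewriteKeyExpr first.
       val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
         AttributeSet.fromAttributeSets(streamedKeys.map(_.references)))
       ```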






[GitHub] [spark] c21 commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740534742



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"
+
+    val joinWithUniqueKey = codegenFullOuterJoinWithUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+    val joinWithNonUniqueKey = codegenFullOuterJoinWithNonUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+
+    s"""
+       |if ($keyIsUnique) {
+       |  $joinWithUniqueKey
+       |} else {
+       |  $joinWithNonUniqueKey
+       |}
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with unique join keys.
+   * This is code-gen version of `fullOuterJoinWithUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedKeySetClsName = classOf[BitSet].getName
+    val matchedKeySet = ctx.addMutableState(matchedKeySetClsName, "matchedKeySet",
+      v => s"$v = new $matchedKeySetClsName($relationTerm.maxNumKeysIndex());", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val foundMatch = ctx.freshName("foundMatch")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val joinStreamSide =
+      s"""
+         |while ($streamedInput.hasNext()) {
+         |  $streamedRow = (InternalRow) $streamedInput.next();
+         |
+         |  // generate join key for stream side
+         |  $streamedKeyEv
+         |
+         |  // find matches from HashedRelation
+         |  boolean $foundMatch = false;
+         |  $buildRow = null;
+         |  $rowWithIndexClsName $rowWithIndex = $streamedKeyAnyNull ? null:
+         |    $relationTerm.getValueWithKeyIndex($streamedKeyValue);
+         |
+         |  if ($rowWithIndex != null) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    // check join condition
+         |    $conditionCheck {
+         |      // set key index in matched keys set
+         |      $matchedKeySet.set($rowWithIndex.getKeyIndex());
+         |      $foundMatch = true;
+         |    }
+         |
+         |    if (!$foundMatch) {
+         |      $buildRow = null;
+         |    }
+         |  }
+         |
+         |  $consumeFullOuterJoinRow();
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    val filterBuildSide =
+      s"""
+         |$streamedRow = null;
+         |
+         |// find non-matched rows from HashedRelation
+         |while ($buildInput.hasNext()) {
+         |  $rowWithIndexClsName $rowWithIndex = ($rowWithIndexClsName) $buildInput.next();
+         |
+         |  // check if key index is not in matched keys set
+         |  if (!$matchedKeySet.get($rowWithIndex.getKeyIndex())) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    $consumeFullOuterJoinRow();
+         |  }
+         |
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    s"""
+       |$joinStreamSide
+       |$filterBuildSide
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with non-unique join keys.
+   * This is code-gen version of `fullOuterJoinWithNonUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithNonUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedRowSetClsName = classOf[OpenHashSet[_]].getName
+    val matchedRowSet = ctx.addMutableState(matchedRowSetClsName, "matchedRowSet",
+      v => s"$v = new $matchedRowSetClsName(scala.reflect.ClassTag$$.MODULE$$.Long());",
+      forceInline = true)
+    val prevKeyIndex = ctx.addMutableState("int", "prevKeyIndex",
+      v => s"$v = -1;", forceInline = true)
+    val valueIndex = ctx.addMutableState("int", "valueIndex",
+      v => s"$v = -1;", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val buildIterator = ctx.freshName("buildIterator")
+    val foundMatch = ctx.freshName("foundMatch")
+    val keyIndex = ctx.freshName("keyIndex")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val rowIndex = s"(((long)$keyIndex) << 32) | $valueIndex"
+    val markRowMatched = s"$matchedRowSet.add($rowIndex);"

Review comment:
       @cloud-fan - updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       I want to avoid a per-row call of `UnsafeHashedRelation.keyIsUnique()`, so I store the result of the call in the `keyIsUnique` variable and call it only once for all rows (example - [line 079 of `shj_keyIsUnique_0`](https://gist.github.com/c21/828b782ee81827f4148939cb50314a7b)). Although the `UnsafeHashedRelation.keyIsUnique()` call is very cheap now (just `binaryMap.numKeys() == binaryMap.numValues()`), caching it prevents a future bug in case the call becomes expensive when we change the implementation later, and it also reads better.
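       
       For reference, a sketch juxtaposing the two pieces of the diff this thread is about (names as in `doProduce` above): the relation-level `keyIsUnique()` result is cached once per task via inlined mutable state, while the row-level `anyNull()` check is emitted inside the generated while loop and re-evaluated for every streamed row.
       
       ```
       // keyIsUnique(): relation-level, so its result is cached once in the
       // generated class's init code and only the cached boolean is branched on.
       val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
         v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
       
       // anyNull(): row-level, so the call is embedded in the per-row key lookup
       // that ends up inside the generated while loop.
       val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
       
       // The dispatch between the two codegen paths happens once, on the cached flag.
       s"""
          |if ($keyIsUnique) {
          |  $joinWithUniqueKey
          |} else {
          |  $joinWithNonUniqueKey
          |}
        """.stripMargin
       ```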

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))

Review comment:
       @cloud-fan - yes, thanks, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"

Review comment:
       @cloud-fan - agree, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"
+
+    val joinWithUniqueKey = codegenFullOuterJoinWithUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+    val joinWithNonUniqueKey = codegenFullOuterJoinWithNonUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+
+    s"""
+       |if ($keyIsUnique) {
+       |  $joinWithUniqueKey
+       |} else {
+       |  $joinWithNonUniqueKey
+       |}
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with unique join keys.
+   * This is code-gen version of `fullOuterJoinWithUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedKeySetClsName = classOf[BitSet].getName
+    val matchedKeySet = ctx.addMutableState(matchedKeySetClsName, "matchedKeySet",
+      v => s"$v = new $matchedKeySetClsName($relationTerm.maxNumKeysIndex());", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val foundMatch = ctx.freshName("foundMatch")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val joinStreamSide =
+      s"""
+         |while ($streamedInput.hasNext()) {
+         |  $streamedRow = (InternalRow) $streamedInput.next();
+         |
+         |  // generate join key for stream side
+         |  $streamedKeyEv
+         |
+         |  // find matches from HashedRelation
+         |  boolean $foundMatch = false;
+         |  $buildRow = null;
+         |  $rowWithIndexClsName $rowWithIndex = $streamedKeyAnyNull ? null:
+         |    $relationTerm.getValueWithKeyIndex($streamedKeyValue);
+         |
+         |  if ($rowWithIndex != null) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    // check join condition
+         |    $conditionCheck {
+         |      // set key index in matched keys set
+         |      $matchedKeySet.set($rowWithIndex.getKeyIndex());
+         |      $foundMatch = true;
+         |    }
+         |
+         |    if (!$foundMatch) {
+         |      $buildRow = null;
+         |    }
+         |  }
+         |
+         |  $consumeFullOuterJoinRow();
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    val filterBuildSide =
+      s"""
+         |$streamedRow = null;
+         |
+         |// find non-matched rows from HashedRelation
+         |while ($buildInput.hasNext()) {
+         |  $rowWithIndexClsName $rowWithIndex = ($rowWithIndexClsName) $buildInput.next();
+         |
+         |  // check if key index is not in matched keys set
+         |  if (!$matchedKeySet.get($rowWithIndex.getKeyIndex())) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    $consumeFullOuterJoinRow();
+         |  }
+         |
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    s"""
+       |$joinStreamSide
+       |$filterBuildSide
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with non-unique join keys.
+   * This is code-gen version of `fullOuterJoinWithNonUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithNonUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedRowSetClsName = classOf[OpenHashSet[_]].getName
+    val matchedRowSet = ctx.addMutableState(matchedRowSetClsName, "matchedRowSet",
+      v => s"$v = new $matchedRowSetClsName(scala.reflect.ClassTag$$.MODULE$$.Long());",
+      forceInline = true)
+    val prevKeyIndex = ctx.addMutableState("int", "prevKeyIndex",
+      v => s"$v = -1;", forceInline = true)
+    val valueIndex = ctx.addMutableState("int", "valueIndex",
+      v => s"$v = -1;", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val buildIterator = ctx.freshName("buildIterator")
+    val foundMatch = ctx.freshName("foundMatch")
+    val keyIndex = ctx.freshName("keyIndex")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val rowIndex = s"(((long)$keyIndex) << 32) | $valueIndex"
+    val markRowMatched = s"$matchedRowSet.add($rowIndex);"
+    val isRowMatched = s"$matchedRowSet.contains($rowIndex)"

Review comment:
       @cloud-fan - updated.






[GitHub] [spark] cloud-fan closed pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #34444:
URL: https://github.com/apache/spark/pull/34444


   




[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955111466


   **[Test build #144771 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144771/testReport)** for PR 34444 at commit [`e787a36`](https://github.com/apache/spark/commit/e787a362bf82c6ed464a858586880f0a2cca6e62).




[GitHub] [spark] Tagar commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
Tagar commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956538097


   Just out of curiosity, how much performance gain does this code generation bring?




[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740046205



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"
+
+    val joinWithUniqueKey = codegenFullOuterJoinWithUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+    val joinWithNonUniqueKey = codegenFullOuterJoinWithNonUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+
+    s"""
+       |if ($keyIsUnique) {
+       |  $joinWithUniqueKey
+       |} else {
+       |  $joinWithNonUniqueKey
+       |}
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with unique join keys.
+   * This is code-gen version of `fullOuterJoinWithUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedKeySetClsName = classOf[BitSet].getName
+    val matchedKeySet = ctx.addMutableState(matchedKeySetClsName, "matchedKeySet",
+      v => s"$v = new $matchedKeySetClsName($relationTerm.maxNumKeysIndex());", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val foundMatch = ctx.freshName("foundMatch")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val joinStreamSide =
+      s"""
+         |while ($streamedInput.hasNext()) {
+         |  $streamedRow = (InternalRow) $streamedInput.next();
+         |
+         |  // generate join key for stream side
+         |  $streamedKeyEv
+         |
+         |  // find matches from HashedRelation
+         |  boolean $foundMatch = false;
+         |  $buildRow = null;
+         |  $rowWithIndexClsName $rowWithIndex = $streamedKeyAnyNull ? null:
+         |    $relationTerm.getValueWithKeyIndex($streamedKeyValue);
+         |
+         |  if ($rowWithIndex != null) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    // check join condition
+         |    $conditionCheck {
+         |      // set key index in matched keys set
+         |      $matchedKeySet.set($rowWithIndex.getKeyIndex());
+         |      $foundMatch = true;
+         |    }
+         |
+         |    if (!$foundMatch) {
+         |      $buildRow = null;
+         |    }
+         |  }
+         |
+         |  $consumeFullOuterJoinRow();
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    val filterBuildSide =
+      s"""
+         |$streamedRow = null;
+         |
+         |// find non-matched rows from HashedRelation
+         |while ($buildInput.hasNext()) {
+         |  $rowWithIndexClsName $rowWithIndex = ($rowWithIndexClsName) $buildInput.next();
+         |
+         |  // check if key index is not in matched keys set
+         |  if (!$matchedKeySet.get($rowWithIndex.getKeyIndex())) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    $consumeFullOuterJoinRow();
+         |  }
+         |
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    s"""
+       |$joinStreamSide
+       |$filterBuildSide
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with non-unique join keys.
+   * This is code-gen version of `fullOuterJoinWithNonUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithNonUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedRowSetClsName = classOf[OpenHashSet[_]].getName
+    val matchedRowSet = ctx.addMutableState(matchedRowSetClsName, "matchedRowSet",
+      v => s"$v = new $matchedRowSetClsName(scala.reflect.ClassTag$$.MODULE$$.Long());",
+      forceInline = true)
+    val prevKeyIndex = ctx.addMutableState("int", "prevKeyIndex",
+      v => s"$v = -1;", forceInline = true)
+    val valueIndex = ctx.addMutableState("int", "valueIndex",
+      v => s"$v = -1;", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val buildIterator = ctx.freshName("buildIterator")
+    val foundMatch = ctx.freshName("foundMatch")
+    val keyIndex = ctx.freshName("keyIndex")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val rowIndex = s"(((long)$keyIndex) << 32) | $valueIndex"
+    val markRowMatched = s"$matchedRowSet.add($rowIndex);"

Review comment:
       ditto, let's inline the short code.
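
   For context, the `rowIndex` expression quoted above packs the build row's key index and value index into a single 64-bit value, so matched build rows can be remembered in an `OpenHashSet[Long]`. A minimal Scala sketch of the same packing (the names are illustrative, not from the PR):

   ```
   // Sketch only: mirrors the generated `(((long) keyIndex) << 32) | valueIndex`
   // expression, assuming both indices are non-negative ints.
   def packRowIndex(keyIndex: Int, valueIndex: Int): Long =
     (keyIndex.toLong << 32) | (valueIndex & 0xFFFFFFFFL)

   def unpackKeyIndex(rowIndex: Long): Int = (rowIndex >>> 32).toInt
   def unpackValueIndex(rowIndex: Long): Int = rowIndex.toInt

   // packRowIndex(3, 7) == 0x0000000300000007L; the matched-row set then only has to
   // store one Long per (key index, value index) pair.
   ```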






[GitHub] [spark] c21 commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956587980


   > Just out of curiosity, how much performance gain does this code generation bring?
   
   @Tagar - I ran a small micro-benchmark (similar to Spark's `JoinBenchmark.scala`), and it shows a ~10-20% runtime improvement for the given query. I did notice that the improvement varies from run to run, and I don't expect real production queries to gain as much, so please take the numbers only as a rough reference.
   
   ```
   val N: Long = 4 << 20
   withSQLConf(
     SQLConf.SHUFFLE_PARTITIONS.key -> "2",
     SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "10000000",
     SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
     codegenBenchmark("shuffle hash join", N) {
       val df1 = spark.range(N).selectExpr(s"id as k1")
       val df2 = spark.range(N / 3).selectExpr(s"id * 3 as k2")
       val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
       assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[ShuffledHashJoinExec]).isDefined)
       df.noop()
     }
   }
   
   Running benchmark: shuffle hash join
     Running case: shuffle hash join wholestage off
     Stopped after 2 iterations, 3051 ms
     Running case: shuffle hash join wholestage on
     Stopped after 5 iterations, 6638 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   shuffle hash join wholestage off                   1519           1526          10          2.8         362.2       1.0X
   shuffle hash join wholestage on                    1273           1328          70          3.3         303.4       1.2X
   ```




[GitHub] [spark] c21 commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r741612543



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       @cloud-fan - no worries, thanks for the careful review.






[GitHub] [spark] SparkQA removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956639649


   **[Test build #144819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144819/testReport)** for PR 34444 at commit [`5dcd5db`](https://github.com/apache/spark/commit/5dcd5db632185d3b53a1179ae8dd1cbc46003bb1).




[GitHub] [spark] c21 commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r741525180



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       @cloud-fan - sorry if I am missing anything, but `anyNull` returns whether any field of the current input row is null (`UnsafeRow.anyNull()`). The result varies for each input row in the while loop. Just to make sure we are on the same page, I am referring to the `anyNull` calls in [lines 094 & 157 of the example code-gen](https://gist.github.com/c21/828b782ee81827f4148939cb50314a7b).
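
   In interpreted terms, the `anyNull()` guard encodes SQL's null semantics for equi-join keys: a streamed row whose join key contains a null can never match a build-side key, so the generated code skips the `HashedRelation` lookup and later emits that row with nulls on the build side. A rough Scala sketch of the equivalent logic (illustrative names, not Spark's API):

   ```
   // Sketch only: skip the hash lookup when any join-key column is null, since a
   // null key cannot equal any build-side key under SQL equi-join semantics.
   def lookupBuildRow[V](streamedKey: Seq[Any], hashTable: Map[Seq[Any], V]): Option[V] =
     if (streamedKey.exists(_ == null)) None   // row is still output, with a null build side
     else hashTable.get(streamedKey)

   // lookupBuildRow(Seq(1, null), Map(Seq[Any](1, 2) -> "build row")) returns None
   ```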






[GitHub] [spark] c21 commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740534742



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"
+
+    val joinWithUniqueKey = codegenFullOuterJoinWithUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+    val joinWithNonUniqueKey = codegenFullOuterJoinWithNonUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+
+    s"""
+       |if ($keyIsUnique) {
+       |  $joinWithUniqueKey
+       |} else {
+       |  $joinWithNonUniqueKey
+       |}
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with unique join keys.
+   * This is code-gen version of `fullOuterJoinWithUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedKeySetClsName = classOf[BitSet].getName
+    val matchedKeySet = ctx.addMutableState(matchedKeySetClsName, "matchedKeySet",
+      v => s"$v = new $matchedKeySetClsName($relationTerm.maxNumKeysIndex());", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val foundMatch = ctx.freshName("foundMatch")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val joinStreamSide =
+      s"""
+         |while ($streamedInput.hasNext()) {
+         |  $streamedRow = (InternalRow) $streamedInput.next();
+         |
+         |  // generate join key for stream side
+         |  $streamedKeyEv
+         |
+         |  // find matches from HashedRelation
+         |  boolean $foundMatch = false;
+         |  $buildRow = null;
+         |  $rowWithIndexClsName $rowWithIndex = $streamedKeyAnyNull ? null:
+         |    $relationTerm.getValueWithKeyIndex($streamedKeyValue);
+         |
+         |  if ($rowWithIndex != null) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    // check join condition
+         |    $conditionCheck {
+         |      // set key index in matched keys set
+         |      $matchedKeySet.set($rowWithIndex.getKeyIndex());
+         |      $foundMatch = true;
+         |    }
+         |
+         |    if (!$foundMatch) {
+         |      $buildRow = null;
+         |    }
+         |  }
+         |
+         |  $consumeFullOuterJoinRow();
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    val filterBuildSide =
+      s"""
+         |$streamedRow = null;
+         |
+         |// find non-matched rows from HashedRelation
+         |while ($buildInput.hasNext()) {
+         |  $rowWithIndexClsName $rowWithIndex = ($rowWithIndexClsName) $buildInput.next();
+         |
+         |  // check if key index is not in matched keys set
+         |  if (!$matchedKeySet.get($rowWithIndex.getKeyIndex())) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    $consumeFullOuterJoinRow();
+         |  }
+         |
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    s"""
+       |$joinStreamSide
+       |$filterBuildSide
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with non-unique join keys.
+   * This is code-gen version of `fullOuterJoinWithNonUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithNonUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedRowSetClsName = classOf[OpenHashSet[_]].getName
+    val matchedRowSet = ctx.addMutableState(matchedRowSetClsName, "matchedRowSet",
+      v => s"$v = new $matchedRowSetClsName(scala.reflect.ClassTag$$.MODULE$$.Long());",
+      forceInline = true)
+    val prevKeyIndex = ctx.addMutableState("int", "prevKeyIndex",
+      v => s"$v = -1;", forceInline = true)
+    val valueIndex = ctx.addMutableState("int", "valueIndex",
+      v => s"$v = -1;", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val buildIterator = ctx.freshName("buildIterator")
+    val foundMatch = ctx.freshName("foundMatch")
+    val keyIndex = ctx.freshName("keyIndex")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val rowIndex = s"(((long)$keyIndex) << 32) | $valueIndex"
+    val markRowMatched = s"$matchedRowSet.add($rowIndex);"

Review comment:
       @cloud-fan - updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       I want to avoid a per-row call to `UnsafeHashedRelation.keyIsUnique()`, so I store the result in the `keyIsUnique` variable and call the method only once for all rows (example - [line 079, `shj_keyIsUnique_0`](https://gist.github.com/c21/828b782ee81827f4148939cb50314a7b)). The `UnsafeHashedRelation.keyIsUnique()` call is very cheap today (just `binaryMap.numKeys() == binaryMap.numValues()`), but caching it prevents a future bug if the call ever becomes expensive after an implementation change, and it also reads better.
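
   The resulting shape is a single check up front rather than per row: the generated code evaluates `keyIsUnique` once when the operator is initialized and then branches into one of the two specialized join loops. A rough Scala sketch of that shape (assumed names, not the generated Java):

   ```
   // Sketch only: evaluate the per-task property once and branch outside the row loop,
   // instead of calling keyIsUnique() for every streamed row.
   def fullOuterJoinSketch(
       keyIsUniqueOnce: () => Boolean,          // stands in for HashedRelation.keyIsUnique()
       streamedRows: Iterator[AnyRef],
       joinWithUniqueKey: AnyRef => Unit,
       joinWithNonUniqueKey: AnyRef => Unit): Unit = {
     val keyIsUnique = keyIsUniqueOnce()        // called once, not per row
     if (keyIsUnique) streamedRows.foreach(joinWithUniqueKey)
     else streamedRows.foreach(joinWithNonUniqueKey)
   }
   ```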

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))

Review comment:
       @cloud-fan - yes, thanks, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"

Review comment:
       @cloud-fan - agree, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"
+
+    val joinWithUniqueKey = codegenFullOuterJoinWithUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+    val joinWithNonUniqueKey = codegenFullOuterJoinWithNonUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+
+    s"""
+       |if ($keyIsUnique) {
+       |  $joinWithUniqueKey
+       |} else {
+       |  $joinWithNonUniqueKey
+       |}
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with unique join keys.
+   * This is code-gen version of `fullOuterJoinWithUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedKeySetClsName = classOf[BitSet].getName
+    val matchedKeySet = ctx.addMutableState(matchedKeySetClsName, "matchedKeySet",
+      v => s"$v = new $matchedKeySetClsName($relationTerm.maxNumKeysIndex());", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val foundMatch = ctx.freshName("foundMatch")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val joinStreamSide =
+      s"""
+         |while ($streamedInput.hasNext()) {
+         |  $streamedRow = (InternalRow) $streamedInput.next();
+         |
+         |  // generate join key for stream side
+         |  $streamedKeyEv
+         |
+         |  // find matches from HashedRelation
+         |  boolean $foundMatch = false;
+         |  $buildRow = null;
+         |  $rowWithIndexClsName $rowWithIndex = $streamedKeyAnyNull ? null:
+         |    $relationTerm.getValueWithKeyIndex($streamedKeyValue);
+         |
+         |  if ($rowWithIndex != null) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    // check join condition
+         |    $conditionCheck {
+         |      // set key index in matched keys set
+         |      $matchedKeySet.set($rowWithIndex.getKeyIndex());
+         |      $foundMatch = true;
+         |    }
+         |
+         |    if (!$foundMatch) {
+         |      $buildRow = null;
+         |    }
+         |  }
+         |
+         |  $consumeFullOuterJoinRow();
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    val filterBuildSide =
+      s"""
+         |$streamedRow = null;
+         |
+         |// find non-matched rows from HashedRelation
+         |while ($buildInput.hasNext()) {
+         |  $rowWithIndexClsName $rowWithIndex = ($rowWithIndexClsName) $buildInput.next();
+         |
+         |  // check if key index is not in matched keys set
+         |  if (!$matchedKeySet.get($rowWithIndex.getKeyIndex())) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    $consumeFullOuterJoinRow();
+         |  }
+         |
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    s"""
+       |$joinStreamSide
+       |$filterBuildSide
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with non-unique join keys.
+   * This is code-gen version of `fullOuterJoinWithNonUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithNonUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedRowSetClsName = classOf[OpenHashSet[_]].getName
+    val matchedRowSet = ctx.addMutableState(matchedRowSetClsName, "matchedRowSet",
+      v => s"$v = new $matchedRowSetClsName(scala.reflect.ClassTag$$.MODULE$$.Long());",
+      forceInline = true)
+    val prevKeyIndex = ctx.addMutableState("int", "prevKeyIndex",
+      v => s"$v = -1;", forceInline = true)
+    val valueIndex = ctx.addMutableState("int", "valueIndex",
+      v => s"$v = -1;", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val buildIterator = ctx.freshName("buildIterator")
+    val foundMatch = ctx.freshName("foundMatch")
+    val keyIndex = ctx.freshName("keyIndex")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val rowIndex = s"(((long)$keyIndex) << 32) | $valueIndex"
+    val markRowMatched = s"$matchedRowSet.add($rowIndex);"
+    val isRowMatched = s"$matchedRowSet.contains($rowIndex)"

Review comment:
       @cloud-fan - updated.






[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r741594274



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       ah I misread the code, nvm






[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956639649


   **[Test build #144819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144819/testReport)** for PR 34444 at commit [`5dcd5db`](https://github.com/apache/spark/commit/5dcd5db632185d3b53a1179ae8dd1cbc46003bb1).




[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955144498


   **[Test build #144776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144776/testReport)** for PR 34444 at commit [`e787a36`](https://github.com/apache/spark/commit/e787a362bf82c6ed464a858586880f0a2cca6e62).




[GitHub] [spark] c21 commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955144426


   retest this please




[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r741594274



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       ah I misread the code, nvm






[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740786516



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       Isn't `anyNull` called more frequently than `keyIsUnique`? We call `anyNull` in the while loop.






[GitHub] [spark] c21 commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r741525180



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       @cloud-fan - sorry if I am missing anything, but `anyNull` returns whether the current input row contains any null (`UnsafeRow.anyNull()`), so the result varies for each input row in the while loop.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       @cloud-fan - sorry if I am missing anything, but `anyNull` returns whether the current input row contains any null (`UnsafeRow.anyNull()`), so the result varies for each input row in the while loop. Just to make sure we are on the same page, I am referring to the `anyNull` calls at lines 094 & 157 of the [example code-gen](https://gist.github.com/c21/828b782ee81827f4148939cb50314a7b).
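
   A minimal, hypothetical Scala sketch of the point above (the names `Relation`, `Row`, and `probeLoop` are illustrative stand-ins, not Spark classes or the actual generated code): `keyIsUnique` is a property of the hashed relation and therefore loop-invariant, so codegen can cache it once per task in a mutable state field, while the `anyNull` check depends on each streamed row's join key and has to be re-evaluated on every iteration.

   ```
   // Illustrative sketch only; Relation and Row stand in for the hashed relation and a streamed row.
   object KeyIsUniqueSketch {
     final case class Relation(keyIsUnique: Boolean)
     final case class Row(key: Option[Long])

     def probeLoop(relation: Relation, streamedRows: Iterator[Row]): Unit = {
       // Loop-invariant: computed once per task, hence kept in a field by codegen.
       val keyIsUnique = relation.keyIsUnique
       while (streamedRows.hasNext) {
         val row = streamedRows.next()
         // Row-dependent: re-evaluated for every streamed row, so there is nothing to cache.
         val streamedKeyAnyNull = row.key.isEmpty
         if (!streamedKeyAnyNull) {
           // ... probe the hashed relation, branching on keyIsUnique ...
         }
       }
     }
   }
   ```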

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       @cloud-fan - no worries, thanks for the careful review.






[GitHub] [spark] c21 commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956587980


   > Just out of curiosity, how much performance gain does this code generation bring?
   
   @Tagar - I ran a small micro benchmark (similar to Spark's `JoinBenchmark.scala`); it gives roughly a 10-20% runtime improvement for the given query. The improvement varies from run to run, and I don't expect real production queries to see as much improvement, so please take the numbers as a reference only.
   
   ```
   val N: Long = 4 << 20
   withSQLConf(
     SQLConf.SHUFFLE_PARTITIONS.key -> "2",
     SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "10000000",
     SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
     codegenBenchmark("shuffle hash join", N) {
       val df1 = spark.range(N).selectExpr(s"id as k1")
       val df2 = spark.range(N / 3).selectExpr(s"id * 3 as k2")
       val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
       assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[ShuffledHashJoinExec]).isDefined)
       df.noop()
     }
   }
   
   Running benchmark: shuffle hash join
     Running case: shuffle hash join wholestage off
     Stopped after 2 iterations, 3051 ms
     Running case: shuffle hash join wholestage on
     Stopped after 5 iterations, 6638 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   shuffle hash join wholestage off                   1519           1526          10          2.8         362.2       1.0X
   shuffle hash join wholestage on                    1273           1328          70          3.3         303.4       1.2X
   ```




[GitHub] [spark] c21 commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-958647134


   Thank you @cloud-fan for review!




[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-957035534


   **[Test build #144819 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144819/testReport)** for PR 34444 at commit [`5dcd5db`](https://github.com/apache/spark/commit/5dcd5db632185d3b53a1179ae8dd1cbc46003bb1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.





[GitHub] [spark] cloud-fan commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-958626996


   thanks, merging to master!





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955157743


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49245/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955143080


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144771/
   





[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740041930



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       why does `keyIsUnique` need a mutable state but `anyNull` does not?







[GitHub] [spark] AmplabJenkins removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956882474


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49289/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955123457


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49240/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955179168


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144776/
   





[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955139519


   **[Test build #144771 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144771/testReport)** for PR 34444 at commit [`e787a36`](https://github.com/apache/spark/commit/e787a362bf82c6ed464a858586880f0a2cca6e62).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.





[GitHub] [spark] AmplabJenkins commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-957042169


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144819/
   





[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956852154


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49289/
   




[GitHub] [spark] c21 commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-958647134


   Thank you @cloud-fan for the review!




[GitHub] [spark] c21 commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740534742



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"
+
+    val joinWithUniqueKey = codegenFullOuterJoinWithUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+    val joinWithNonUniqueKey = codegenFullOuterJoinWithNonUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+
+    s"""
+       |if ($keyIsUnique) {
+       |  $joinWithUniqueKey
+       |} else {
+       |  $joinWithNonUniqueKey
+       |}
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with unique join keys.
+   * This is code-gen version of `fullOuterJoinWithUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedKeySetClsName = classOf[BitSet].getName
+    val matchedKeySet = ctx.addMutableState(matchedKeySetClsName, "matchedKeySet",
+      v => s"$v = new $matchedKeySetClsName($relationTerm.maxNumKeysIndex());", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val foundMatch = ctx.freshName("foundMatch")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val joinStreamSide =
+      s"""
+         |while ($streamedInput.hasNext()) {
+         |  $streamedRow = (InternalRow) $streamedInput.next();
+         |
+         |  // generate join key for stream side
+         |  $streamedKeyEv
+         |
+         |  // find matches from HashedRelation
+         |  boolean $foundMatch = false;
+         |  $buildRow = null;
+         |  $rowWithIndexClsName $rowWithIndex = $streamedKeyAnyNull ? null:
+         |    $relationTerm.getValueWithKeyIndex($streamedKeyValue);
+         |
+         |  if ($rowWithIndex != null) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    // check join condition
+         |    $conditionCheck {
+         |      // set key index in matched keys set
+         |      $matchedKeySet.set($rowWithIndex.getKeyIndex());
+         |      $foundMatch = true;
+         |    }
+         |
+         |    if (!$foundMatch) {
+         |      $buildRow = null;
+         |    }
+         |  }
+         |
+         |  $consumeFullOuterJoinRow();
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    val filterBuildSide =
+      s"""
+         |$streamedRow = null;
+         |
+         |// find non-matched rows from HashedRelation
+         |while ($buildInput.hasNext()) {
+         |  $rowWithIndexClsName $rowWithIndex = ($rowWithIndexClsName) $buildInput.next();
+         |
+         |  // check if key index is not in matched keys set
+         |  if (!$matchedKeySet.get($rowWithIndex.getKeyIndex())) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    $consumeFullOuterJoinRow();
+         |  }
+         |
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    s"""
+       |$joinStreamSide
+       |$filterBuildSide
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with non-unique join keys.
+   * This is code-gen version of `fullOuterJoinWithNonUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithNonUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedRowSetClsName = classOf[OpenHashSet[_]].getName
+    val matchedRowSet = ctx.addMutableState(matchedRowSetClsName, "matchedRowSet",
+      v => s"$v = new $matchedRowSetClsName(scala.reflect.ClassTag$$.MODULE$$.Long());",
+      forceInline = true)
+    val prevKeyIndex = ctx.addMutableState("int", "prevKeyIndex",
+      v => s"$v = -1;", forceInline = true)
+    val valueIndex = ctx.addMutableState("int", "valueIndex",
+      v => s"$v = -1;", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val buildIterator = ctx.freshName("buildIterator")
+    val foundMatch = ctx.freshName("foundMatch")
+    val keyIndex = ctx.freshName("keyIndex")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val rowIndex = s"(((long)$keyIndex) << 32) | $valueIndex"
+    val markRowMatched = s"$matchedRowSet.add($rowIndex);"

Review comment:
       @cloud-fan - updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"

Review comment:
       I want to avoid a per-row call of `UnsafeHashedRelation.keyIsUnique()`, so the result of the call is stored in the `keyIsUnique` variable and evaluated only once for all rows (example - [line 079, `shj_keyIsUnique_0`](https://gist.github.com/c21/828b782ee81827f4148939cb50314a7b)). The `UnsafeHashedRelation.keyIsUnique()` call is very cheap right now (just `binaryMap.numKeys() == binaryMap.numValues()`), but caching it guards against a future bug if the call ever becomes expensive after an implementation change, and the generated code also reads better.
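
       A minimal standalone Scala sketch of the pattern described above (this is not the generated Java nor Spark's API; the `Relation` case class, its fields, and the two `process*Row` placeholders are hypothetical stand-ins): evaluate the uniqueness check once per task and branch on the cached result.

       ```
       object KeyIsUniqueCachingSketch {
         // Hypothetical stand-in for the hashed relation; keyIsUnique is cheap today,
         // but caching its result guards against it becoming expensive later.
         final case class Relation(keys: Seq[Long], values: Seq[Long]) {
           def keyIsUnique: Boolean = keys.distinct.size == values.size
         }

         def fullOuterJoin(streamedRows: Iterator[Long], relation: Relation): Unit = {
           val keyIsUnique = relation.keyIsUnique // single call per task, like shj_keyIsUnique_0
           if (keyIsUnique) {
             streamedRows.foreach(processUniqueKeyRow)
           } else {
             streamedRows.foreach(processNonUniqueKeyRow)
           }
         }

         // Placeholders for the two join bodies emitted by the real codegen.
         private def processUniqueKeyRow(row: Long): Unit = ()
         private def processNonUniqueKeyRow(row: Long): Unit = ()
       }
       ```

       The shape mirrors the generated `if (shj_keyIsUnique_0) { ... } else { ... }` wrapper around the two join loops.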

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))

Review comment:
       @cloud-fan - yes, thanks, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"

Review comment:
       @cloud-fan - agree, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"
+
+    val joinWithUniqueKey = codegenFullOuterJoinWithUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+    val joinWithNonUniqueKey = codegenFullOuterJoinWithNonUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+
+    s"""
+       |if ($keyIsUnique) {
+       |  $joinWithUniqueKey
+       |} else {
+       |  $joinWithNonUniqueKey
+       |}
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with unique join keys.
+   * This is code-gen version of `fullOuterJoinWithUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedKeySetClsName = classOf[BitSet].getName
+    val matchedKeySet = ctx.addMutableState(matchedKeySetClsName, "matchedKeySet",
+      v => s"$v = new $matchedKeySetClsName($relationTerm.maxNumKeysIndex());", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val foundMatch = ctx.freshName("foundMatch")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val joinStreamSide =
+      s"""
+         |while ($streamedInput.hasNext()) {
+         |  $streamedRow = (InternalRow) $streamedInput.next();
+         |
+         |  // generate join key for stream side
+         |  $streamedKeyEv
+         |
+         |  // find matches from HashedRelation
+         |  boolean $foundMatch = false;
+         |  $buildRow = null;
+         |  $rowWithIndexClsName $rowWithIndex = $streamedKeyAnyNull ? null:
+         |    $relationTerm.getValueWithKeyIndex($streamedKeyValue);
+         |
+         |  if ($rowWithIndex != null) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    // check join condition
+         |    $conditionCheck {
+         |      // set key index in matched keys set
+         |      $matchedKeySet.set($rowWithIndex.getKeyIndex());
+         |      $foundMatch = true;
+         |    }
+         |
+         |    if (!$foundMatch) {
+         |      $buildRow = null;
+         |    }
+         |  }
+         |
+         |  $consumeFullOuterJoinRow();
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    val filterBuildSide =
+      s"""
+         |$streamedRow = null;
+         |
+         |// find non-matched rows from HashedRelation
+         |while ($buildInput.hasNext()) {
+         |  $rowWithIndexClsName $rowWithIndex = ($rowWithIndexClsName) $buildInput.next();
+         |
+         |  // check if key index is not in matched keys set
+         |  if (!$matchedKeySet.get($rowWithIndex.getKeyIndex())) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    $consumeFullOuterJoinRow();
+         |  }
+         |
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    s"""
+       |$joinStreamSide
+       |$filterBuildSide
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with non-unique join keys.
+   * This is code-gen version of `fullOuterJoinWithNonUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithNonUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedRowSetClsName = classOf[OpenHashSet[_]].getName
+    val matchedRowSet = ctx.addMutableState(matchedRowSetClsName, "matchedRowSet",
+      v => s"$v = new $matchedRowSetClsName(scala.reflect.ClassTag$$.MODULE$$.Long());",
+      forceInline = true)
+    val prevKeyIndex = ctx.addMutableState("int", "prevKeyIndex",
+      v => s"$v = -1;", forceInline = true)
+    val valueIndex = ctx.addMutableState("int", "valueIndex",
+      v => s"$v = -1;", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val buildIterator = ctx.freshName("buildIterator")
+    val foundMatch = ctx.freshName("foundMatch")
+    val keyIndex = ctx.freshName("keyIndex")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val rowIndex = s"(((long)$keyIndex) << 32) | $valueIndex"
+    val markRowMatched = s"$matchedRowSet.add($rowIndex);"
+    val isRowMatched = s"$matchedRowSet.contains($rowIndex)"

Review comment:
       @cloud-fan - updated.
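
       As a side note on the `rowIndex` encoding in the hunk above, here is a small standalone sketch, assuming non-negative key and value indexes (the object name and the explicit lower-32-bit mask are illustrative additions, not part of the PR):

       ```
       object RowIndexPackingSketch {
         // Pack a (keyIndex, valueIndex) pair into one Long, mirroring
         // `(((long) keyIndex) << 32) | valueIndex` in the generated code, so a
         // matched build-side row can be recorded in a set of Longs.
         def pack(keyIndex: Int, valueIndex: Int): Long =
           (keyIndex.toLong << 32) | (valueIndex.toLong & 0xFFFFFFFFL)

         // Recover the halves; handy only for sanity-checking the encoding.
         def keyIndexOf(packed: Long): Int = (packed >>> 32).toInt
         def valueIndexOf(packed: Long): Int = packed.toInt
       }
       ```

       For example, `pack(3, 7)` records value 7 under key index 3, and `keyIndexOf(pack(3, 7)) == 3`.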






[GitHub] [spark] Tagar commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
Tagar commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956538097


   Just out of curiosity, how much of a performance gain does this code generation bring?




[GitHub] [spark] cloud-fan closed pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #34444:
URL: https://github.com/apache/spark/pull/34444


   




[GitHub] [spark] AmplabJenkins commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956882474








[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955151191


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49245/
   




[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955177858


   **[Test build #144776 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144776/testReport)** for PR 34444 at commit [`e787a36`](https://github.com/apache/spark/commit/e787a362bf82c6ed464a858586880f0a2cca6e62).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740031392



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)

Review comment:
       `prepareRelation` returns `keyIsUnique`, so it is a constant before we generate the code. We don't need to get it at runtime in the generated code.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)

Review comment:
       `prepareRelation` returns `keyIsUnique`, so it is a constant before we generate the code. We don't need to get it at runtime in the generated code.
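
       To illustrate (a hedged sketch with made-up names, not the actual `doProduce` signature): when the flag is already known while generating code, the generator can emit just one of the two join bodies and skip the runtime `if (keyIsUnique)` check entirely.

       ```
       object CodegenBranchSketch {
         // keyIsUnique is a codegen-time constant here, so only one join body
         // string is emitted into the generated source.
         def produceFullOuterJoin(
             keyIsUnique: Boolean,
             uniqueKeyBody: String,
             nonUniqueKeyBody: String): String =
           if (keyIsUnique) uniqueKeyBody else nonUniqueKeyBody
       }
       ```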






[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956730810


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49289/
   




[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955118394


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49240/
   




[GitHub] [spark] SparkQA commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-955123449


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49240/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34444:
URL: https://github.com/apache/spark/pull/34444#discussion_r740046613



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -332,6 +327,266 @@ case class ShuffledHashJoinExec(
     HashedRelationInfo(relationTerm, keyIsUnique = false, isEmpty = false)
   }
 
+  override def doProduce(ctx: CodegenContext): String = {
+    // Specialize `doProduce` code for full outer join, because full outer join needs to
+    // iterate streamed and build side separately.
+    if (joinType != FullOuter) {
+      return super.doProduce(ctx)
+    }
+
+    val HashedRelationInfo(relationTerm, _, _) = prepareRelation(ctx)
+
+    // Inline mutable state since not many join operations in a task
+    val keyIsUnique = ctx.addMutableState("boolean", "keyIsUnique",
+      v => s"$v = $relationTerm.keyIsUnique();", forceInline = true)
+    val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
+      v => s"$v = inputs[0];", forceInline = true)
+    val buildInput = ctx.addMutableState("scala.collection.Iterator", "buildInput",
+      v => s"$v = $relationTerm.valuesWithKeyIndex();", forceInline = true)
+    val streamedRow = ctx.addMutableState("InternalRow", "streamedRow", forceInline = true)
+    val buildRow = ctx.addMutableState("InternalRow", "buildRow", forceInline = true)
+
+    // Generate variables and related code from streamed side
+    val streamedVars = genOneSideJoinVars(ctx, streamedRow, streamedPlan, setDefaultValue = false)
+    val streamedKeyVariables = evaluateRequiredVariables(streamedOutput, streamedVars,
+      AttributeSet.fromAttributeSets(HashJoin.rewriteKeyExpr(streamedKeys).map(_.references)))
+    ctx.currentVars = streamedVars
+    val streamedKeyExprCode = GenerateUnsafeProjection.createCode(ctx, streamedBoundKeys)
+    val streamedKeyEv =
+      s"""
+         |$streamedKeyVariables
+         |${streamedKeyExprCode.code}
+       """.stripMargin
+    val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
+
+    // Generate code for join condition
+    val (_, conditionCheck, _) =
+      getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, Some(buildRow))
+
+    // Generate code for result output in separate function, as we need to output result from
+    // multiple places in join code.
+    val streamedResultVars = genOneSideJoinVars(
+      ctx, streamedRow, streamedPlan, setDefaultValue = true)
+    val buildResultVars = genOneSideJoinVars(
+      ctx, buildRow, buildPlan, setDefaultValue = true)
+    val resultVars = buildSide match {
+      case BuildLeft => buildResultVars ++ streamedResultVars
+      case BuildRight => streamedResultVars ++ buildResultVars
+    }
+    val consumeFullOuterJoinRow = ctx.freshName("consumeFullOuterJoinRow")
+    ctx.addNewFunction(consumeFullOuterJoinRow,
+      s"""
+         |private void $consumeFullOuterJoinRow() {
+         |  ${metricTerm(ctx, "numOutputRows")}.add(1);
+         |  ${consume(ctx, resultVars)}
+         |}
+       """.stripMargin)
+    val stopCheck = "if (shouldStop()) return;"
+
+    val joinWithUniqueKey = codegenFullOuterJoinWithUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+    val joinWithNonUniqueKey = codegenFullOuterJoinWithNonUniqueKey(
+      ctx, (streamedRow, buildRow), (streamedInput, buildInput), streamedKeyEv, streamedKeyAnyNull,
+      streamedKeyExprCode.value, relationTerm, conditionCheck, stopCheck, consumeFullOuterJoinRow)
+
+    s"""
+       |if ($keyIsUnique) {
+       |  $joinWithUniqueKey
+       |} else {
+       |  $joinWithNonUniqueKey
+       |}
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with unique join keys.
+   * This is code-gen version of `fullOuterJoinWithUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedKeySetClsName = classOf[BitSet].getName
+    val matchedKeySet = ctx.addMutableState(matchedKeySetClsName, "matchedKeySet",
+      v => s"$v = new $matchedKeySetClsName($relationTerm.maxNumKeysIndex());", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val foundMatch = ctx.freshName("foundMatch")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val joinStreamSide =
+      s"""
+         |while ($streamedInput.hasNext()) {
+         |  $streamedRow = (InternalRow) $streamedInput.next();
+         |
+         |  // generate join key for stream side
+         |  $streamedKeyEv
+         |
+         |  // find matches from HashedRelation
+         |  boolean $foundMatch = false;
+         |  $buildRow = null;
+         |  $rowWithIndexClsName $rowWithIndex = $streamedKeyAnyNull ? null:
+         |    $relationTerm.getValueWithKeyIndex($streamedKeyValue);
+         |
+         |  if ($rowWithIndex != null) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    // check join condition
+         |    $conditionCheck {
+         |      // set key index in matched keys set
+         |      $matchedKeySet.set($rowWithIndex.getKeyIndex());
+         |      $foundMatch = true;
+         |    }
+         |
+         |    if (!$foundMatch) {
+         |      $buildRow = null;
+         |    }
+         |  }
+         |
+         |  $consumeFullOuterJoinRow();
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    val filterBuildSide =
+      s"""
+         |$streamedRow = null;
+         |
+         |// find non-matched rows from HashedRelation
+         |while ($buildInput.hasNext()) {
+         |  $rowWithIndexClsName $rowWithIndex = ($rowWithIndexClsName) $buildInput.next();
+         |
+         |  // check if key index is not in matched keys set
+         |  if (!$matchedKeySet.get($rowWithIndex.getKeyIndex())) {
+         |    $buildRow = $rowWithIndex.getValue();
+         |    $consumeFullOuterJoinRow();
+         |  }
+         |
+         |  $stopCheck
+         |}
+       """.stripMargin
+
+    s"""
+       |$joinStreamSide
+       |$filterBuildSide
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for full outer join with non-unique join keys.
+   * This is code-gen version of `fullOuterJoinWithNonUniqueKey()`.
+   */
+  private def codegenFullOuterJoinWithNonUniqueKey(
+      ctx: CodegenContext,
+      rows: (String, String),
+      inputs: (String, String),
+      streamedKeyEv: String,
+      streamedKeyAnyNull: String,
+      streamedKeyValue: ExprValue,
+      relationTerm: String,
+      conditionCheck: String,
+      stopCheck: String,
+      consumeFullOuterJoinRow: String): String = {
+    // Inline mutable state since not many join operations in a task
+    val matchedRowSetClsName = classOf[OpenHashSet[_]].getName
+    val matchedRowSet = ctx.addMutableState(matchedRowSetClsName, "matchedRowSet",
+      v => s"$v = new $matchedRowSetClsName(scala.reflect.ClassTag$$.MODULE$$.Long());",
+      forceInline = true)
+    val prevKeyIndex = ctx.addMutableState("int", "prevKeyIndex",
+      v => s"$v = -1;", forceInline = true)
+    val valueIndex = ctx.addMutableState("int", "valueIndex",
+      v => s"$v = -1;", forceInline = true)
+    val rowWithIndexClsName = classOf[ValueRowWithKeyIndex].getName
+    val rowWithIndex = ctx.freshName("rowWithIndex")
+    val buildIterator = ctx.freshName("buildIterator")
+    val foundMatch = ctx.freshName("foundMatch")
+    val keyIndex = ctx.freshName("keyIndex")
+    val (streamedRow, buildRow) = rows
+    val (streamedInput, buildInput) = inputs
+
+    val rowIndex = s"(((long)$keyIndex) << 32) | $valueIndex"
+    val markRowMatched = s"$matchedRowSet.add($rowIndex);"
+    val isRowMatched = s"$matchedRowSet.contains($rowIndex)"

Review comment:
       ditto






[GitHub] [spark] c21 commented on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-956587980


   > Just out of curiosity, how much performance gain this code generation brings?
   
   @Tagar - I ran a small micro-benchmark (similar to Spark's `JoinBenchmark.scala`); it shows roughly a 10-20% runtime improvement for the query below. The improvement varies from run to run, and I don't expect real production queries to gain that much, so please take the numbers only as a reference.
   
   ```
    val N: Long = 4 << 20
    withSQLConf(
      SQLConf.SHUFFLE_PARTITIONS.key -> "2",
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "10000000",
      SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
      codegenBenchmark("shuffle hash join", N) {
        val df1 = spark.range(N).selectExpr(s"id as k1")
        val df2 = spark.range(N / 3).selectExpr(s"id * 3 as k2")
        val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
        assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[ShuffledHashJoinExec]).isDefined)
        df.noop()
      }
    }
   
   Running benchmark: shuffle hash join
     Running case: shuffle hash join wholestage off
     Stopped after 2 iterations, 3051 ms
     Running case: shuffle hash join wholestage on
     Stopped after 5 iterations, 6638 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   shuffle hash join wholestage off                   1519           1526          10          2.8         362.2       1.0X
   shuffle hash join wholestage on                    1273           1328          70          3.3         303.4       1.2X
   ```




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34444: [SPARK-32567][SQL] Add code-gen for full outer shuffled hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34444:
URL: https://github.com/apache/spark/pull/34444#issuecomment-957042169


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144819/
   

