Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/17 19:05:10 UTC

[GitHub] [spark] c21 opened a new pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

c21 opened a new pull request #34034:
URL: https://github.com/apache/spark/pull/34034

### What changes were proposed in this pull request?

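For LEFT SEMI and LEFT ANTI hash equi-joins with no extra join condition, the hash table (`HashedRelation`) only needs to keep one row per unique join key, because these joins only test whether a matching key exists on the build side and never output build-side rows. Dropping the duplicated keys while building the hash table reduces the size of the join's hash table.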
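
To illustrate the idea, here is a minimal plain-Python sketch. It is a model of the technique only, not Spark's `HashedRelation` code: the function names and the `ignore_duplicated_key` flag are hypothetical, rows are plain tuples, and null-key handling is omitted. It shows that LEFT SEMI and LEFT ANTI results are unchanged when duplicated build-side keys are dropped, and why the shortcut is not safe once an extra join condition is present.

```python
from collections import defaultdict


def build_relation(rows, key_of, ignore_duplicated_key=False):
    """Build a key -> list-of-rows hash table from the build side.

    `ignore_duplicated_key` is a hypothetical flag for this sketch: when set,
    only the first row seen for each key is stored.
    """
    table = defaultdict(list)
    for row in rows:
        key = key_of(row)
        if ignore_duplicated_key and key in table:
            continue  # a row with this key is already stored; skip the duplicate
        table[key].append(row)
    return dict(table)


def left_semi(left, table, key_of):
    """Keep left rows whose key exists in the hash table."""
    return [row for row in left if key_of(row) in table]


def left_anti(left, table, key_of):
    """Keep left rows whose key does not exist in the hash table."""
    return [row for row in left if key_of(row) not in table]


def left_semi_with_condition(left, table, key_of, condition):
    """LEFT SEMI with an extra condition: every matching build row may matter."""
    return [
        l for l in left
        if any(condition(l, r) for r in table.get(key_of(l), []))
    ]


def first_field(row):
    return row[0]


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "x"), (1, "y"), (1, "z"), (3, "w")]

full = build_relation(right, first_field)
dedup = build_relation(right, first_field, ignore_duplicated_key=True)

# Same join results, but fewer rows stored on the build side.
semi_full = left_semi(left, full, first_field)
semi_dedup = left_semi(left, dedup, first_field)
anti_full = left_anti(left, full, first_field)
anti_dedup = left_anti(left, dedup, first_field)
rows_full = sum(len(v) for v in full.values())
rows_dedup = sum(len(v) for v in dedup.values())

# With an extra condition (left value < right value), dropping duplicates
# can discard the only build row that satisfies the condition.
left_v = [(1, 5)]
right_v = [(1, 3), (1, 9)]
cond = lambda l, r: l[1] < r[1]
with_all_rows = left_semi_with_condition(
    left_v, build_relation(right_v, first_field), first_field, cond)
with_dedup = left_semi_with_condition(
    left_v, build_relation(right_v, first_field, True), first_field, cond)
```

In this model the deduplicated table stores 2 rows instead of 4 while `semi_dedup` and `anti_dedup` match the results from the full table. In the extra-condition case, `with_dedup` loses the row that `with_all_rows` keeps, which is why the optimization is restricted to joins without an extra join condition.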

This PR adds the optimization to `UnsafeHashedRelation`, for both broadcast hash join and shuffled hash join. The same optimization for `LongHashedRelation` is left for a follow-up, because it requires further changes to the underlying hash table data structure, `LongToUnsafeRowMap`, so that it can check whether a key already exists in the hash table.

### Why are the changes needed?

This reduces the hash table size of LEFT SEMI and LEFT ANTI joins.
It increases the chance that these queries can use a broadcast join, and lowers the risk of OOM in shuffled hash join.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test in `JoinSuite.scala`.

