You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/29 02:12:52 UTC

[GitHub] [spark] sumeetgajjar opened a new pull request #35047: [SPARK-37175][SQL] Performance improvement to hash joins with many duplicate keys

sumeetgajjar opened a new pull request #35047:
URL: https://github.com/apache/spark/pull/35047

### What changes were proposed in this pull request?

This PR aims at improving performance for Hash joins with many duplicate keys.

A HashedRelation uses a map underneath to store rows against a corresponding key. A LongToUnsafeRowMap is used by LongHashedRelation and a BytesToBytesMap is used by UnsafeHashedRelation.
We propose to reorder the underlying map thereby placing all the rows for a given key adjacent in the memory to improve the spatial locality while iterating over them in the stream side of the join.

This is achieved in the following steps:
- creating another copy of the underlying map
- for all keys in the existing map
- get the corresponding rows
- insert all the rows for the given key at once in the new map
- use the new map for look-ups

This optimization can be enabled by specifying `spark.sql.hashedRelationReorderFactor=<value>`.
Once the condition `number of rows >= number of unique keys * above value` is satisfied for the underlying map, the optimization will kick in.

### Why are the changes needed?

There is no order maintained when the rows are added to the underlying map, thus for a given key, the corresponding rows are typically non-adjacent in memory, resulting in a poor spatial locality. Placing the rows for adjacent in memory yields a performance boost thereby reducing execution time.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Modified existing unit tests to run against the suggested improvement.
- Added a couple of cases to test the scenarios when the improvement throws an exception due to insufficient memory.
- Added a micro-benchmark that clearly indicates performance improvements when there are duplicate keys.
- Ran the four example queries mentioned in the JIRA in spark-sql as a final check for performance improvement.

### Credits
This work is based on the initial idea proposed by @bersprockets.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org