Posted to issues@spark.apache.org by "Sumeet (Jira)" <ji...@apache.org> on 2021/11/01 02:27:00 UTC

[jira] [Commented] (SPARK-37175) Performance improvement to hash joins with many duplicate keys

    [ https://issues.apache.org/jira/browse/SPARK-37175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436588#comment-17436588 ] 

Sumeet commented on SPARK-37175:
--------------------------------

I am working on this.

> Performance improvement to hash joins with many duplicate keys
> --------------------------------------------------------------
>
>                 Key: SPARK-37175
>                 URL: https://issues.apache.org/jira/browse/SPARK-37175
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Bruce Robbins
>            Priority: Major
>         Attachments: hash_rel_examples.txt
>
>
> I noticed that HashedRelations with many duplicate keys perform significantly slower than HashedRelations with a similar number of entries but few or no duplicate keys.
> A hypothesis:
>  * Because of the order in which rows are appended to the map, rows for a given key are typically non-adjacent in memory, resulting in poor locality.
>  * The map would perform better if all rows for a given key were next to each other in memory (a sketch of this idea follows the list).
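> To make the hypothesis concrete, here is a minimal, self-contained sketch (hypothetical classes, not the actual HashedRelation internals): an append-order map scatters a key's rows across the buffer, while a rebuilt map packs each key's rows into one contiguous range.
> {code:scala}
> // Hypothetical sketch only -- not the real HashedRelation code.
> import scala.collection.mutable
>
> object LocalitySketch {
>   type Row = Array[Byte]
>
>   // Append-order map: rows land wherever they arrive, so one key's rows
>   // are usually scattered across `rows`.
>   val rows  = mutable.ArrayBuffer.empty[(Long, Row)]          // (key, row) in arrival order
>   val index = mutable.LongMap.empty[mutable.ArrayBuffer[Int]] // key -> positions in `rows`
>
>   def append(key: Long, row: Row): Unit = {
>     index.getOrElseUpdate(key, mutable.ArrayBuffer.empty) += rows.length
>     rows += ((key, row))
>   }
>
>   // Reorganization: copy rows key by key so each key's rows are adjacent,
>   // then answer lookups with a contiguous (start, length) range.
>   def reorganize(): (Array[Row], mutable.LongMap[(Int, Int)]) = {
>     val packed = new Array[Row](rows.length)
>     val ranges = mutable.LongMap.empty[(Int, Int)]
>     var cursor = 0
>     index.foreach { case (key, positions) =>
>       val start = cursor
>       positions.foreach { i => packed(cursor) = rows(i)._2; cursor += 1 }
>       ranges(key) = (start, cursor - start)
>     }
>     (packed, ranges)
>   }
> }
> {code}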
> To test this hypothesis, I made a [somewhat brute force change to HashedRelation|https://github.com/apache/spark/compare/master...bersprockets:hash_rel_play] to reorganize the map such that all rows for a given key are adjacent in memory. This yielded some performance improvements, at least in my contrived examples:
> (Run on an Intel-based MacBook Pro with 4 cores/8 hyperthreads):
> Example 1:
>  Shuffled Hash Join, LongHashedRelation:
>  Stream side: 300M rows
>  Build side: 90M rows, but only 200K unique keys
>  136G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled hash join (No reorganization)|1092| |
> |Shuffled hash join (with reorganization)|234|4.6 times faster than regular SHJ|
> |Sort merge join|164|This beats the SHJ when there are lots of duplicate keys, presumably because of better cache locality on both sides of the join|
> Example 2:
>  Broadcast Hash Join, LongHashedRelation:
>  Stream side: 350M rows
>  Build side: 9M rows, but only 18K unique keys
>  175G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|872| |
> |Broadcast hash join (with reorganization)|263|3 times faster than regular BHJ|
> |Sort merge join|174|This beats the BHJ when there are lots of duplicate keys, presumably because of better cache locality on both sides of the join|
> Example 3:
>  Shuffled Hash Join, UnsafeHashedRelation:
>  Stream side: 300M rows
>  Build side: 90M rows, but only 200K unique keys
>  135G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled hash join (No reorganization)|3154| |
> |Shuffled hash join (with reorganization)|533|5.9 times faster|
> |Sort merge join|190|This beats the SHJ when there are lots of duplicate keys, presumably because of better cache locality on both sides of the join|
> Example 4:
>  Broadcast Hash Join, UnsafeHashedRelation:
>  Stream side: 70M rows
>  Build side: 9M rows, but only 18K unique keys
>  35G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|849| |
> |Broadcast hash join (with reorganization)|130|6.5 times faster|
> |Sort merge join|46|This beats the BHJ when there are lots of duplicate keys, presumably because of better cache locality on both sides of the join|
> The code for these examples is attached here: [^hash_rel_examples.txt].
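> For readers without the attachment handy, a query of roughly the same shape as Example 1 (300M-row stream side, 90M-row build side folded onto only 200K distinct keys, forced to a shuffled hash join) can be written as below; this is an illustrative sketch, not the contents of [^hash_rel_examples.txt].
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().appName("dup-key-join").getOrCreate()
> import spark.implicits._
>
> // Disable broadcast so the planner does not pick a broadcast hash join.
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
>
> // Stream side: 300M rows; build side: 90M rows with only 200K distinct keys.
> val stream = spark.range(300L * 1000 * 1000).select(($"id" % 200000).as("key"), $"id".as("payload"))
> val build  = spark.range(90L * 1000 * 1000).select(($"id" % 200000).as("key"))
>
> // Hint the shuffled hash join (build side hinted) and force execution via the no-op sink.
> val joined = stream.join(build.hint("shuffle_hash"), "key")
> joined.write.format("noop").mode("overwrite").save()
> {code}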
> Even the brute force approach could be useful in production if it:
>  * Is toggled by a feature flag
>  * Reorganizes only if the ratio of keys to rows drops below some threshold
>  * Falls back to using the original map if building the new map results in a memory-related SparkException (a sketch of this gating logic follows the list).
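> A rough sketch of that gating logic (the helper, flag, and threshold are made up for illustration; only SparkException is a real Spark class):
> {code:scala}
> import org.apache.spark.SparkException
>
> // Hypothetical gating helper -- none of these names exist in Spark today.
> def maybeReorganize[M](
>     original: M,
>     numKeys: Long,
>     numRows: Long,
>     enabled: Boolean,                 // the feature flag
>     maxKeyToRowRatio: Double = 0.01   // reorganize only when keys/rows is small
>   )(rebuild: M => M): M = {
>   val ratio = if (numRows == 0) 1.0 else numKeys.toDouble / numRows
>   if (!enabled || ratio >= maxKeyToRowRatio) {
>     original                          // few duplicates: reorganizing is not worth the extra copy
>   } else {
>     try rebuild(original)
>     catch {
>       // Memory-related failure while building the new map: keep the original.
>       case _: SparkException | _: OutOfMemoryError => original
>     }
>   }
> }
> {code}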
> Another incidental lesson is that sort merge join seems to outperform both broadcast and shuffled hash joins when the build side has lots of duplicate keys. So maybe a long-term improvement would be to avoid hash joins (broadcast or shuffled) if there are many duplicate keys.


