
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/26 04:58:54 UTC

[GitHub] [spark] sigmod edited a comment on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

sigmod edited a comment on pull request #32210:
URL: https://github.com/apache/spark/pull/32210#issuecomment-826500384

> Per my knowledge I don't know any obviously efficient way to do random lookup join with spilled hash map.
> How do we minimize random disk read for spilled map? Happy to brainstorm more if you have some rough ideas.

`Hybrid hash join` was designed to address this issue:
- https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
- https://cs-people.bu.edu/mathan/reading-groups/papers-classics/join.pdf

In short, rather than spilling an entire, huge hash map to disk, it combines in-memory hash join with data partitioning and spilling. It's a standard algorithm that has been implemented in many query engines. Similarly, you could implement hash aggregation in the same way, which should also be more efficient than the runtime fallback approach.
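
To make the idea concrete, here is a rough, standalone sketch of hybrid hash join in Python. It is not Spark code and not what this PR implements. The function name `hybrid_hash_join` and the parameters `num_partitions` and `resident_partitions` are made up for illustration. A real engine would size the resident partitions from its memory budget and would recursively repartition a spilled partition that still does not fit; both are omitted here.

```python
import pickle
import tempfile
from collections import defaultdict


def _read_spilled(f):
    """Rewind a spill file and yield the rows pickled into it, in order."""
    f.seek(0)
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def hybrid_hash_join(build, probe, key, num_partitions=4, resident_partitions=1):
    """Inner equi-join sketch: yields (build_row, probe_row) pairs with equal keys.

    Partitions [0, resident_partitions) stay in memory; the rest are spilled.
    All names and defaults here are illustrative assumptions.
    """
    def part(row):
        return hash(key(row)) % num_partitions

    spilled = range(resident_partitions, num_partitions)

    # Build phase: resident partitions go straight into the in-memory hash
    # table; every other partition is appended sequentially to its own file.
    table = defaultdict(list)
    build_spill = {p: tempfile.TemporaryFile() for p in spilled}
    for row in build:
        p = part(row)
        if p < resident_partitions:
            table[key(row)].append(row)
        else:
            pickle.dump(row, build_spill[p])

    # Probe phase: rows that hash to a resident partition are joined right
    # away; the others are spilled to the matching probe-side partition file.
    probe_spill = {p: tempfile.TemporaryFile() for p in spilled}
    for row in probe:
        p = part(row)
        if p < resident_partitions:
            for b in table.get(key(row), ()):
                yield b, row
        else:
            pickle.dump(row, probe_spill[p])

    # Spilled phase: join one partition pair at a time. Each file is read
    # sequentially exactly once, so there are no random disk lookups.
    for p in spilled:
        table = defaultdict(list)
        for b in _read_spilled(build_spill[p]):
            table[key(b)].append(b)
        for row in _read_spilled(probe_spill[p]):
            for b in table.get(key(row), ()):
                yield b, row
        build_spill[p].close()
        probe_spill[p].close()
```

The point of the sketch is that disk access is always sequential: rows are appended to partition files and later scanned once, and only one partition's hash table needs to fit in memory at a time.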

IMO, either (1) AQE or (2) conservative static QO decisions (e.g., planning shuffled hash join less aggressively) might be low-hanging fruit, if they can address the issue. But if neither works for some of your queries, supporting spilling could be a viable option.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org