You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/28 16:43:32 UTC

[GitHub] [spark] advancedxy commented on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

advancedxy commented on pull request #32210:
URL: https://github.com/apache/spark/pull/32210#issuecomment-828606822


   > > Per my knowledge I don't know any obviously efficient way to do random lookup join with spilled hash map.
   > > How do we minimize random disk read for spilled map? Happy to brainstorm more if you have some rough ideas.
   > 
   > `Hybrid hash join` was designed to address this issue:
   > 
   > * https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
   > * https://cs-people.bu.edu/mathan/reading-groups/papers-classics/join.pdf
   > 
   > In short, it's not to spill an entire, huge hash map onto disk, but combines in-memory hash join with data partitioning/spilling. It's a standard algorithm that been implemented in many query engines. Similarly, you could implement hash aggregation in the same way, which should be more efficient than the runtime fallback approach too.
   > 
   > IMO, either (1) AQE or (2) conservative static QO decisions (e.g., planning shuffled hash join less aggressively) might be some low-hanging fruits, if they can address the issue. But if indeed neither could work for some of your queries, supporting spilling could be viable option to go.
   
   IMO, AQE is the low-hanging fruits for joins with shuffle. We did some benchmark on TPC-DS(SMJ v.s AQE auto converting SMJ-> SHJ), and saw obviously performance gain(~20%). We are launching this strategy in our ad-hoc workload.
   
   For SHJ falling back to SMJ,  I believe it's a viable way.  The SMJ is battle-tested and robust to most cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org