Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/08 09:16:00 UTC

[GitHub] [arrow-datafusion] Dandandan commented on issue #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats

Dandandan commented on issue #4139:
URL: https://github.com/apache/arrow-datafusion/issues/4139#issuecomment-1306880712

   Sounds like a good plan.
   
   For the hash join, this probably needs some benchmarking to find good defaults and avoid performance regressions. `CollectLeft` limits the amount of parallelization on the left side: building the hash table is relatively expensive and is done (at least currently) in a single thread. In quite a few cases it might be more beneficial to do a (local) hash repartitioning, which is relatively cheap.
   It also depends on the size of the probe (right) side: if that is, e.g., more than 100x as big as the left side, it might be beneficial to avoid hash repartitioning the right side by switching to `CollectLeft`.
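
As a rough illustration of the size-ratio heuristic described above, here is a minimal sketch in Rust. The names `HashJoinMode` and `choose_hash_join_mode` are hypothetical and are not DataFusion's actual API. The row-count inputs stand in for whatever statistics the planner has, and the threshold is a tunable parameter (the 100x figure is only the example mentioned above, not a benchmarked default). Falling back to partitioning when statistics are missing is also an assumption.

```rust
/// Hypothetical enum for illustration only; not DataFusion's real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HashJoinMode {
    /// Build one hash table from the whole left side (single-threaded build).
    CollectLeft,
    /// Hash-repartition both sides and build one hash table per partition.
    Partitioned,
}

/// Pick a hash join mode from estimated row counts.
///
/// `ratio_threshold` is an assumed tunable: `CollectLeft` is chosen only when
/// the right (probe) side is more than `ratio_threshold` times as large as
/// the left (build) side. With missing statistics this sketch falls back to
/// `Partitioned`, which is an assumption rather than a recommendation.
pub fn choose_hash_join_mode(
    left_rows: Option<usize>,
    right_rows: Option<usize>,
    ratio_threshold: usize,
) -> HashJoinMode {
    match (left_rows, right_rows) {
        // saturating_mul avoids overflow for very large estimates.
        (Some(left), Some(right)) if right > left.saturating_mul(ratio_threshold) => {
            HashJoinMode::CollectLeft
        }
        _ => HashJoinMode::Partitioned,
    }
}
```

For example, `choose_hash_join_mode(Some(1_000), Some(200_000), 100)` selects `CollectLeft`, because repartitioning the much larger probe side would likely cost more than the single-threaded build. A real rule would probably also want byte sizes and partition counts, and the right threshold can only come from benchmarking.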

