You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/08 08:58:53 UTC

[GitHub] [arrow-datafusion] mingmwang commented on issue #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats

mingmwang commented on issue #4139:
URL: https://github.com/apache/arrow-datafusion/issues/4139#issuecomment-1306856449

   @alamb @andygrove @isidentical @Dandandan @yahoNanJing 
   I plan to add a new physical rule for physical join implementation selection based on stats.
   
   The basic idea is if one of side(left or right) is small enough(threshold on size or row count, for example 10M), will choose HashJoin:CollectLeft.
   Else if either side exceed that threshold but still smaller than hash join threshold(estimated_size/partition_count <= 128M for example), will choose  HashJoin: Partitioned.
   Otherwise choose SortMergeJoin.
   
   Please share your thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org