You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/08 08:45:55 UTC

[GitHub] [arrow-datafusion] mingmwang opened a new issue, #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats

mingmwang opened a new issue, #4139:
URL: https://github.com/apache/arrow-datafusion/issues/4139

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for this feature, in addition to  the *what*)
   
   **Describe the solution you'd like**
   A clear and concise description of what you want to happen.
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mingmwang commented on issue #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats

Posted by GitBox <gi...@apache.org>.

mingmwang commented on issue #4139:
URL: https://github.com/apache/arrow-datafusion/issues/4139#issuecomment-1306856449

   @alamb @andygrove @isidentical @Dandandan @yahoNanJing 
   I plan to add a new physical rule for physical join implementation selection based on stats.
   
   The basic idea is if one of side(left or right) is small enough(threshold on size or row count, for example 10M), will choose HashJoin:CollectLeft.
   Else if either side exceed that threshold but still smaller than hash join threshold(estimated_size/partition_count <= 128M for example), will choose  HashJoin: Partitioned.
   Otherwise choose SortMergeJoin.
   
   Please share your thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on issue #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #4139:
URL: https://github.com/apache/arrow-datafusion/issues/4139#issuecomment-1306880712

   Sounds like a good plan.
   
   For hash join, probably needs some benchmarking to figure out good defaults and avoid performance degradation. `CollectLeft` limits the amount of parellization on the left side: building the hash table is relatively expensive and is done (at least currently) in a single thread. In quite a few cases it might be more beneficial to do a (local) hash repartitioning which is relatively cheap.
   It also depends on the size of the probe/right side: if that's e.g. >100x as big as the left side it might be beneficial to avoid the hash repartitioning on the right side by switching to `CollectLeft`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats

Posted by GitBox <gi...@apache.org>.

alamb closed issue #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft)  or SortMergeJoin base on Stats
URL: https://github.com/apache/arrow-datafusion/issues/4139


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #4139:
URL: https://github.com/apache/arrow-datafusion/issues/4139#issuecomment-1307854196

   I agree with @Dandandan  that doing some benchmarking on sort merge join is likely a good idea. I don't think we have much at the moment as it is never used.
   
   What I think would be an ideal solution, though maybe harder to implement, is to *NOT* decide between has or sort merge join at plan time, but to decide at runtime
   
   So that would involve starting out using HashJoin but if the hash table spilled (or exceeded some size threshold) sort/spill it and then switch to sort-merge-join
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org