You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2020/12/31 19:34:00 UTC

[jira] [Updated] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join

     [ https://issues.apache.org/jira/browse/ARROW-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-11094:
-------------------------------
    Description: 
The current hash join works well when one side of the join can be loaded into memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the left and right partitions, and write the intermediate results to disk, and then stream both sides of the join by merging these sorted partitions and we do not need to load one side into memory. At most, we need to load all batches from both sides that contain the current join key values.

[https://en.wikipedia.org/wiki/Sort-merge_join]

  was:
The current hash join works well when one side of the join can be loaded into memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the left and right partitions in parallel and then stream both sides of the join by merging these sorted partitions and we do not need to load one side into memory. At most, we need to load all batches from both sides that contain the current join key values.

https://en.wikipedia.org/wiki/Sort-merge_join


> [Rust] [DataFusion] Implement Sort-Merge Join
> ---------------------------------------------
>
>                 Key: ARROW-11094
>                 URL: https://issues.apache.org/jira/browse/ARROW-11094
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Priority: Major
>             Fix For: 4.0.0
>
>
> The current hash join works well when one side of the join can be loaded into memory but cannot scale beyond the available RAM.
> The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the left and right partitions, and write the intermediate results to disk, and then stream both sides of the join by merging these sorted partitions and we do not need to load one side into memory. At most, we need to load all batches from both sides that contain the current join key values.
> [https://en.wikipedia.org/wiki/Sort-merge_join]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)