Posted to jira@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2020/11/14 16:03:00 UTC
[jira] [Assigned] (ARROW-9423) [Rust][DataFusion] Add join
[ https://issues.apache.org/jira/browse/ARROW-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Grove reassigned ARROW-9423:
---------------------------------
Assignee: (was: Andy Grove)
> [Rust][DataFusion] Add join
> ---------------------------
>
> Key: ARROW-9423
> URL: https://issues.apache.org/jira/browse/ARROW-9423
> Project: Apache Arrow
> Issue Type: Task
> Components: Rust - DataFusion
> Reporter: Jorge Leitão
> Priority: Major
>
> A major operation in analytics is the join. This issue concerns adding the join operation.
> Given the complexity of this task, I propose starting with a subset of all joins: a hash join whose "ON" clause can only be a set of column names (i.e. no expressions).
> Suggested definition of done (DoD):
> * physical plan to execute the join
> * logical plan with the join
> * SQL planner with the join
> * tests on each of the above
> One idea for performing this join in parallel is, for each RecordBatch on the left, to perform the join with a record on the right. Another is to first hash by key and sort on both sides, and then perform a "SortMergeJoin" on each of the partitions. There may be better ways to achieve this, though.
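For illustration only, the proposed starting point (an inner hash join whose "ON" is a list of column names) might be sketched as below. This is not DataFusion's API: it uses only the Rust standard library, models a row as a map from column name to an i64 value instead of an Arrow RecordBatch, and the names Row, row, key_of and hash_join are hypothetical.

```rust
use std::collections::HashMap;

/// Toy row: column name -> value. An assumption for this sketch;
/// a real implementation would work on Arrow RecordBatches.
type Row = HashMap<String, i64>;

/// Hypothetical helper that builds a row from (column, value) pairs.
fn row(cols: &[(&str, i64)]) -> Row {
    cols.iter().map(|(k, v)| (k.to_string(), *v)).collect()
}

/// Extracts the join key (the values of the `on` columns) from a row.
/// Returns None if any of the key columns is missing.
fn key_of(row: &Row, on: &[&str]) -> Option<Vec<i64>> {
    on.iter().map(|c| row.get(*c).copied()).collect()
}

/// Inner hash join of `left` and `right` on the columns named in `on`.
fn hash_join(left: &[Row], right: &[Row], on: &[&str]) -> Vec<Row> {
    // Build phase: index the left rows by their join key.
    let mut table: HashMap<Vec<i64>, Vec<&Row>> = HashMap::new();
    for l in left {
        if let Some(key) = key_of(l, on) {
            table.entry(key).or_default().push(l);
        }
    }

    // Probe phase: look up each right row's key in the hash table.
    let mut out = Vec::new();
    for r in right {
        if let Some(key) = key_of(r, on) {
            if let Some(matches) = table.get(&key) {
                for l in matches {
                    // Start from the left row, then add the right-only columns.
                    let mut joined: Row = (**l).clone();
                    for (k, v) in r {
                        joined.entry(k.clone()).or_insert(*v);
                    }
                    out.push(joined);
                }
            }
        }
    }
    out
}

fn main() {
    let left = vec![row(&[("id", 1), ("a", 10)]), row(&[("id", 2), ("a", 20)])];
    let right = vec![row(&[("id", 2), ("b", 200)]), row(&[("id", 3), ("b", 300)])];

    // Only id = 2 appears on both sides.
    let joined = hash_join(&left, &right, &["id"]);
    assert_eq!(joined.len(), 1);
    assert_eq!(joined[0]["a"], 20);
    assert_eq!(joined[0]["b"], 200);
    println!("{} joined row(s)", joined.len());
}
```

In this sketch the probe phase only reads the hash table, so the right-hand rows could in principle be split into chunks and probed independently. Whether that fits the parallel strategies discussed in the issue is left open here.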
--
This message was sent by Atlassian Jira
(v8.3.4#803005)