You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/04/29 19:17:00 UTC

[jira] [Comment Edited] (ARROW-15590) [C++] Add support for joins to the Substrait consumer

    [ https://issues.apache.org/jira/browse/ARROW-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530201#comment-17530201 ] 

Weston Pace edited comment on ARROW-15590 at 4/29/22 7:16 PM:
--------------------------------------------------------------

The Substrait spec (the website) doesn't always match the .proto yet.  This is not a great thing but it's a work in progress.  Feel free to open some PRs against the site if you want.  In the meantime I find it easier to work with the proto:

{noformat}
message JoinRel {
  RelCommon common = 1;
  Rel left = 2;
  Rel right = 3;
  Expression expression = 4;
  Expression post_join_filter = 5;

  JoinType type = 6;
  ...
}
{noformat}

The {{post_join_filter}} is not on the site today and should match {{HashJoinNodeOptions::filter}}.

{{LeftInput}} and {{RightInput}} correspond to the inputs specified when adding a join to a plan and so they aren't in {{HashJoinNodeOptions}}:

{noformat}
MakeExecNode("hashjoin", plan.get(), {LeftInput, RightInput}, join_options));
{noformat}

You are correct that we do not handle expressions in general for the join condition.  So I think the best thing to do here initially is restrict the set of allowed plans.  If the expression is not a call then reject it.  If the expression is a call then it must be one of two functions, "equal" or "is_not_distinct_from".  In either case the function has two arguments.  Both arguments must be a {{FieldReference}}.  We can convert from a Substrait {{FieldReference}} to an Arrow {{FieldRef}} and so that will give you left keys and right keys.  There is an Arrow options {{HashJoinNodeOptions::key_cmp}}.  If the Substrait function is "equal" then use {{JoinKeyCmp::Eq}}.  If the Substrait function is "is_not_distinct_from" then use {{JoinKeyCmp::Is}}.

With the above approach you will always have exactly one left key, one right key, and one join type.

Later (could be in this PR or a follow-up) we can also handle expressions that are an and'ed set of equality expressions:

{noformat}
and(equal(field(3),field(5)), equal(field(1),field(7)), equal(field(2), field(12)))
{noformat}

In this case the number of keys/join types you have would depend on the number of equality expressions in the and (3 in the above example).


was (Author: westonpace):
The Substrait spec (the website) doesn't always match the .proto yet.  This is not a great thing but it's a work in progress.  Feel free to open some PRs against the site if you want.  In the meantime I find it easier to work with the proto:

```
message JoinRel {
  RelCommon common = 1;
  Rel left = 2;
  Rel right = 3;
  Expression expression = 4;
  Expression post_join_filter = 5;

  JoinType type = 6;
  ...
}
```

The {{post_join_filter}} is not on the site today and should match {{HashJoinNodeOptions::filter}}.

{{LeftInput}} and {{RightInput}} correspond to the inputs specified when adding a join to a plan and so they aren't in {{HashJoinNodeOptions}}:

```
MakeExecNode("hashjoin", plan.get(), {LeftInput, RightInput}, join_options));
```

You are correct that we do not handle expressions in general for the join condition.  So I think the best thing to do here initially is restrict the set of allowed plans.  If the expression is not a call then reject it.  If the expression is a call then it must be one of two functions, "equal" or "is_not_distinct_from".  In either case the function has two arguments.  Both arguments must be a {{FieldReference}}.  We can convert from a Substrait {{FieldReference}} to an Arrow {{FieldRef}} and so that will give you left keys and right keys.  There is an Arrow options {{HashJoinNodeOptions::key_cmp}}.  If the Substrait function is "equal" then use {{JoinKeyCmp::Eq}}.  If the Substrait function is "is_not_distinct_from" then use {{JoinKeyCmp::Is}}.

With the above approach you will always have exactly one left key, one right key, and one join type.

Later (could be in this PR or a follow-up) we can also handle expressions that are an and'ed set of equality expressions:

{noformat}
and(equal(field(3),field(5)), equal(field(1),field(7)), equal(field(2), field(12)))
{noformat}

In this case the number of keys/join types you have would depend on the number of equality expressions in the and (3 in the above example).

> [C++] Add support for joins to the Substrait consumer
> -----------------------------------------------------
>
>                 Key: ARROW-15590
>                 URL: https://issues.apache.org/jira/browse/ARROW-15590
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: substrait
>
> The streaming execution engine supports joins.  The Substrait consumer does not currently consume joins.  We should add support for this.  We may want to split this PR into subtasks as there are many different kinds of joins and we may not support all of them immediately.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)