You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Julian Hyde <ju...@hydromatic.net> on 2014/08/11 20:19:52 UTC

Semi-join orientation

I am adding rules to make semi-joins more practical [1] [2] [3]. Which side of a semi-join is conventionally the “build” side? Right now it’s the right. For example, the query

  select * from dept where deptno in (select deptno from emp where gender =‘F’)

becomes

  SemiJoin(Scan(dept), Filter(Scan(emp), gender = ‘F’), 0.deptno = 1.deptno)

Filter(Scan(emp), gender = ‘F’) is the “build” side, and you can imagine implementing it by populating a hash-table with distinct keys before starting to read from the “dept” table.

If we’re generating left-deep trees, we put the smaller input on the left-hand side. So, should the “build” side of a semi-join go on the left also?

Julian

[1] https://issues.apache.org/jira/browse/OPTIQ-367
[2] https://issues.apache.org/jira/browse/OPTIQ-368
[3] https://issues.apache.org/jira/browse/OPTIQ-369

Re: Semi-join orientation

Posted by Vladimir Sitnikov <si...@gmail.com>.

I've logged an issue for alternative implementation of semi-joins as
https://issues.apache.org/jira/browse/OPTIQ-379.

I believe the amount of required memory (see in the issue) should be good
costing for the current enumerable implementation (no spill to disk
materializations, etc).

Vladimir

Re: Semi-join orientation

Posted by Julian Hyde <ju...@gmail.com>.

That makes sense.

In https://issues.apache.org/jira/browse/OPTIQ-369, I have implemented EnumerableSemiJoinRel. An example (from misc.oq):

select * from "hr"."emps"
where exists (
  select 1 from "hr"."depts" where "depts"."deptno" = "emps"."deptno");
+-------+--------+-----------+---------+------------+
| empid | deptno | name      | salary  | commission |
+-------+--------+-----------+---------+------------+
|   100 |     10 | Bill      | 10000.0 |       1000 |
|   110 |     10 | Theodore  | 11500.0 |        250 |
|   150 |     10 | Sebastian | 7000.0  |            |
+-------+--------+-----------+---------+------------+
(3 rows)

!ok
EnumerableSemiJoinRel(condition=[=($1, $5)], joinType=[inner])
  EnumerableTableAccessRel(table=[[hr, emps]])
  EnumerableCalcRel(expr#0..3=[{inputs}], expr#4=[true], $f01=[$t0], $f0=[$t4])
    EnumerableJoinRel(condition=[=($0, $1)], joinType=[inner])
      EnumerableAggregateRel(group=[{0}])
        EnumerableCalcRel(expr#0..4=[{inputs}], $f0=[$t1])
          EnumerableTableAccessRel(table=[[hr, emps]])
      EnumerableTableAccessRel(table=[[hr, depts]])

The generated code uses the new Enumerables.semiJoin function, which builds a HashSet of the keys the “inner” side (depts.deptno) then probes using a stream from the “outer” side (emps).

The other use of “depts” in the query plan is as a “value generator”. As https://issues.apache.org/jira/browse/OPTIQ-375 notes, the value generator is often not necessary.

Vladimir, 

Can you please log a JIRA case for your proposed “reverse” implementation of semi-join where you build the non-semi-join side? It would help if you add a cost function that would help us decide which implementation to use.

Julian

 
On Aug 11, 2014, at 1:48 PM, Vladimir Sitnikov <si...@gmail.com> wrote:

>> Filter(Scan(emp), gender = ‘F’) is the “build” side, and you can imagine
> implementing it by populating a hash-table with distinct keys before
> starting to read from the “dept” table.
> 
> There is more than one way to skin a hash semi-join.
> 
> Imagine the following physical plan:
> 1) Scan(dept) first and build Map<deptno, List<dept.row>>
> 2) Perform Filter(Scan(emp), gender = ‘F’) and check if the join key is in
> memory. If found, remove the entry from the map and return rows. If not
> found, just ignore the row.
> 
> This can be more efficient in case dept is smaller than Filter(Scan(emp),
> gender = ‘F’).
> 
> I suggest the following:
> 1) Let the right side of the semi join be the one that is semi-joined (as
> in your example above)
> 2) "build" side (in your terms above) is more of a physical plan, thus it
> is irrelevant at the logical level (see below). For example, for enumerable
> convention we could have EnumerableHashJoinSemi (builds table over left
> relation) and EnumerableHashJoinSemiRight (builds table over right
> relation).
> 
> Vladimir

Re: Semi-join orientation

Posted by Vladimir Sitnikov <si...@gmail.com>.

>Filter(Scan(emp), gender = ‘F’) is the “build” side, and you can imagine
implementing it by populating a hash-table with distinct keys before
starting to read from the “dept” table.

There is more than one way to skin a hash semi-join.

Imagine the following physical plan:
1) Scan(dept) first and build Map<deptno, List<dept.row>>
2) Perform Filter(Scan(emp), gender = ‘F’) and check if the join key is in
memory. If found, remove the entry from the map and return rows. If not
found, just ignore the row.

This can be more efficient in case dept is smaller than Filter(Scan(emp),
gender = ‘F’).

I suggest the following:
1) Let the right side of the semi join be the one that is semi-joined (as
in your example above)
2) "build" side (in your terms above) is more of a physical plan, thus it
is irrelevant at the logical level (see below). For example, for enumerable
convention we could have EnumerableHashJoinSemi (builds table over left
relation) and EnumerableHashJoinSemiRight (builds table over right
relation).

Vladimir