You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Vladimir Sitnikov <si...@gmail.com> on 2014/11/17 09:37:21 UTC

Correlated vs Semi joins

Hi,

As I develop EnumerableCorrelate executor, I struggle how semi-joins
should play with correlated-joins.

For instance: we have SemiJoin (that is basically a HashJoinSemi) and
CorrelateRel (that is NestedLoops). I wonder how one should represent
CorrelateSemiJoin (==NestedLoopsSemi) that is both correlated and
semi-join.
Ok, suppose we solve this somehow, then AntiJoin comes in with
correlated (e.g NestedLoopsAnti) and non-correlated (e.g.
HashJoinAnti) versions.

I am not sure if just adding these variations to the core would do the
trick since this would explode the rules.
E.g. the rule that cares of semi- wants to treat NestedLoopsSemi and
HashJoinSemi as semi-join and the rule that cares of correlations
(e.g. the one that de-correlates) does not care much if the join is
regular semi-join or anti-join.


-- 
Regards,
Vladimir Sitnikov

Re: Correlated vs Semi joins

Posted by Julian Hyde <ju...@gmail.com>.
+1

One minor tweak: just call it “left”, not “left_outer". There is no “left inner”.

On Nov 23, 2014, at 7:54 AM, Vladimir Sitnikov <si...@gmail.com> wrote:

> I am still playing with LogicalCorrelate and here's the current status:
> 0) The sample query and plans can be seen here:
> https://gist.github.com/vlsi/e08f53b3fd0b3db465c0
> 
> 1) Currently I use the following class hierarchy:
> Join -> BiRel -> AbstractRelNode
> Correlate -> BiRel -> AbstractRelNode
> So, Correlate is not a Join.
> 
> 2) Correlate node takes two inputs and _it_ assigns correlation
> variable to the left input. Single Correlate always assigns just a
> single variable. The var references the whole row, so no harm is made.
> Additionally, Correlate holds a bitset of the required columns from
> the left part.
> 
> 3) I killed RelNode.setCorrelVariable,
> RelNode.getOrCreateCorrelVariable, and RelNode.registerCorrelVariable.
> It is error-prone to keep the variable during tree rewrites. The
> required variable is always managed by Correlate node. This ensures
> you won't forget to copy the variable from old relation to new one.
> 
> 4) I plan to map execution operations as follows:
> NestedLoops     -> Correlate(A, B, regular)
> NestedLoopsOuter-> Correlate(A, B, left_outer)
> NestedLoopsSemi -> Correlate(A, B, semi)
> NestedLoopsAnti -> Correlate(A, B, anti)
> 
> HashJoin        -> EquiJoin(A, B) <-- existing join
> HashJoinOuter   -> EquiJoin(A, B) <-- existing join
> HashJoinSemi    -> SemiJoin(A, B) <-- existing join
> HashJoinAnti    -> ???  It can be a SemiJoin(A, B, anti) or brand-new
> AntiJoin(A, B)
> 
> In the first version of Correlate patch I plan to implement
> regular/left_outer Correlate expressions.
> I think semi and anti variants can be added later. SemiJoin is weird
> enough, so I think adding "anti" option to a SemiJoin is better than
> adding brand new AntiJoin.
> 
> The tricky part with anti join is `not in (nullable)` semantics,
> however that is a different story.
> 
> 5)
>> Since variables are only currently defined by nested-loops joins, I suppose we could have used the word "define" instead of "stop". But anyway. The concept is the same, whatever name we choose for it.
> 
> Got it. I will add this ASCII to the javadoc. Looks like Correlate
> will always set and stop exactly one variable, so the type of
> List<String> stopped is a bit over-engineered.
> 
> Vladimir


Re: Correlated vs Semi joins

Posted by Vladimir Sitnikov <si...@gmail.com>.
I am still playing with LogicalCorrelate and here's the current status:
0) The sample query and plans can be seen here:
https://gist.github.com/vlsi/e08f53b3fd0b3db465c0

1) Currently I use the following class hierarchy:
 Join -> BiRel -> AbstractRelNode
 Correlate -> BiRel -> AbstractRelNode
So, Correlate is not a Join.

2) Correlate node takes two inputs and _it_ assigns correlation
variable to the left input. Single Correlate always assigns just a
single variable. The var references the whole row, so no harm is made.
Additionally, Correlate holds a bitset of the required columns from
the left part.

3) I killed RelNode.setCorrelVariable,
RelNode.getOrCreateCorrelVariable, and RelNode.registerCorrelVariable.
It is error-prone to keep the variable during tree rewrites. The
required variable is always managed by Correlate node. This ensures
you won't forget to copy the variable from old relation to new one.

4) I plan to map execution operations as follows:
 NestedLoops     -> Correlate(A, B, regular)
 NestedLoopsOuter-> Correlate(A, B, left_outer)
 NestedLoopsSemi -> Correlate(A, B, semi)
 NestedLoopsAnti -> Correlate(A, B, anti)

 HashJoin        -> EquiJoin(A, B) <-- existing join
 HashJoinOuter   -> EquiJoin(A, B) <-- existing join
 HashJoinSemi    -> SemiJoin(A, B) <-- existing join
 HashJoinAnti    -> ???  It can be a SemiJoin(A, B, anti) or brand-new
AntiJoin(A, B)

In the first version of Correlate patch I plan to implement
regular/left_outer Correlate expressions.
I think semi and anti variants can be added later. SemiJoin is weird
enough, so I think adding "anti" option to a SemiJoin is better than
adding brand new AntiJoin.

The tricky part with anti join is `not in (nullable)` semantics,
however that is a different story.

5)
> Since variables are only currently defined by nested-loops joins, I suppose we could have used the word "define" instead of "stop". But anyway. The concept is the same, whatever name we choose for it.

Got it. I will add this ASCII to the javadoc. Looks like Correlate
will always set and stop exactly one variable, so the type of
List<String> stopped is a bit over-engineered.

Vladimir

Re: Correlated vs Semi joins

Posted by Julian Hyde <ju...@gmail.com>.
> On Nov 18, 2014, at 10:45 AM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> By the way, what is Join.getVariablesStopped? It is always initialized
> to empty set, however JoinUnionTransposeRule tries to verify if the
> set is not empty.

The notion of "stopping" variables is very important for correlated plans.

Consider this tree:

         R4
         /   \
       R3  R8
      /   \
    R2   R5
   /         /   \
R1      R6  R7

Let's suppose that R3 is a nested loop join. It sets a correlating variable V for each record it reads from R2, then it restarts R5 (which restarts its inputs R6 and R7) so they can see the new value. Let's suppose that R7 and R6 both read V.

V is only visible to R5, R6, R7.

It is marked as "stopped" by R3, which means that it doesn't propagate up the tree, and is not visible to R4 or R8.

Correlating variables complicate algebraic transformations in a big way, so we want to keep them as localized as possible. The is what "stopping" is all about.

Since variables are only currently defined by nested-loops joins, I suppose we could have used the word "define" instead of "stop". But anyway. The concept is the same, whatever name we choose for it.

Julian


Re: Correlated vs Semi joins

Posted by Vladimir Sitnikov <si...@gmail.com>.
> If they are only expecting a regular join we can't hand
> them a weird join such as a semi-join.

It is not just a problem with joins. It is a common problem since
Calcite tries to have some common denominator across execution
engines.

A system does not always support "grouping sets", so you might need to
transform it to "union all".
Not every system supports AVG/STDDEV_POP/etc functions, so you have
AggregateReduceFunctionsRule.

Ok, what am I trying to say? I think we might want to keep high-level
nodes around for ease of optimization, then lower "instruction set" if
required.
I do not like "expand early" approach since it might be a lot harder
to pattern-match a high-level concept out of low-level
filter-aggregate-project soup.

> Everyone who is handed a RelNode needs to understand exactly what they
> are getting.

Consider JoinUnionTransposeRule.
It matches things like operand(Join.class,
    operand(RelNode.class, any()),
    operand(Union.class, any()))
and tries to push join down to the union.

If I read the code correctly, it will blow up if it happen to run over
SemiJoin(A, UnionAll(B, C)).
The resulting UnionAll(SemiJoin(A, B), SemiJoin(A, C)) would be invalid.

By the way, what is Join.getVariablesStopped? It is always initialized
to empty set, however JoinUnionTransposeRule tries to verify if the
set is not empty.

Vladimir

Re: Correlated vs Semi joins

Posted by Julian Hyde <ju...@hydromatic.net>.
I don't know the answer. I don't think there is an easy way to manage
the complexity caused by lots of different kinds of relational
expressions.

I strongly feel that a "behavioral" approach (adding interfaces &
methods) is wrong. Relational expressions should be modeled using a
"data" approach (adding sub-classes and data members). For this
reason, our RelNodes are what in Scala would be called case-classes.
Data members are public and final, and there are few non-trivial
methods.

Everyone who is handed a RelNode needs to understand exactly what they
are getting. If they are only expecting a regular join we can't hand
them a weird join such as a semi-join.

That implies that weird relational expressions become their own
sub-classes and need to have special rules written for them. That
sounds expensive, but adding a new type of relational expression is
expensive, no matter how you try to distribute the cost.

I wrestled with this issue when introducing support for grouping sets.
In the end I added

  boolean indicator;
  ImmutableList<ImmutableBitSet> groupSets;

to Aggregate, and imposed a burden on all rules, present and future,
that match Aggregate. I only did this after I became convinced that
adding a new GroupingSetsAggregate relational expression would be
worse.

I haven't made up my my mind, but I am inclined to think that SemiJoin
should not be a join, but a top-level relational expression, because
its behavior has as much in common with Filter as Join.

Julian




On Tue, Nov 18, 2014 at 3:48 AM, Vladimir Sitnikov
<si...@gmail.com> wrote:
>> What is your point, exactly?
>
> I wonder if we can use some kind of bridge pattern [1] to make (semi,
> anti, regular) x (inner, left, right, full) x (correlated, regular) x
> (equi, theta) manageable.
>
> For instance, JoinRelType already factors (inner, left, right, full) part.
>
> What if we introduce fields in Join.class for:
> 1) joinType == (semi, anti, regular)
> 2) correlationType == (correlated, regular)  (it might be just a boolean)
>
> Then a single Join.class should be sufficient (no more SemiJoin & Co).
>
> Not sure about "(equi, theta)". Is this something a rule should know about?
> As far as I understand, theta join is exactly a correlated join with
> empty set of correlations. So I see no much reason adding theta into
> the equation.
>
>> You started off this thread arguing that there was not much point for
> a correlated semi-join.
>
> I would like to see correlated joins as well as anti joins added in
> Calcite ("...how semi-joins
> should play with correlated-joins")
> I just do not feel current model of adding join types (new Join
> subclass for each join type) is scalable enough to handle addition of
> correlated and anti joins.
>
> [1]: http://en.wikipedia.org/wiki/Bridge_pattern
>
> Vladimir

Re: Correlated vs Semi joins

Posted by Vladimir Sitnikov <si...@gmail.com>.
> What is your point, exactly?

I wonder if we can use some kind of bridge pattern [1] to make (semi,
anti, regular) x (inner, left, right, full) x (correlated, regular) x
(equi, theta) manageable.

For instance, JoinRelType already factors (inner, left, right, full) part.

What if we introduce fields in Join.class for:
1) joinType == (semi, anti, regular)
2) correlationType == (correlated, regular)  (it might be just a boolean)

Then a single Join.class should be sufficient (no more SemiJoin & Co).

Not sure about "(equi, theta)". Is this something a rule should know about?
As far as I understand, theta join is exactly a correlated join with
empty set of correlations. So I see no much reason adding theta into
the equation.

> You started off this thread arguing that there was not much point for
a correlated semi-join.

I would like to see correlated joins as well as anti joins added in
Calcite ("...how semi-joins
should play with correlated-joins")
I just do not feel current model of adding join types (new Join
subclass for each join type) is scalable enough to handle addition of
correlated and anti joins.

[1]: http://en.wikipedia.org/wiki/Bridge_pattern

Vladimir

Re: Correlated vs Semi joins

Posted by Julian Hyde <ju...@hydromatic.net>.
I'm not denying that a correlated semi-join would be useful. But
(semi, regular) x (inner, left, right, full) x (correlated, regular) x
(equi, theta) is a lot of combinations. If one component handles all
combinations it is bound to be more complicated.

>> A semi-join can be represented using a combination of join, aggregate
>> and project.
>
> Why do we need SemiJoin.class then?

Is that a serious question?

Yeah, we could do without it. We could do without Intersect and Minus,
too. But it would be harder to write rules to match interesting
patterns.

You started off this thread arguing that there was not much point for
a correlated semi-join. Now you seem to be arguing strenuously for it.
What is your point, exactly?

Julian

Re: Correlated vs Semi joins

Posted by Vladimir Sitnikov <si...@gmail.com>.
> It seems too complicated

What is so complicated?
When indices exists, correlated plans of different flavors are desirable.

> A semi-join can be represented using a combination of join, aggregate
> and project.

Why do we need SemiJoin.class then?

Vladimir

Re: Correlated vs Semi joins

Posted by Julian Hyde <ju...@gmail.com>.
A correlated semi-join would probably have well-defined semantics (it
would be basically EXISTS with a correlating variable) and might even
be useful but I agree with your instincts that you shouldn't build
one. It seems too complicated, and would not be used often enough to
get the bugs out.

A semi-join can be represented using a combination of join, aggregate
and project. So, I think a semi-join could, also. You would be
re-using simple, well-tested components, and that is better than
creating a complex, poorly tested component.

Julian


On Mon, Nov 17, 2014 at 12:37 AM, Vladimir Sitnikov
<si...@gmail.com> wrote:
> Hi,
>
> As I develop EnumerableCorrelate executor, I struggle how semi-joins
> should play with correlated-joins.
>
> For instance: we have SemiJoin (that is basically a HashJoinSemi) and
> CorrelateRel (that is NestedLoops). I wonder how one should represent
> CorrelateSemiJoin (==NestedLoopsSemi) that is both correlated and
> semi-join.
> Ok, suppose we solve this somehow, then AntiJoin comes in with
> correlated (e.g NestedLoopsAnti) and non-correlated (e.g.
> HashJoinAnti) versions.
>
> I am not sure if just adding these variations to the core would do the
> trick since this would explode the rules.
> E.g. the rule that cares of semi- wants to treat NestedLoopsSemi and
> HashJoinSemi as semi-join and the rule that cares of correlations
> (e.g. the one that de-correlates) does not care much if the join is
> regular semi-join or anti-join.
>
>
> --
> Regards,
> Vladimir Sitnikov