You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Yuzhao Chen <yu...@gmail.com> on 2019/04/22 13:08:51 UTC

Support non-equi join condition for EnumerableJoin, EnumerableMergeJoin and EnumerableSemiJoin ? Just a discussing..

EnumerableJoin EnumerableMergeJoin EnumerableSemiJoin all only support equi-join condition[1], and there are some
rules that match EquiJoin specificly, like FilterJoinRule will not push non-equi join predicates into the join condition[2], so what is the original intention that these join enumetables only have equi-join implementation ?

Do you think there is necessity to support non-join conditions for these joins ? I just found that there is a PR to support a hash-theta-join for theta join with equi-join condition[3].


[1] https://github.com/apache/calcite/blob/ee83efd360793ef4201f4cdfc2af8d837b76ca69/linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java#L1125
[2] https://github.com/apache/calcite/blob/ee83efd360793ef4201f4cdfc2af8d837b76ca69/core/src/main/java/org/apache/calcite/rel/rules/FilterJoinRule.java#L165
[3] https://github.com/apache/calcite/pull/1156

Best,
Danny Chan

Re: Support non-equi join condition for EnumerableJoin, EnumerableMergeJoin and EnumerableSemiJoin ? Just a discussing..

Posted by Yuzhao Chen <yu...@gmail.com>.
Thx, Stamatis

Finally I think that [3] will provide better compatibility, just like you said that some engines may not support non-equi HashJoin, so pushing into non-equi join conditions seems not that necessary.

The better way is to give more join algorithm :), I will try my best to implement more algorithms if I have time.

Best,
Danny Chan
在 2019年4月26日 +0800 PM9:29,Stamatis Zampetakis <za...@gmail.com>,写道:
> Hi Danny,
>
> Yes, I agree that we don't need two variations of hash joins
> (EnumerableHashJoin and EnumerableThetaHashJoin) as it is the code right
> now.
>
> Having that said, I am not sure if we can really handle the general case of
> theta joins with hash-based algorithms.
>
> The approach in [3] does not really handle general theta joins but allows
> to split a theta condition into equi and non-equi parts where the non-equi
> part is executed as a post-join filter.
> Since the post-join filter is part of the join operator we can say that we
> have a theta hash join operator but this is partially true.
>
> For instance, the following query:
>
> SELECT e1.name
> FROM emp e1
> INNER JOIN emp e2
> ON e1.salary < e2.salary
>
> cannot use at all hash join algorithm [3] because there is no equality
> condition.
>
> I wouldn't mind keeping EnumerableHashJoin only for equijoins and execute
> the post-join filter as such but this would make outer joins
> difficult/impossible to handle so I believe [3] is a good step forward.
>
> Best,
> Stamatis
>
>
>
>
>
> On Wed, Apr 24, 2019 at 6:28 AM Yuzhao Chen <yu...@gmail.com> wrote:
>
> > Julian,
> >
> > I also agree with you that we should distinguish between
> > EnumerableHashJoin and EnumerableMergeJoin, what i want to address about is
> > if we should extend the EnumerableHashJoin and EnumerableMergeJoin to let
> > them support Thera join conditions except the EquiJoin conditions.
> >
> > It seems that Stamatis also agree that we can extend the existing
> > EnumerableXXXs, correct me if i have wrong understanding.
> >
> >
> > Best,
> > Danny Chan
> > 在 2019年4月24日 +0800 AM2:18,Julian Hyde <jh...@apache.org>,写道:
> > > I agree with Stamatis.
> > >
> > > Let’s suppose we have a merge join and hash join in enumerable
> > convention. The merge join requires its input to be sorted, the hash join
> > does not. They can both perform equi-join. The merge join can also perform
> > joins like
> > >
> > > FROM orders
> > > JOIN shipments
> > > ON shipments.shipDate BETWEEN orders.orderDate
> > > AND orders.orderDate + INTERVAL ‘7’ DAY
> > >
> > > Should we use the same sub-class of RelNode, EnumerableJoin, for both,
> > or should we create sub-classes EnumerableHashJoin and EnumerableMergeJoin?
> > >
> > > I think we should do the latter:
> > > * They are clearly different algorithms, and we generate different code
> > for them
> > > * They have different cost models
> > > * They have different input and output traits (the merge join requires
> > sorted inputs and produces a sorted output)
> > > * If we have an equi-join, we might want to generate both options and
> > see how they compare (the merge join might win if one or both of the inputs
> > are sorted), and both algorithms used the EnumerableJoin class we would not
> > be able to distinguish the RelNode instances.
> > >
> > > Julian
> > >
> > >
> > > > On Apr 23, 2019, at 11:06 AM, Stamatis Zampetakis <za...@gmail.com>
> > wrote:
> > > >
> > > > In the literature, there are many algorithms to perform a join and the
> > vast
> > > > majority of them cannot handle theta conditions (either for performance
> > > > reasons, or because the algorithm was simply not meant for it).
> > > >
> > > > For the existing EnumerableXXXJoin variants, it may be possible to
> > extend
> > > > the algorithm to handle theta joins without structural changes to the
> > > > algorithm (as I think it is the case in [3]).
> > > > For these cases, I agree with you it does not make sense to have many
> > > > variants (e.g., EnumerableThetaHashJoin seems redundant).
> > > >
> > > > In the future we may decide to use other hash-based implementations
> > with
> > > > certain applicability restrictions (such as apply only on equijoins)
> > but
> > > > let's not worry too much about that now.
> > > >
> > > > Regarding the rules, I think it is useful to have different variants.
> > > > Consider for example a system that does not have join processing
> > algorithms
> > > > that can treat theta-joins; pushing non-equi conditions in a join seems
> > > > undesirable in this case.
> > > >
> > > > For more information regarding join processing, there are a few
> > interesting
> > > > (and old) surveys [4, 5] which outline some of the most typical
> > algorithms
> > > > that are used in relational databases.
> > > >
> > > > Best,
> > > > Stamatis
> > > >
> > > > [4] Query evaluation techniques for Large databases (
> > > > https://web.stanford.edu/class/cs346/2014/graefe.pdf)
> > > > [5] Join processing in relational databases (
> > > > https://www.csd.uoc.gr/~hy460/pdf/p63-mishra.pdf)
> > > >
> > > >
> > > > On Tue, Apr 23, 2019 at 8:05 AM Yuzhao Chen <yu...@gmail.com>
> > wrote:
> > > >
> > > > > Thx, Julian
> > > > >
> > > > > Why not just support non-equi join condition for every physical
> > algorithm,
> > > > > it does not make much sense if we have both HashJoin and a
> > HashTheraJoin,
> > > > > cause a HashThataJoin with empty non-equi join condition is same as a
> > > > > HashJoin.
> > > > >
> > > > > And we can remove the limitations in the rule like FilterJoinRule.
> > > > >
> > > > > Best,
> > > > >
> > > > > Danny Chan
> > > > > 在 2019年4月23日 +0800 AM3:21,dev@calcite.apache.org,写道:
> > > > > >
> > > > > > If there are limitations, over time we would like to remove those
> > > > > limitations, but we will probably do it by adding new algorithms, and
> > > > > therefore new EnumerableXxx classes.
> > > > >
> > >
> >

Re: Support non-equi join condition for EnumerableJoin, EnumerableMergeJoin and EnumerableSemiJoin ? Just a discussing..

Posted by Stamatis Zampetakis <za...@gmail.com>.
Hi Danny,

Yes, I agree that we don't need two variations of hash joins
(EnumerableHashJoin and EnumerableThetaHashJoin) as it is the code right
now.

Having that said, I am not sure if we can really handle the general case of
theta joins with hash-based algorithms.

The approach in [3] does not really handle general theta joins but allows
to split a theta condition into equi and non-equi parts where the non-equi
part is executed as a post-join filter.
Since the post-join filter is part of the join operator we can say that we
have a theta hash join operator but this is partially true.

For instance, the following query:

SELECT e1.name
FROM emp e1
INNER JOIN emp e2
  ON e1.salary < e2.salary

cannot use at all hash join algorithm [3] because there is no equality
condition.

I wouldn't mind keeping EnumerableHashJoin only for equijoins and execute
the post-join filter as such but this would make outer joins
difficult/impossible to handle so I believe [3] is a good step forward.

Best,
Stamatis





On Wed, Apr 24, 2019 at 6:28 AM Yuzhao Chen <yu...@gmail.com> wrote:

> Julian,
>
> I also agree with you that we should distinguish between
> EnumerableHashJoin and EnumerableMergeJoin, what i want to address about is
> if we should extend the EnumerableHashJoin and EnumerableMergeJoin to let
> them support Thera join conditions except the EquiJoin conditions.
>
> It seems that Stamatis also agree that we can extend the existing
> EnumerableXXXs, correct me if i have wrong understanding.
>
>
> Best,
> Danny Chan
> 在 2019年4月24日 +0800 AM2:18,Julian Hyde <jh...@apache.org>,写道:
> > I agree with Stamatis.
> >
> > Let’s suppose we have a merge join and hash join in enumerable
> convention. The merge join requires its input to be sorted, the hash join
> does not. They can both perform equi-join. The merge join can also perform
> joins like
> >
> > FROM orders
> > JOIN shipments
> > ON shipments.shipDate BETWEEN orders.orderDate
> > AND orders.orderDate + INTERVAL ‘7’ DAY
> >
> > Should we use the same sub-class of RelNode, EnumerableJoin, for both,
> or should we create sub-classes EnumerableHashJoin and EnumerableMergeJoin?
> >
> > I think we should do the latter:
> > * They are clearly different algorithms, and we generate different code
> for them
> > * They have different cost models
> > * They have different input and output traits (the merge join requires
> sorted inputs and produces a sorted output)
> > * If we have an equi-join, we might want to generate both options and
> see how they compare (the merge join might win if one or both of the inputs
> are sorted), and both algorithms used the EnumerableJoin class we would not
> be able to distinguish the RelNode instances.
> >
> > Julian
> >
> >
> > > On Apr 23, 2019, at 11:06 AM, Stamatis Zampetakis <za...@gmail.com>
> wrote:
> > >
> > > In the literature, there are many algorithms to perform a join and the
> vast
> > > majority of them cannot handle theta conditions (either for performance
> > > reasons, or because the algorithm was simply not meant for it).
> > >
> > > For the existing EnumerableXXXJoin variants, it may be possible to
> extend
> > > the algorithm to handle theta joins without structural changes to the
> > > algorithm (as I think it is the case in [3]).
> > > For these cases, I agree with you it does not make sense to have many
> > > variants (e.g., EnumerableThetaHashJoin seems redundant).
> > >
> > > In the future we may decide to use other hash-based implementations
> with
> > > certain applicability restrictions (such as apply only on equijoins)
> but
> > > let's not worry too much about that now.
> > >
> > > Regarding the rules, I think it is useful to have different variants.
> > > Consider for example a system that does not have join processing
> algorithms
> > > that can treat theta-joins; pushing non-equi conditions in a join seems
> > > undesirable in this case.
> > >
> > > For more information regarding join processing, there are a few
> interesting
> > > (and old) surveys [4, 5] which outline some of the most typical
> algorithms
> > > that are used in relational databases.
> > >
> > > Best,
> > > Stamatis
> > >
> > > [4] Query evaluation techniques for Large databases (
> > > https://web.stanford.edu/class/cs346/2014/graefe.pdf)
> > > [5] Join processing in relational databases (
> > > https://www.csd.uoc.gr/~hy460/pdf/p63-mishra.pdf)
> > >
> > >
> > > On Tue, Apr 23, 2019 at 8:05 AM Yuzhao Chen <yu...@gmail.com>
> wrote:
> > >
> > > > Thx, Julian
> > > >
> > > > Why not just support non-equi join condition for every physical
> algorithm,
> > > > it does not make much sense if we have both HashJoin and a
> HashTheraJoin,
> > > > cause a HashThataJoin with empty non-equi join condition is same as a
> > > > HashJoin.
> > > >
> > > > And we can remove the limitations in the rule like FilterJoinRule.
> > > >
> > > > Best,
> > > >
> > > > Danny Chan
> > > > 在 2019年4月23日 +0800 AM3:21,dev@calcite.apache.org,写道:
> > > > >
> > > > > If there are limitations, over time we would like to remove those
> > > > limitations, but we will probably do it by adding new algorithms, and
> > > > therefore new EnumerableXxx classes.
> > > >
> >
>

Re: Support non-equi join condition for EnumerableJoin, EnumerableMergeJoin and EnumerableSemiJoin ? Just a discussing..

Posted by Yuzhao Chen <yu...@gmail.com>.
Julian,

I also agree with you that we should distinguish between EnumerableHashJoin and EnumerableMergeJoin, what i want to address about is if we should extend the EnumerableHashJoin and EnumerableMergeJoin to let them support Thera join conditions except the EquiJoin conditions.

It seems that Stamatis also agree that we can extend the existing EnumerableXXXs, correct me if i have wrong understanding.


Best,
Danny Chan
在 2019年4月24日 +0800 AM2:18,Julian Hyde <jh...@apache.org>,写道:
> I agree with Stamatis.
>
> Let’s suppose we have a merge join and hash join in enumerable convention. The merge join requires its input to be sorted, the hash join does not. They can both perform equi-join. The merge join can also perform joins like
>
> FROM orders
> JOIN shipments
> ON shipments.shipDate BETWEEN orders.orderDate
> AND orders.orderDate + INTERVAL ‘7’ DAY
>
> Should we use the same sub-class of RelNode, EnumerableJoin, for both, or should we create sub-classes EnumerableHashJoin and EnumerableMergeJoin?
>
> I think we should do the latter:
> * They are clearly different algorithms, and we generate different code for them
> * They have different cost models
> * They have different input and output traits (the merge join requires sorted inputs and produces a sorted output)
> * If we have an equi-join, we might want to generate both options and see how they compare (the merge join might win if one or both of the inputs are sorted), and both algorithms used the EnumerableJoin class we would not be able to distinguish the RelNode instances.
>
> Julian
>
>
> > On Apr 23, 2019, at 11:06 AM, Stamatis Zampetakis <za...@gmail.com> wrote:
> >
> > In the literature, there are many algorithms to perform a join and the vast
> > majority of them cannot handle theta conditions (either for performance
> > reasons, or because the algorithm was simply not meant for it).
> >
> > For the existing EnumerableXXXJoin variants, it may be possible to extend
> > the algorithm to handle theta joins without structural changes to the
> > algorithm (as I think it is the case in [3]).
> > For these cases, I agree with you it does not make sense to have many
> > variants (e.g., EnumerableThetaHashJoin seems redundant).
> >
> > In the future we may decide to use other hash-based implementations with
> > certain applicability restrictions (such as apply only on equijoins) but
> > let's not worry too much about that now.
> >
> > Regarding the rules, I think it is useful to have different variants.
> > Consider for example a system that does not have join processing algorithms
> > that can treat theta-joins; pushing non-equi conditions in a join seems
> > undesirable in this case.
> >
> > For more information regarding join processing, there are a few interesting
> > (and old) surveys [4, 5] which outline some of the most typical algorithms
> > that are used in relational databases.
> >
> > Best,
> > Stamatis
> >
> > [4] Query evaluation techniques for Large databases (
> > https://web.stanford.edu/class/cs346/2014/graefe.pdf)
> > [5] Join processing in relational databases (
> > https://www.csd.uoc.gr/~hy460/pdf/p63-mishra.pdf)
> >
> >
> > On Tue, Apr 23, 2019 at 8:05 AM Yuzhao Chen <yu...@gmail.com> wrote:
> >
> > > Thx, Julian
> > >
> > > Why not just support non-equi join condition for every physical algorithm,
> > > it does not make much sense if we have both HashJoin and a HashTheraJoin,
> > > cause a HashThataJoin with empty non-equi join condition is same as a
> > > HashJoin.
> > >
> > > And we can remove the limitations in the rule like FilterJoinRule.
> > >
> > > Best,
> > >
> > > Danny Chan
> > > 在 2019年4月23日 +0800 AM3:21,dev@calcite.apache.org,写道:
> > > >
> > > > If there are limitations, over time we would like to remove those
> > > limitations, but we will probably do it by adding new algorithms, and
> > > therefore new EnumerableXxx classes.
> > >
>

Re: Support non-equi join condition for EnumerableJoin, EnumerableMergeJoin and EnumerableSemiJoin ? Just a discussing..

Posted by Julian Hyde <jh...@apache.org>.
I agree with Stamatis.

Let’s suppose we have a merge join and hash join in enumerable convention. The merge join requires its input to be sorted, the hash join does not. They can both perform equi-join. The merge join can also perform joins like

  FROM orders
  JOIN shipments
  ON shipments.shipDate BETWEEN orders.orderDate
    AND orders.orderDate + INTERVAL ‘7’ DAY

Should we use the same sub-class of RelNode, EnumerableJoin, for both, or should we create sub-classes EnumerableHashJoin and EnumerableMergeJoin?

I think we should do the latter:
* They are clearly different algorithms, and we generate different code for them
* They have different cost models
* They have different input and output traits (the merge join requires sorted inputs and produces a sorted output)
* If we have an equi-join, we might want to generate both options and see how they compare (the merge join might win if one or both of the inputs are sorted), and both algorithms used the EnumerableJoin class we would not be able to distinguish the RelNode instances.

Julian


> On Apr 23, 2019, at 11:06 AM, Stamatis Zampetakis <za...@gmail.com> wrote:
> 
> In the literature, there are many algorithms to perform a join and the vast
> majority of them cannot handle theta conditions (either for performance
> reasons, or because the algorithm was simply not meant for it).
> 
> For the existing EnumerableXXXJoin variants, it may be possible to extend
> the algorithm to handle theta joins without structural changes to the
> algorithm (as I think it is the case in [3]).
> For these cases, I agree with you it does not make sense to have many
> variants (e.g., EnumerableThetaHashJoin seems redundant).
> 
> In the future we may decide to use other hash-based implementations with
> certain applicability restrictions (such as apply only on equijoins) but
> let's not worry too much about that now.
> 
> Regarding the rules, I think it is useful to have different variants.
> Consider for example a system that does not have join processing algorithms
> that can treat theta-joins; pushing non-equi conditions in a join seems
> undesirable in this case.
> 
> For more information regarding join processing, there are a few interesting
> (and old) surveys [4, 5] which outline some of the most typical algorithms
> that are used in relational databases.
> 
> Best,
> Stamatis
> 
> [4] Query evaluation techniques for Large databases (
> https://web.stanford.edu/class/cs346/2014/graefe.pdf)
> [5] Join processing in relational databases (
> https://www.csd.uoc.gr/~hy460/pdf/p63-mishra.pdf)
> 
> 
> On Tue, Apr 23, 2019 at 8:05 AM Yuzhao Chen <yu...@gmail.com> wrote:
> 
>> Thx, Julian
>> 
>> Why not just support non-equi join condition for every physical algorithm,
>> it does not make much sense if we have both HashJoin and a HashTheraJoin,
>> cause a HashThataJoin with empty non-equi join condition is same as  a
>> HashJoin.
>> 
>> And we can remove the limitations in the rule like FilterJoinRule.
>> 
>> Best,
>> 
>> Danny Chan
>> 在 2019年4月23日 +0800 AM3:21,dev@calcite.apache.org,写道:
>>> 
>>> If there are limitations, over time we would like to remove those
>> limitations, but we will probably do it by adding new algorithms, and
>> therefore new EnumerableXxx classes.
>> 


Re: Support non-equi join condition for EnumerableJoin, EnumerableMergeJoin and EnumerableSemiJoin ? Just a discussing..

Posted by Stamatis Zampetakis <za...@gmail.com>.
In the literature, there are many algorithms to perform a join and the vast
majority of them cannot handle theta conditions (either for performance
reasons, or because the algorithm was simply not meant for it).

For the existing EnumerableXXXJoin variants, it may be possible to extend
the algorithm to handle theta joins without structural changes to the
algorithm (as I think it is the case in [3]).
For these cases, I agree with you it does not make sense to have many
variants (e.g., EnumerableThetaHashJoin seems redundant).

In the future we may decide to use other hash-based implementations with
certain applicability restrictions (such as apply only on equijoins) but
let's not worry too much about that now.

Regarding the rules, I think it is useful to have different variants.
Consider for example a system that does not have join processing algorithms
that can treat theta-joins; pushing non-equi conditions in a join seems
undesirable in this case.

For more information regarding join processing, there are a few interesting
(and old) surveys [4, 5] which outline some of the most typical algorithms
that are used in relational databases.

Best,
Stamatis

[4] Query evaluation techniques for Large databases (
https://web.stanford.edu/class/cs346/2014/graefe.pdf)
[5] Join processing in relational databases (
https://www.csd.uoc.gr/~hy460/pdf/p63-mishra.pdf)


On Tue, Apr 23, 2019 at 8:05 AM Yuzhao Chen <yu...@gmail.com> wrote:

> Thx, Julian
>
> Why not just support non-equi join condition for every physical algorithm,
> it does not make much sense if we have both HashJoin and a HashTheraJoin,
> cause a HashThataJoin with empty non-equi join condition is same as  a
> HashJoin.
>
> And we can remove the limitations in the rule like FilterJoinRule.
>
> Best,
>
> Danny Chan
> 在 2019年4月23日 +0800 AM3:21,dev@calcite.apache.org,写道:
> >
> > If there are limitations, over time we would like to remove those
> limitations, but we will probably do it by adding new algorithms, and
> therefore new EnumerableXxx classes.
>

Re: Support non-equi join condition for EnumerableJoin, EnumerableMergeJoin and EnumerableSemiJoin ? Just a discussing..

Posted by Yuzhao Chen <yu...@gmail.com>.
Thx, Julian

Why not just support non-equi join condition for every physical algorithm, it does not make much sense if we have both HashJoin and a HashTheraJoin, cause a HashThataJoin with empty non-equi join condition is same as  a HashJoin.

And we can remove the limitations in the rule like FilterJoinRule.

Best,

Danny Chan
在 2019年4月23日 +0800 AM3:21,dev@calcite.apache.org,写道:
>
> If there are limitations, over time we would like to remove those limitations, but we will probably do it by adding new algorithms, and therefore new EnumerableXxx classes.

Re: Support non-equi join condition for EnumerableJoin, EnumerableMergeJoin and EnumerableSemiJoin ? Just a discussing..

Posted by Julian Hyde <jh...@apache.org>.
The EnumerableXxxx RelNodes are algorithms. To be efficient, an algorithm makes particular assumptions (e.g. the condition is based on equality).

If there are limitations, over time we would like to remove those limitations, but we will probably do it by adding new algorithms, and therefore new EnumerableXxx classes.

In https://issues.apache.org/jira/browse/CALCITE-2969 <https://issues.apache.org/jira/browse/CALCITE-2969> I see you are proposing to rename EnumerableJoin to EnumerableHashJoin, and similar renames. That is good. It makes clear the limitations of the current algorithm, and makes room for alternative algorithms.

Julian


> On Apr 22, 2019, at 6:08 AM, Yuzhao Chen <yu...@gmail.com> wrote:
> 
> EnumerableJoin EnumerableMergeJoin EnumerableSemiJoin all only support equi-join condition[1], and there are some
> rules that match EquiJoin specificly, like FilterJoinRule will not push non-equi join predicates into the join condition[2], so what is the original intention that these join enumetables only have equi-join implementation ?
> 
> Do you think there is necessity to support non-join conditions for these joins ? I just found that there is a PR to support a hash-theta-join for theta join with equi-join condition[3].
> 
> 
> [1] https://github.com/apache/calcite/blob/ee83efd360793ef4201f4cdfc2af8d837b76ca69/linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java#L1125
> [2] https://github.com/apache/calcite/blob/ee83efd360793ef4201f4cdfc2af8d837b76ca69/core/src/main/java/org/apache/calcite/rel/rules/FilterJoinRule.java#L165
> [3] https://github.com/apache/calcite/pull/1156
> 
> Best,
> Danny Chan