You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by 贺小令 <go...@163.com> on 2017/11/14 13:14:51 UTC

why does SemiJoin only support equi join condition in Calcite ?

Hello, I am trying to convert IN or EXISTS to SemiJoin based on Calcite. (such as executing tpch-21.sql)However, SemiJoin only supports  equi join condition. I also find that SemiJoin extends from EquiJoin.I wondered why does SemiJoin only support equi join condition in Calcite? As we know, "A “semi-join” between two tables returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be returned."  (ref http://dbspecialists.com/speeding-queries-semi-joins-anti-joins-oracle-evaluates-exists-not-exists-not)


e.g.  select * from l where l.a in (select r.c from r where l.b > r.d)
According to the definition, the above SQL can be converted to semi-join. There is an equi join condition: l.a = r.c, which can be used as shuffle key on distributed environment. There is a non-equi join conditions: lb. > r.d, so it can not be converted to SemiJoin in Calcite now.


Does SemiJoin support equi join condition temporarily? Or is there some reasons we must do as that?


thanks a lot,
godfreyhe










 

Re: why does SemiJoin only support equi join condition in Calcite ?

Posted by Julian Hyde <jh...@apache.org>.
You are talking about what I would call “theta semi-join” (as opposed to the “equi-semi-join” that we support currently). But you are correct; there is no reason in principle why the predicate has to be “=“.

IN translates naturally to an equi-SemiJoin; your query "select * from l where l.a in (select r.c from r where l.b > r.d)” would start off as an equi-SemiJoin whose right-hand input is a filter, but we could pull the filter condition into the SemiJoin, in which case it would become a theta-SemiJoin because the condition is no longer a simple “=“.

A simple way to implement semi-join is to apply an Aggregate the right-hand input (to eliminate duplicates) followed by a regular Join, followed by a Project. You can’t use that simple approach for theta-SemiJoin, because you don’t know what should be the key for the Aggregate. We could of course devise another means. But personally, I find it more difficult to reason about ThetaSemiJoin because it doesn’t have that convenient identity.

SemiJoin is not a fundamental relational operator; you can accomplish it using other operators. So we have it only because it makes things easier; some rules can be applied to semi-joins that do not apply to regular joins, and we can make better estimates for statistics. If it makes things easier, we could also introduce a ThetaSemiJoin, and we could revisit rules that apply to Join and SemiJoin and see whether they apply to ThetaSemiJoin.

Julian

> On Nov 14, 2017, at 5:14 AM, 贺小令 <go...@163.com> wrote:
> 
> Hello, I am trying to convert IN or EXISTS to SemiJoin based on Calcite. (such as executing tpch-21.sql)However, SemiJoin only supports  equi join condition. I also find that SemiJoin extends from EquiJoin.I wondered why does SemiJoin only support equi join condition in Calcite? As we know, "A “semi-join” between two tables returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be returned."  (ref http://dbspecialists.com/speeding-queries-semi-joins-anti-joins-oracle-evaluates-exists-not-exists-not)
> 
> 
> e.g.  select * from l where l.a in (select r.c from r where l.b > r.d)
> According to the definition, the above SQL can be converted to semi-join. There is an equi join condition: l.a = r.c, which can be used as shuffle key on distributed environment. There is a non-equi join conditions: lb. > r.d, so it can not be converted to SemiJoin in Calcite now.
> 
> 
> Does SemiJoin support equi join condition temporarily? Or is there some reasons we must do as that?
> 
> 
> thanks a lot,
> godfreyhe
> 
> 
> 
> 
> 
> 
> 
> 
> 
>