
Posted to dev@calcite.apache.org by 魏阔 <we...@alibaba-inc.com> on 2017/07/06 06:22:01 UTC

dependent join implementation

Hi all,

Dependent join performance is a huge challenge when processing multi-source joins. Instead of reading all of source A and all of source B and joining them on A.x = B.x, we want to read all of A, then build a set of A.x values that is passed as a criterion when querying B. In cases where A is small and B is large, this can drastically reduce the data retrieved from B, thus greatly speeding up the overall query.

I saw a similar idea implemented in Apache Teiid. Is there a similar rule in Calcite? We want to implement this but haven't thought it through clearly yet. Any suggestions?

Thanks!
shanyao
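
For readers following the thread, here is a minimal sketch of the dependent join described above, written in plain Java with no Calcite or Teiid APIs. The names DependentJoinSketch, SourceB, scanWhereXIn and rowsShipped are all hypothetical; SourceB only simulates a remote source that can evaluate a key filter itself, roughly like "SELECT * FROM B WHERE x IN (...)".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DependentJoinSketch {

    /** Hypothetical stand-in for the large remote source B; each row is {x, payload}. */
    static final class SourceB {
        final List<String[]> rows = new ArrayList<>();
        int rowsShipped = 0; // how many rows B returned to the caller

        /** Simulates a key filter evaluated at the source, not at the caller. */
        List<String[]> scanWhereXIn(Set<String> keys) {
            List<String[]> out = new ArrayList<>();
            for (String[] row : rows) {
                if (keys.contains(row[0])) {
                    out.add(row);
                }
            }
            rowsShipped += out.size();
            return out;
        }
    }

    /** Joins A and B on x, passing the distinct keys of A to B as a filter. */
    static List<String> dependentJoin(List<String[]> a, SourceB b) {
        // Step 1: read all of A and collect its distinct join keys.
        Set<String> keys = new LinkedHashSet<>();
        for (String[] ra : a) {
            keys.add(ra[0]);
        }
        // Step 2: fetch only the rows of B whose key is in that set, indexed by key.
        Map<String, List<String[]>> bByKey = new HashMap<>();
        for (String[] rb : b.scanWhereXIn(keys)) {
            bByKey.computeIfAbsent(rb[0], k -> new ArrayList<>()).add(rb);
        }
        // Step 3: ordinary hash join of A against the reduced B.
        List<String> joined = new ArrayList<>();
        for (String[] ra : a) {
            for (String[] rb : bByKey.getOrDefault(ra[0], Collections.emptyList())) {
                joined.add(ra[0] + ":" + ra[1] + ":" + rb[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        SourceB b = new SourceB();
        for (int i = 0; i < 1000; i++) {
            b.rows.add(new String[] {"k" + i, "b" + i});
        }
        List<String[]> a = new ArrayList<>();
        a.add(new String[] {"k3", "a3"});
        a.add(new String[] {"k7", "a7"});
        System.out.println(dependentJoin(a, b) + ", rows shipped from B: " + b.rowsShipped);
    }
}
```

In a real system the key set would have to be turned into whatever filter the target source accepts, and a very large key set may need batching; neither is covered by this sketch.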

Re: dependent join implementation

Posted by Julian Hyde <jh...@apache.org>.

I think you’re talking about this: https://issues.apache.org/jira/browse/CALCITE-468

The essence of the trick is to rewrite “A join B” to “A join (B semi-join A’)” where A’ is a safe sub-set of A, perhaps “select distinct id from A”, and is much smaller than A. It’s OK for A’ to have a few false positives (i.e. keys that do not occur in A) and therefore Bloom filters are a good option.
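
The rewrite can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Calcite's implementation or API: the class BloomSemiJoinSketch, the tiny Bloom filter, its hash mixing and the 1024-bit size are all made up for the example. Rows are int arrays whose first column is the join key. The final exact join against A discards any rows that passed the filter as false positives.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BloomSemiJoinSketch {

    /** Tiny Bloom filter with two hash probes; size and mixing are illustrative, not tuned. */
    static final class Bloom {
        private final BitSet bits;
        private final int size;

        Bloom(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        private int slot(int key, int seed) {
            int h = key * 0x9E3779B1 + seed * 0x85EBCA6B;
            h ^= h >>> 16;
            return Math.floorMod(h, size);
        }

        void add(int key) {
            bits.set(slot(key, 1));
            bits.set(slot(key, 2));
        }

        /** May return true for a key that was never added, never false for one that was. */
        boolean mightContain(int key) {
            return bits.get(slot(key, 1)) && bits.get(slot(key, 2));
        }
    }

    /** Builds A' as a Bloom filter over the join keys of A (column 0). */
    static Bloom buildAPrime(List<int[]> a, int bitCount) {
        Bloom bloom = new Bloom(bitCount);
        for (int[] ra : a) {
            bloom.add(ra[0]);
        }
        return bloom;
    }

    /** "B semi-join A'": keeps the rows of B whose key might occur in A. */
    static List<int[]> reduce(List<int[]> b, Bloom aPrime) {
        List<int[]> kept = new ArrayList<>();
        for (int[] rb : b) {
            if (aPrime.mightContain(rb[0])) {
                kept.add(rb);
            }
        }
        return kept;
    }

    /** "A join (B semi-join A')"; output rows are {key, aValue, bValue}. */
    static List<int[]> join(List<int[]> a, List<int[]> b) {
        List<int[]> reducedB = reduce(b, buildAPrime(a, 1024));
        Map<Integer, List<int[]>> bByKey = new HashMap<>();
        for (int[] rb : reducedB) {
            bByKey.computeIfAbsent(rb[0], k -> new ArrayList<>()).add(rb);
        }
        List<int[]> out = new ArrayList<>();
        for (int[] ra : a) {
            List<int[]> matches = bByKey.get(ra[0]);
            if (matches == null) {
                continue;
            }
            for (int[] rb : matches) {
                out.add(new int[] {ra[0], ra[1], rb[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> a = new ArrayList<>();
        a.add(new int[] {5, 50});
        a.add(new int[] {9, 90});
        List<int[]> b = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            b.add(new int[] {i, i * 2});
        }
        System.out.println("rows of B kept by the filter: " + reduce(b, buildAPrime(a, 1024)).size());
        System.out.println("joined rows: " + join(a, b).size());
    }
}
```

The appeal of a Bloom filter here is that A’ stays a fixed-size bit array no matter how many keys A has, so it is cheap to ship to the source holding B.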

Julian
 
