You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Julian Hyde <ju...@hydromatic.net> on 2014/08/08 20:06:14 UTC

Synthetic semi-join

I’m looking into using semi-join in Hive. My main focus is where semi-joins are specified in the query (explicitly or via IN/EXISTS) but synthetic semi-joins are interesting.

There is a logical rule, AddRedundantSemiJoinRule. It transforms JoinRel(X, Y) —> JoinRel(SemiJoinRel(X, Y), Y).

This transformation might be worth the effort if X is large, Y is small (i.e. candidate for map-join), and (I’m guessing) the NDV of the join key is small.

Three questions:
1. Have I characterized the sweet spot of the transformation correctly?
2. Are there any benchmark or typical queries where this would be useful?
3. Do we have good enough stats to consider applying this rule?

I’ll capture the results of the discussion in a JIRA.

Julian


Re: Synthetic semi-join

Posted by Vladimir Sitnikov <si...@gmail.com>.
Ok, I got the idea.
I am afraid this is not an everyday approach to join tables, thus we might
need some guards to avoid explosion of the search space.

>when it comes to implementing the semi-join, bloom filters are a good
option, if you can accept an approximate answer.
bloom filters can be a good option even for regular joins.
​
Vladimir

Re: Synthetic semi-join

Posted by Vladimir Sitnikov <si...@gmail.com>.
This reminds me Oracle's partitioned outer join [1]. Note: it has nothing
to do with table/index partitioning.
This _kind_ of join resembles SJR (e.g. it calculates distinct Y first,
then joins), but it even results into something useful.
I've faced a case or two when partitioned outer join was the proper tool.

[1]: http://www.oracle-developer.net/display.php?id=312
​
Vladimir

Re: Synthetic semi-join

Posted by Mostafa Mokhtar <mm...@hortonworks.com>.
This concept is named SJR  (Semi Join reduction) , this paper covers the
concept in detail
http://www-db.in.tum.de/research/publications/conferences/semijoin.pdf.

As Vladimir mentioned SJR analogous to bloom filters but the dimension
table itself is used opposed to using just a bloom filter.

TPC-H queries 17 and 20 are good candidates for semi join reduction.

This feature should definitely be on our roadmap.

Thanks
Mostafa




On Fri, Aug 8, 2014 at 1:04 PM, Julian Hyde <ju...@gmail.com> wrote:

> You’re right that bloom filters are useful. I was just exploring what
> could be done at the logical level; when it comes to implementing the
> semi-join, bloom filters are a good option, if you can accept an
> approximate answer.
>
> Here’s a scenario where it would make sense to transform JoinRel(X, Y) —>
> JoinRel(SemiJoinRel(X, Y), Y). Let’s suppose that Y has a large number of
> rows and columns (i.e. the average row length is large). We can ship the
> set of distinct Y key values to X, semi-join them, then send the filtered X
> rows to Y.
>
> So, SemiJoin(X, Y) has significantly lower I/O cost than Join(X, Y) even
> though it reads the same number of rows from X and Y, because it reads
> fewer columns from Y.
>
> We’ve replaced one shuffle join with two map joins.
>
> Julian

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Synthetic semi-join

Posted by Julian Hyde <ju...@gmail.com>.
You’re right that bloom filters are useful. I was just exploring what could be done at the logical level; when it comes to implementing the semi-join, bloom filters are a good option, if you can accept an approximate answer.

Here’s a scenario where it would make sense to transform JoinRel(X, Y) —> JoinRel(SemiJoinRel(X, Y), Y). Let’s suppose that Y has a large number of rows and columns (i.e. the average row length is large). We can ship the set of distinct Y key values to X, semi-join them, then send the filtered X rows to Y.

So, SemiJoin(X, Y) has significantly lower I/O cost than Join(X, Y) even though it reads the same number of rows from X and Y, because it reads fewer columns from Y.

We’ve replaced one shuffle join with two map joins.

Julian

Re: Synthetic semi-join

Posted by Vladimir Sitnikov <si...@gmail.com>.
>X is large, Y is small (i.e. candidate for map-join)
Why don't you just do a map-join then? Why that map-semijoin would improve
anything?

Is there something specific why you want using SemiJoins?

It looks like you are trying to eliminate some rows before the actual join
starts, aren't you?
I guess Bloom filters are used in this area. You can tune the knob of
memory to cpu (e.g. allocate more memory to Bloom and get less
false-positives).

JoinRel(X, Y) -> JoinRel(BloomBuild(X), BloomFilter(Y))
In other words, you build the Bloom filter while scanning X and use that
filter to eliminate rows from Y.

Vladimir