Posted to dev@hive.apache.org by Ido Hadanny <id...@gmail.com> on 2011/08/05 10:24:50 UTC

Is a collocated join (a la Netezza) theoretically possible in Hive?

In Netezza, when you join tables that are distributed on the same key and
use those key columns in the join condition, each SPU (machine) works
100% independently of the others (see
nz-interview<http://www.folkstalk.com/2011/06/netezza-interview-questions-part-2.html>).

In Hive, there's the bucketed map
join<https://issues.apache.org/jira/browse/HIVE-917>,
but distributing the files that represent the tables across datanodes is the
responsibility of HDFS; it is not done according to the Hive CLUSTERED BY key!

So suppose I have two tables, CLUSTERED BY the same key, and I join on that
key. Can Hive get a guarantee from HDFS that matching buckets will sit on
the same node? Or will it always have to move the matching bucket of the
small table to the datanode containing the big table's bucket?
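
To make the question concrete, here is a minimal Python sketch of a
bucket-by-bucket join. It is only an illustration, not Hive's code: the
names (bucket_of, bucketize, bucketed_join), the bucket count, and the
simple hash-mod rule for integer keys are all assumptions made for the
example. It shows that bucket i of one table only ever needs bucket i of
the other, so the join logic itself is independent per bucket; whether the
two bucket files end up on the same node is a separate placement decision
that this sketch does not model.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; both tables are assumed to use the same count


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Assumed hash-mod rule for non-negative integer keys (illustrative only).
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    # Split (key, value) rows into buckets, as CLUSTERED BY would split files.
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(big, small, num_buckets=NUM_BUCKETS):
    # Join bucket i of `big` only against bucket i of `small`.
    big_buckets = bucketize(big, num_buckets)
    small_buckets = bucketize(small, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Build an in-memory lookup from the small table's matching bucket.
        lookup = defaultdict(list)
        for key, value in small_buckets.get(b, []):
            lookup[key].append(value)
        # Stream the big table's bucket and probe the lookup.
        for key, value in big_buckets.get(b, []):
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined


big = [(1, "a"), (2, "b"), (5, "c"), (6, "d")]
small = [(1, "x"), (5, "y"), (7, "z")]
result = bucketed_join(big, small)
```

If matching buckets were guaranteed to be on the same node, each iteration
of the outer loop could run locally with no data movement; without that
guarantee, the small table's bucket has to be shipped to wherever the big
table's bucket is read.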

Thanks, ido