You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Bruce Bian <we...@gmail.com> on 2012/05/22 17:07:38 UTC

Condition for doing a sort merge bucket map join

Hi ,
I've got 7 large tables to join(each ~10G in size) into one table, all with
the same* 2 *join keys, I've read some documents on sort merge bucket map
join, but failed to fire that.
I've bucketed all the 7 tables into 20 buckets and sorted  by one of the
join key,
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set
hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
Set the above parameters while doing the join.
What else do I miss? Do I have to bucket on both of the join keys(I'm
currently trying this)? And does each bucket file has to be smaller than
one HDFS block?
Thanks a lot.

Re: Condition for doing a sort merge bucket map join

Posted by ameet chaubal <am...@yahoo.com>.

you should have the bucket columns = join keys = sort columns. When this condition is true, I was able to make SMB work.
Even if one of the join keys is a partition (i.e. cannot be part of clustering/sorting set), it did not work for me.
So, I'd say just check that all the 7 table joins use the same join keys which are all clustered/sorted.
 
Sincerely,


Ameet


________________________________
 From: Bruce Bian <we...@gmail.com>
To: user@hive.apache.org 
Sent: Tuesday, May 22, 2012 11:07 AM
Subject: Condition for doing a sort merge bucket map join
 

Hi ,
I've got 7 large tables to join(each ~10G in size) into one table, all with the same 2 join keys, I've read some documents on sort merge bucket map join, but failed to fire that.
I've bucketed all the 7 tables into 20 buckets and sorted  by one of the join key,
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; 
Set the above parameters while doing the join.
What else do I miss? Do I have to bucket on both of the join keys(I'm currently trying this)? And does each bucket file has to be smaller than one HDFS block?
Thanks a lot.

Re: Condition for doing a sort merge bucket map join

Posted by Mark Grover <mg...@oanda.com>.

Hi Bruce,
Instead of joining 7 tables in the query, can you please start off with 2 tables and see if that works? If it doesn't, feel free to paste your table definitions and join query along with any properties you are setting and folks on the mailing list can take a jab at it.

Mark

----- Original Message -----
From: "Bruce Bian" <we...@gmail.com>
To: user@hive.apache.org
Sent: Tuesday, May 22, 2012 11:07:38 AM
Subject: Condition for doing a sort merge bucket map join

Hi , 
I've got 7 large tables to join(each ~10G in size) into one table, all with the same 2 join keys, I've read some documents on sort merge bucket map join, but failed to fire that. 
I've bucketed all the 7 tables into 20 buckets and sorted by one of the join key, 
set hive.optimize.bucketmapjoin = true; 
set hive.optimize.bucketmapjoin.sortedmerge = true; 
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; 
Set the above parameters while doing the join. 
What else do I miss? Do I have to bucket on both of the join keys(I'm currently trying this)? And does each bucket file has to be smaller than one HDFS block? 
Thanks a lot.