Posted to user@hive.apache.org by Abbass MAROUNI <ab...@virtualscale.fr> on 2014/05/21 14:26:23 UTC

Sort Merge Bucket Joins

Hi all,

I'm trying to understand the different Hive join optimizations. I understand
that the goal is to limit the shuffling of key-value pairs from mappers to
reducers, but I cannot grasp the idea behind SMB (sort-merge bucket) joins.

For example:

Table A with four columns (user_id, col2, col3, col4), bucketed into 4
buckets on user_id.
Table B with three columns (user_id, col2, col3), bucketed into 4 buckets on
user_id.

Table A is very large compared to Table B.
The 4 buckets of Table A are files in HDFS and each is made up of multiple
blocks.
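
If it helps make the question concrete, here is a minimal sketch of how I imagine the two tables would be declared (table/column names are just the ones from my example above; as far as I know, SMB joins additionally require each bucket to be sorted on the join key, hence the SORTED BY clauses):

```sql
-- Hypothetical DDL matching the example; SMB joins need the data
-- inside each bucket to be sorted on the join key (user_id).
CREATE TABLE table_a (user_id BIGINT, col2 STRING, col3 STRING, col4 STRING)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 4 BUCKETS;

CREATE TABLE table_b (user_id BIGINT, col2 STRING, col3 STRING)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 4 BUCKETS;

-- On pre-2.0 Hive, make inserts honor the bucketing/sorting metadata:
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;
```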

When I join Table A and Table B on user_id, Hive will launch an MR job where:
Each bucket of Table A is read by multiple mappers (according to the number
of blocks making up that bucket/file).
Each of those mappers fetches the corresponding bucket of Table B (probably
also made up of multiple blocks; a network read?) and looks for matching
records.
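
For reference, this is the join I have in mind, together with the settings I believe enable the sort-merge bucket path (setting names as I understand them from the Hive docs for the 0.13-era releases; please correct me if these are wrong):

```sql
-- Settings that are supposed to enable the sort-merge bucket join path:
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- The join itself, on the common bucketing/sort key:
SELECT a.user_id, a.col2, b.col3
FROM table_a a
JOIN table_b b ON a.user_id = b.user_id;
```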

Did I get it right?

What happens when each bucket of Table B is made up of multiple blocks?

Best Regards,