Posted to user@hive.apache.org by Abbass MAROUNI <ab...@virtualscale.fr> on 2014/05/21 14:26:23 UTC
Sort Merge Bucket Joins
Hi all,
I'm trying to understand the different Hive join optimizations. I
understand that the general goal is to limit the shuffling of key/value
pairs from mappers to reducers, but I cannot grasp the idea behind SMB
(sort-merge-bucket) joins.
For example:
Table A with four columns (user_id, col2, col3, col4), bucketed into 4
buckets on user_id.
Table B with three columns (user_id, col2, col3), bucketed into 4
buckets on user_id.
Table A is very large compared to Table B.
The 4 buckets of Table A are files in HDFS and each is made up of multiple
blocks.
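In Hive DDL, I picture the two tables being declared roughly like this
(table and column names are just my example; the SORTED BY clause is my
assumption, since from what I've read SMB joins also require the data
inside each bucket to be sorted on the join key):

```sql
-- Hypothetical DDL for the example tables; SMB joins need the buckets
-- to be sorted on the join key as well as bucketed on it.
CREATE TABLE table_a (user_id INT, col2 STRING, col3 STRING, col4 STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 4 BUCKETS;

CREATE TABLE table_b (user_id INT, col2 STRING, col3 STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 4 BUCKETS;

-- Needed when populating the tables so that rows actually land in the
-- right bucket files (older Hive versions do not enforce this by default).
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;
```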
When I join Table A and Table B on user_id, Hive will launch an MR job
where:
- Each bucket of Table A is read by multiple mappers (as many as there
are blocks making up that bucket/file).
- Each of those mappers fetches the corresponding bucket of Table B
(probably also made up of multiple blocks, so a network read?) and looks
for matching records.
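For completeness, here is the kind of setup I have in mind for the join
itself (setting names as I found them in the Hive documentation, so
please correct me if I got them wrong):

```sql
-- Ask Hive to use the bucket map-join / sort-merge path for this join.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT a.user_id, a.col2, b.col2
FROM table_a a
JOIN table_b b ON a.user_id = b.user_id;
```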
Did I get that right?
And what happens when each bucket of Table B is made up of multiple
blocks?
Best Regards,