Posted to user@hive.apache.org by Guillermo Ortiz <ko...@gmail.com> on 2015/03/12 11:27:08 UTC

Hive map-join

Hello,

I'm running a join of two tables:
- table1 is 130 GB
- table2 is 1.5 GB

In HDFS, table1 is a single text file and table2 is ten files. I'd
like to run a map-join that loads table2 into memory.

use esp;
set hive.auto.convert.join=true;
-- set hive.auto.convert.join.noconditionaltask = true;
-- I tried this one to force the map-join, but I don't think I know
-- how to use it:
-- set hive.auto.convert.join.noconditionaltask.size = 10000000000;

-- The MAPJOIN hint shouldn't be necessary, but I have tried with and without it.
SELECT /*+ MAPJOIN(table2) */ DISTINCT
t1.c1,
t1.c2,
t2.c3,
t2.c4
FROM table2 t1
RIGHT JOIN table1 t2
ON (t1.c1 = t2.c3)
AND (t1.c5 = t2.c5)
WHERE t2.xx = 'XX'
LIMIT 10;

This query creates 11 map tasks. Ten of them take about 15 seconds each
and one takes 2 hours, so I guess that single map task is reading the
whole 130 GB file to do the join. Why doesn't Hive split that file?
What am I doing wrong with this query?
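
One way to check which join Hive actually plans is to prefix the query
with EXPLAIN. This is only a minimal sketch, assuming the same database,
tables and columns as above. If the conversion to a map-join happened,
the plan should show a Map Join Operator inside a map stage; a plain
Join Operator in a reduce stage would suggest that Hive fell back to a
common (shuffle) join.

```sql
use esp;
set hive.auto.convert.join=true;

-- EXPLAIN prints the execution plan without running the job.
EXPLAIN
SELECT DISTINCT
t1.c1,
t1.c2,
t2.c3,
t2.c4
FROM table2 t1
RIGHT JOIN table1 t2
ON (t1.c1 = t2.c3)
AND (t1.c5 = t2.c5)
WHERE t2.xx = 'XX'
LIMIT 10;
```

Comparing the plan with and without the MAPJOIN hint should also show
whether the hint has any effect on this query.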