You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Ji Zhang <zh...@gmail.com> on 2013/10/17 06:52:50 UTC

FULL OUTER JOIN Two Small Tables More Efficiently in Hive?

Hi All,

I have two tables. One has 2,000,000 rows (150M in 6 files), and the
other has 5,000 rows (400K in 1 file). The join is (approximately) a
full outer join, since the city_id field has only 100 distinct values:

CREATE TABLE prop_total AS
SELECT *
FROM prop_1 a
JOIN prop_2 b
ON a.city_id = b.city_id

As a result, it will produce 150,000,000 rows, about 20G.

But during the execution, Hive only opens 4 mappers. It is very slow,
since in my (tiny) cluster, copying 20G with 100 mappers is usually
under 3 minutes.

I tried to set mapred.max.split.size to 15,000,000, and the mappers
increase to 6, which I want to increase to 100.

Is there a way to achieve this?

One thing I can think of is, say, split up the prop_1 table into 100
small files, and make hive open 100 mappers. But it seems.. odd.

Any suggestions will be appreciated.

Thanks in advance,
Jerry